Investigating the Practicality of Just-in-time Defect Prediction with Semi-supervised Learning on Industrial Commit Data
KTH, School of Electrical Engineering and Computer Science (EECS).
2019 (English). Independent thesis, Advanced level (Degree of Master, Two Years), 20 credits / 30 HE credits. Student thesis.
Abstract [en]

Some of the challenges faced in Just-in-time (JIT) defect prediction are achieving high-performing models and obtaining large quantities of labelled data. There is also only a limited number of studies that actually test the effectiveness of software defect prediction models in practice.

In this thesis, the performance of five notable classification algorithms is investigated when applied to Just-in-time defect prediction. The utility of semi-supervised techniques such as the self-training algorithm is also explored. To test the viability of JIT defect prediction models in practice, a case study was set up at King, a game development company. Finally, to better understand how software developers at King identify and resolve bugs, a series of interviews was conducted.

The investigation found that ensemble learning models such as XGBoost can outperform deep learning approaches such as Deeper. The self-training algorithm can be used to train on both labelled and unlabelled data and still achieve performance similar to purely supervised approaches. The case study found that although a JIT defect prediction model based on random forests could achieve better performance than a random model, there is still a large discrepancy between the cross-validation performance and the performance in practice. Finally, the interviews found that developers rely on inspecting builds, manual debugging and version control tools to identify bugs. Additionally, the interviews found that risky code tends to have high dependency on other code, is difficult to comprehend and does not follow proper coding practices.
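The self-training idea mentioned above can be summarised as: fit a classifier on the labelled data, pseudo-label the unlabelled examples it is most confident about, add them to the training set, and repeat. The sketch below illustrates this loop with a toy one-dimensional threshold classifier; the function names and data are illustrative only, and the thesis itself evaluated real learners (e.g. random forests, XGBoost) on commit-level features.

```python
import math

def fit_threshold(xs, ys):
    """Toy base learner: split point is the midpoint between class means."""
    mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / max(1, ys.count(0))
    mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / max(1, ys.count(1))
    return (mean0 + mean1) / 2

def predict_proba(t, x):
    """Confidence that x belongs to class 1, via distance from the threshold."""
    return 1 / (1 + math.exp(-(x - t)))

def self_train(labeled, unlabeled, confidence=0.9, max_rounds=10):
    """Self-training loop: absorb high-confidence pseudo-labels each round."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        t = fit_threshold([x for x, _ in labeled], [y for _, y in labeled])
        newly_labeled, remaining = [], []
        for x in unlabeled:
            p = predict_proba(t, x)
            if p >= confidence:            # confident "defective"
                newly_labeled.append((x, 1))
            elif p <= 1 - confidence:      # confident "clean"
                newly_labeled.append((x, 0))
            else:                          # not confident; keep for next round
                remaining.append(x)
        if not newly_labeled:
            break                          # no progress; stop early
        labeled += newly_labeled
        unlabeled = remaining
    # Final model is trained on the enlarged labelled set.
    return fit_threshold([x for x, _ in labeled], [y for _, y in labeled])

# Small labelled seed plus an unlabelled pool the loop can absorb.
seed = [(0.0, 0), (1.0, 0), (9.0, 1), (10.0, 1)]
pool = [0.5, 9.5, -1.0, 11.0]
t = self_train(seed, pool)
print(predict_proba(t, 10.0) > 0.5)  # prints True
```

In practice the base learner would be a richer classifier and the confidence threshold a tuned hyperparameter; the loop structure, however, is the essence of self-training.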

Abstract [sv]

Some challenges in Just-in-time defect prediction are achieving models with a high ability to identify risky commits and producing large quantities of labelled data. A further challenge is that only a limited number of studies test the effectiveness of software defect prediction models in practice.

In this thesis, the results of five different classification algorithms applied to Just-in-time (JIT) defect prediction are investigated. The use of semi-supervised methods such as the self-training algorithm is also explored. To test the viability of JIT models in practice, a case study was conducted at King, a game development company. To gain a better understanding of how programmers at King identify and resolve bugs, several interviews were conducted.

The investigation showed that ensemble learning models such as XGBoost can yield better results than deep learning methods such as Deeper. The self-training algorithm can be used to train on labelled and unlabelled data and still achieve results similar to supervised learning methods. The case study shows that although a JIT model based on random forests could achieve better performance than a random model, there remains a large discrepancy between the performance shown in cross-validation and that shown in practice. The interviews revealed that programmers currently rely on inspections, manual debugging and version control tools to identify defects. Finally, it emerged that risky code often has many dependencies on other code, is difficult to understand and often does not follow proper coding practices.

Place, publisher, year, edition, pages
2019, p. 50
Series
TRITA-EECS-EX ; 2019:433
National Category
Computer and Information Sciences
Identifiers
URN: urn:nbn:se:kth:diva-254955
OAI: oai:DiVA.org:kth-254955
DiVA, id: diva2:1336751
Subject / course
Computer Science
Educational program
Master of Science - Computer Science
Supervisors
Examiners
Available from: 2019-07-10. Created: 2019-07-10. Last updated: 2019-07-10. Bibliographically approved.

Open Access in DiVA

fulltext (902 kB)
File name: FULLTEXT02.pdf
File size: 902 kB
Checksum (SHA-512): 40ec812c75a70fedbe6fdd443c2f8c65a02bb62e311336eee94fc1e42e8166864d2a43c49e66901a5a44fb4cb18a0b82d95a6f54b8b2806c708d131747c3b50b
Type: fulltext
Mimetype: application/pdf

