Towards automated learning from software development issues: Analyzing open source project repositories using natural language processing and machine learning techniques
Linnaeus University, Faculty of Technology, Department of Media Technology.
2017 (English). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis
Abstract [en]

This thesis investigates how natural language processing and machine learning techniques can be used to analyze programming issues found in open source project repositories hosted on GitHub. The research examines issues gathered from a number of JavaScript repositories, based on their user-generated textual descriptions. The primary goal of the study is to explore how natural language processing and machine learning methods can facilitate the identification and categorization of distinct issue types. The research also investigates how these same techniques can support users in searching for potential solutions to these issues.

For this purpose, an initial proof-of-concept implementation is developed, which collects over 30 000 JavaScript issues from over 100 GitHub repositories. The system then extracts the titles of the issues and cleans and processes the data before supplying it to an unsupervised clustering model, which tries to uncover similarities and patterns within the examined dataset. In addition, the main system is supplemented by a dedicated web application prototype, which enables users to use the underlying machine learning model to find solutions to their programming-related issues.
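
As a rough illustration of the pipeline described above (title extraction, cleaning, unsupervised clustering, and matching a new issue to a category), the following Python sketch clusters a handful of made-up issue titles. The abstract does not name the text representation or the clustering algorithm, so TF-IDF weighting, cosine similarity, and a small spherical k-means with farthest-first seeding are assumptions chosen for illustration, not the thesis's actual method. The sample titles, the stopword list, and all function names are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list and sample titles, for illustration only.
STOPWORDS = {"a", "an", "the", "is", "in", "on", "to", "of", "when", "for", "and"}
TITLES = [
    "Memory leak when rendering large lists",
    "TypeError: cannot read property of undefined in parser",
    "Memory leak in event listener cleanup",
    "Parser throws TypeError on empty input",
    "Docs: typo in installation guide",
    "Installation guide missing in docs",
]

def tokenize(title):
    """Lowercase a title, keep alphabetic tokens, drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", title.lower()) if t not in STOPWORDS]

def fit_idf(docs):
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

def normalize(weights):
    """Scale a sparse vector (dict) to unit length."""
    norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
    return {t: w / norm for t, w in weights.items()}

def vectorize(doc, idf):
    """Unit-length TF-IDF vector; terms unseen during fitting are ignored."""
    tf = Counter(t for t in doc if t in idf)
    return normalize({t: c * idf[t] for t, c in tf.items()})

def cosine(a, b):
    """Dot product of two sparse vectors (equals cosine for unit vectors)."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def farthest_first_seeds(vectors, k):
    """Deterministic seeding: repeatedly pick the least similar document."""
    seeds = [0]
    while len(seeds) < k:
        rest = [i for i in range(len(vectors)) if i not in seeds]
        seeds.append(min(rest, key=lambda i: max(cosine(vectors[i], vectors[s])
                                                 for s in seeds)))
    return seeds

def kmeans(vectors, k, iterations=10):
    """Spherical k-means: assign by cosine similarity, re-average, repeat."""
    centroids = [vectors[i] for i in farthest_first_seeds(vectors, k)]
    labels = []
    for _ in range(iterations):
        new_labels = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                      for v in vectors]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                total = Counter()
                for v in members:
                    for t, w in v.items():
                        total[t] += w
                centroids[c] = normalize(total)
    return labels, centroids

docs = [tokenize(t) for t in TITLES]
idf = fit_idf(docs)
vectors = [vectorize(d, idf) for d in docs]
labels, centroids = kmeans(vectors, k=3)

def closest_cluster(query):
    """Map a new free-text issue title to the most similar cluster."""
    q = vectorize(tokenize(query), idf)
    return max(range(len(centroids)), key=lambda c: cosine(q, centroids[c]))

for title, label in zip(TITLES, labels):
    print(label, title)
```

Calling `closest_cluster("memory leak in listener")` returns the index of the cluster whose centroid is most similar to the query, which is one simple way a search prototype could route a user's issue to previously seen issues of the same kind. A real system would need a much larger corpus and a principled choice of the number of clusters.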

The implementation is evaluated through a number of measures. First, the trained clustering model is assessed by two independent groups of external reviewers, one of fellow researchers and one of practitioners in the software industry, to determine whether the resulting categories contain distinct types of issues. Second, to find out whether the system can facilitate the search for issue solutions, the web application prototype is tested in a series of user sessions with participants who are representative of the main target group that can benefit most from such a system and who have a mixture of practical and theoretical backgrounds.

The results of this research demonstrate that the proposed solution can effectively categorize issues according to their type, based solely on the user-generated free-text title. This provides strong evidence that natural language processing and machine learning techniques can be used for analyzing issues and automating the overall learning process. However, the study was unable to determine conclusively whether these same methods can aid the search for issue solutions. Nevertheless, the thesis provides a detailed account of how this problem was addressed and can therefore serve as a basis for future research.

Place, publisher, year, edition, pages
2017. 177 p.
Keyword [en]
machine learning, natural language processing, document clustering, issue categorization, issue classification, issue analysis, solution suggestions, open source, GitHub, project repositories
National Category
Media and Communication Technology
Identifiers
URN: urn:nbn:se:lnu:diva-66834
OAI: oai:DiVA.org:lnu-66834
DiVA: diva2:1120721
Subject / course
Media Technology
Educational program
Social Media and Web Technologies, Master Programme, 120 credits
Presentation
2017-05-30, D2272, Växjö, 10:00 (English)
Supervisors
Examiners
Available from: 2017-07-10. Created: 2017-07-07. Last updated: 2017-07-10. Bibliographically approved.

Open Access in DiVA

fulltext (4089 kB), 47 downloads
File information
File name: FULLTEXT01.pdf
File size: 4089 kB
Checksum (SHA-512): f0c496bfd9a511288553d0c3471f8c719d231fc2e36528b925f6871594f8e65fd2b7cc06d437397b80533745c7e76e71cb1fdb27214674bf6b19fe59261e15b4
Type: fulltext
Mimetype: application/pdf

By author/editor
Salov, Aleksandar
Total: 47 downloads
The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.

Total: 86 hits