Mining Git Repositories: An introduction to repository mining
Independent thesis Basic level (degree of Bachelor), 10 credits / 15 HE credits
Student thesis
When performing an analysis of the evolution of software quality and software metrics, there is a need to get access to as many versions of the source code as possible. There is a lack of research on how data or source code can be extracted from the source control management system Git. This thesis explores different possibilities to resolve this problem.
Lately, there has been a boom in usage of the version control system Git. GitHub alone hosts about 6,100,000 projects. Some well-known projects and organizations that use Git are Linux, WordPress, and Facebook. Even with these figures and clients, there are very few tools able to perform data extraction from Git repositories. A pre-study showed that there is a lack of standardization on how to share mining results and the methods used to obtain them.
There are several tools available for older version control systems, such as Concurrent Versions System (CVS), but few for Git. The examined repository mining applications for Git are either poorly documented or were built very specifically for the project for which they were designed.
This thesis compiles a list of general issues encountered when using repository mining as a tool for data gathering. A selection of existing repository mining tools was evaluated against a set of prerequisite criteria. The end result of this evaluation is the creation of a new repository mining tool called Doris. This tool also includes a small code metrics analysis library to show how it can be extended.
Place, publisher, year, edition, pages
2013, 28 p.
repository mining, msr, git, quality analysis, version control system, vcs, source control management, scm, data mining, data extraction
Identifiers
URN: urn:nbn:se:lnu:diva-27742
OAI: oai:DiVA.org:lnu-27742
DiVA: diva2:638844
2013-06-03, 18:35 (English)
Toll, Daniel, Lecturer
Hedenborg, Mathias, Doctor