Automatic Source Code Classification: Classifying Source Code for a Case-Based Reasoning System
Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits
Student thesis
This work investigated the possibility of classifying Java source code into cases for a case-based reasoning system. Case-based reasoning is a problem-solving method in Artificial Intelligence that uses knowledge of previously solved problems to solve new problems. A case in case-based reasoning consists of two parts: the problem part and the solution part. The problem part describes a problem that needs to be solved, and the solution part describes how this problem was solved. In this work, the problem is described as a Java source file using words that describe the content of the source file, and the solution is a classification of the source file along with the source code. To classify Java source code, a classification system was developed. It consists of four analyzers: a type filter, a documentation analyzer, a syntactic analyzer and a semantic analyzer. The type filter determines whether a Java source file contains a class or an interface. The documentation analyzer determines the level of documentation in a source file to assess the usefulness of the file. The syntactic analyzer extracts statistics from the source code to be used for similarity, and the semantic analyzer extracts semantics from the source code. The finished classification system is structured as a kd-tree, where the leaf nodes contain the classified source files, i.e. the cases. Furthermore, a vocabulary was developed to contain the domain knowledge about the Java language. The resulting kd-tree was found to be imbalanced when tested, as the majority of the source files analyzed were placed in the left-most leaf nodes. The conclusion was that using documentation as part of the classification made the tree imbalanced, and thus another approach has to be found. The reason is that source code is not documented to such an extent that it would be useful for this purpose.
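The two-part case structure and the first two analyzers described in the abstract can be illustrated with a minimal Java sketch. It is not the thesis implementation: the class names `SourceCase` and `AnalyzerSketch`, the regular expression used as a type filter, the comment-line heuristic, and the rule that sparsely documented files go to the left branch are all assumptions made for illustration.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical case structure: problem part plus solution part.
final class SourceCase {
    final List<String> problemWords; // problem part: words describing the file
    final String classification;     // solution part: assigned classification
    final String sourceCode;         // solution part: the source code itself

    SourceCase(List<String> problemWords, String classification, String sourceCode) {
        this.problemWords = problemWords;
        this.classification = classification;
        this.sourceCode = sourceCode;
    }
}

final class AnalyzerSketch {
    // Assumption: a keyword search stands in for real parsing of the file.
    private static final Pattern TYPE_DECL =
            Pattern.compile("\\b(class|interface)\\s+\\w+");

    // Type filter: reports the first declared type kind found in the source.
    static String typeOf(String source) {
        Matcher m = TYPE_DECL.matcher(source);
        return m.find() ? m.group(1) : "unknown";
    }

    // Documentation analyzer: fraction of non-blank lines that look like comments.
    static double documentationRatio(String source) {
        int commentLines = 0;
        int nonBlankLines = 0;
        for (String line : source.split("\n")) {
            String t = line.trim();
            if (t.isEmpty()) {
                continue;
            }
            nonBlankLines++;
            if (t.startsWith("//") || t.startsWith("/*") || t.startsWith("*")) {
                commentLines++;
            }
        }
        return nonBlankLines == 0 ? 0.0 : (double) commentLines / nonBlankLines;
    }

    // Assumed split rule: files below the threshold go to the left branch.
    static String branchFor(String source, double threshold) {
        return documentationRatio(source) < threshold ? "left" : "right";
    }
}
```

Under these assumptions, a corpus in which most files carry few comments would send most files down the same branch at every documentation-based split, which is one way the imbalance reported in the abstract could arise.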
Place, publisher, year, edition, pages
2015. , 70 p.
Artificial Intelligence, Case-Based Reasoning, CBR, Vocabulary, Classification, Similarity measure, Distance measure, Java, C++
Identifiers
URN: urn:nbn:se:miun:diva-25519
OAI: oai:DiVA.org:miun-25519
DiVA: diva2:841529
Subject / course
Computer Engineering DT1
Master of Science in Engineering - Computer Engineering TDTEA 300 higher education credits
Zhang, Tingting, Professor