Comparison of Automatic Classifiers’ Performances using Word-based Feature Extraction Techniques in an E-government setting
Independent thesis, Advanced level (professional degree), 20 credits / 30 HE credits. Student thesis.
Nowadays, email is commonly used by citizens to communicate with their government. Many of the received emails concern common queries and subjects that handling officers must answer manually. Automatic classification of incoming emails increases communication efficiency by reducing the delay between a query and its response.
This thesis is part of the IMAIL project, which aims to provide an automatic answering solution for the Swedish Social Insurance Agency (SSIA; “Försäkringskassan” in Swedish). The goal of this thesis is to analyze and compare the classification performance of different sets of features extracted from SSIA emails across several automatic classifiers. The features extracted from the emails also depend on the preprocessing carried out beforehand: compound splitting, lemmatization, stop-word removal, Part-of-Speech (PoS) tagging and n-grams are the processes applied to the data set. Classification is performed with Support Vector Machines (SVM), k-Nearest Neighbors (kNN) and Naive Bayes, and the results are analyzed and compared using precision, recall and F-measure.
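The comparison described above can be sketched as follows. This is an illustrative sketch only, not the thesis code (which, per the keywords, used WEKA): word-based features are extracted from a handful of hypothetical stand-in emails, and the three classifiers compared in the thesis are each evaluated with precision, recall and F-measure, here via scikit-learn.

```python
# Illustrative pipeline sketch (assumption: scikit-learn in place of WEKA).
# The emails and labels below are hypothetical stand-ins for the SSIA data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_recall_fscore_support

emails = [
    "when will my parental benefit be paid",
    "how do I report sickness for my child",
    "payment of parental benefit is delayed",
    "child sickness report form question",
    "parental benefit payment date",
    "report sick child online",
]
labels = ["payment", "sickness", "payment", "sickness", "payment", "sickness"]

# Word-based feature extraction; the thesis additionally varied the
# preprocessing (compound splitting, lemmatization, stop-word removal,
# PoS tagging, n-grams), which is omitted here for brevity.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(emails)

results = {}
for clf in (LinearSVC(), MultinomialNB(), KNeighborsClassifier(n_neighbors=3)):
    clf.fit(X, labels)
    predicted = clf.predict(X)  # evaluated on the training set, toy-sized data
    p, r, f, _ = precision_recall_fscore_support(
        labels, predicted, average="macro", zero_division=0
    )
    results[type(clf).__name__] = (p, r, f)
    print(f"{type(clf).__name__}: precision={p:.3f} recall={r:.3f} F={f:.3f}")
```

In the actual study the classifiers would of course be evaluated on held-out data (e.g. cross-validation) rather than the training set.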
From the results obtained in this thesis, SVM provides the best overall classification, with an F-measure of 0.787. However, Naive Bayes classifies most of the individual email categories better than SVM. Thus, it cannot be concluded that SVM classifies better than Naive Bayes.
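The F-measure reported here is, presumably, the balanced F1 score: the harmonic mean of precision and recall. A minimal sketch (the precision/recall values in the example are illustrative, not taken from the thesis):

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 when beta=1)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Illustrative values only:
print(round(f_measure(0.80, 0.75), 3))  # → 0.774
```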
Furthermore, a comparison with Dalianis et al. (2011) is made. The results obtained in this approach outperform the earlier ones: SVM achieved an F-measure of 0.858 when using PoS-tagging on original emails, improving on the 0.83 reported in Dalianis et al. (2011) by almost 3 percentage points. In this case, SVM was clearly better than Naive Bayes.
Place, publisher, year, edition, pages
2011. 64 p.
Keywords
E-government, machine learning, WEKA, SVM, Naive Bayes, kNN, Swedish, PoS-tagging, feature extraction, feature selection, automatic e-mail classification
Computer and Information Science
Identifiers
URN: urn:nbn:se:kth:diva-32363
OAI: oai:DiVA.org:kth-32363
DiVA: diva2:410293
Dalianis, Hercules, Senior Lecturer (“Univ. lektor”)