Text to features for Swedish text
2019 (English). Independent thesis, Basic level (degree of Bachelor), 10 credits / 15 HE credits
Student thesis
Abstract [en]
In text mining, texts are usually transformed into numerical vectors, or feature vectors, before they are given to a machine learning algorithm for text classification. In this project, a set of features for classifying tweets in Swedish was created. The following classification tasks were selected: gender, age and political party prediction, sentiment analysis, and authorship attribution, which is the task of determining whether a text was written by a particular author. Relevant previous studies were reviewed and a suitable subset of the features used in those studies was chosen. A tool was developed that preprocesses the tweets and calculates, for each tweet, values for the features in the feature set. Experiments were run on a data set consisting of tweets written by Swedish politicians. The output of the tool was given to a machine learning algorithm that created classification models. While the first four classification tasks were unsuccessful, some of the authorship attribution models achieved an F-score between 80% and 90%. For the failed classification tasks, the features need to be tested on a different data set or new features have to be created.
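The pipeline described in the abstract (preprocess each tweet, compute one value per feature, then evaluate a classifier by F-score) can be illustrated with a minimal sketch. The abstract does not list the features used in the thesis, so the feature names, the placeholder tokens `URL` and `USER`, and the helper names `preprocess`, `extract_features` and `f_score` below are all assumptions chosen for illustration, not the thesis tool itself.

```python
import re
from typing import Dict

# Hypothetical patterns; the actual preprocessing in the thesis is not specified.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
WORD_RE = re.compile(r"\w+")  # \w is Unicode-aware in Python 3, so å, ä, ö match


def preprocess(tweet: str) -> str:
    """Replace URLs and mentions with placeholder tokens and collapse whitespace."""
    text = URL_RE.sub("URL", tweet)
    text = MENTION_RE.sub("USER", text)
    return " ".join(text.split())


def extract_features(tweet: str) -> Dict[str, float]:
    """Map one tweet to a small, illustrative feature vector."""
    words = WORD_RE.findall(preprocess(tweet))
    n_words = len(words)
    return {
        "n_chars": float(len(tweet)),
        "n_words": float(n_words),
        "avg_word_len": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "n_hashtags": float(len(HASHTAG_RE.findall(tweet))),
        "n_mentions": float(len(MENTION_RE.findall(tweet))),
        "n_urls": float(len(URL_RE.findall(tweet))),
        "n_exclamations": float(tweet.count("!")),
    }


def f_score(tp: int, fp: int, fn: int) -> float:
    """F1 score: harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    example = "Vi röstar för #klimatet! @riksdagen https://example.com"
    print(extract_features(example))
    print(f_score(tp=8, fp=2, fn=2))
```

Each dictionary returned by `extract_features` corresponds to one row of the numerical input that a classification algorithm would receive. The F-score here is the balanced F1 variant, which is an assumption, since the abstract does not say which variant was reported.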
Place, publisher, year, edition, pages
2019. p. 42
Series
IT ; 19038
National Category
Engineering and Technology
Identifiers
URN: urn:nbn:se:uu:diva-396578
OAI: oai:DiVA.org:uu-396578
DiVA, id: diva2:1368316
Educational program
Bachelor Programme in Computer Science
Supervisors
Examiners
2019-11-06 2019-11-06 2019-11-06 Bibliographically approved