A Study on Text Classification Methods and Text Features
Linköping University, Department of Computer and Information Science.
2019 (English). Independent thesis, Basic level (degree of Bachelor), 12 credits / 18 HE credits. Student thesis.
Abstract [en]

In classification, the data used for training is the most crucial part. It follows that how this data is processed and presented to the classifier plays an equally important role. This thesis investigates the performance of multiple classifiers depending on the features used, the type of classes to classify, and the optimization of the classifiers. The classifiers of interest are a support-vector machine trained with sequential minimal optimization (SMO) and a multilayer perceptron (MLP). The features tested are word vector spaces and text complexity measures, along with principal component analysis (PCA) applied to the complexity measures. The features are created from the Stockholm-Umeå Corpus (SUC) and DigInclude, a dataset containing standard and easy-to-read sentences. For the SUC dataset, the classifiers attempted to classify texts into nine text categories, while for the DigInclude dataset the sentences were classified as either standard or simplified. The classification tasks on the DigInclude dataset showed poor performance in all trials. On the SUC dataset, the best performance was obtained using SMO in combination with word vector spaces. Comparing the SMO classifier on the text complexity measures with and without PCA showed that performance was largely unchanged between the two, although omitting PCA gave slightly better performance.
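To make the notion of text complexity measures concrete, the following minimal sketch computes three simple surface measures for a text. The abstract does not list which measures the thesis used, so the function name `complexity_features`, the three measures chosen (average sentence length, average word length, type-token ratio), and the example sentences are all assumptions made for illustration, not the thesis's feature set.

```python
import re


def complexity_features(text: str) -> dict:
    """Compute three simple surface measures (hypothetical feature set)."""
    # Split on runs of ., ! or ? and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    # Lowercased word tokens; \w+ also matches letters such as å, ä, ö.
    words = re.findall(r"\w+", text.lower())
    if not words or not sentences:
        # No measurable content: return zeros rather than dividing by zero.
        return {
            "avg_sentence_length": 0.0,
            "avg_word_length": 0.0,
            "type_token_ratio": 0.0,
        }
    return {
        # Mean number of words per sentence.
        "avg_sentence_length": len(words) / len(sentences),
        # Mean number of characters per word.
        "avg_word_length": sum(len(w) for w in words) / len(words),
        # Distinct words divided by total words.
        "type_token_ratio": len(set(words)) / len(words),
    }


# Illustrative sentences only; these are not taken from DigInclude.
standard = "The municipality administers applications for financial assistance."
simplified = "The town helps you. You can ask the town for money."

for label, sentence in (("standard", standard), ("simplified", simplified)):
    print(label, complexity_features(sentence))
```

Each text becomes a short numeric vector of this kind, which is the sort of input a classifier such as an SVM or MLP would receive, optionally after a dimensionality reduction step such as PCA.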

Place, publisher, year, edition, pages
2019. p. 29
Keywords [en]
NLP, Text Classification, SVM, MLP, PCA, SUC, DigInclude
National Category
Language Technology (Computational Linguistics)
Identifiers
URN: urn:nbn:se:liu:diva-159992
ISRN: LIU-IDA/KOGVET-G--19/025--SE
OAI: oai:DiVA.org:liu-159992
DiVA, id: diva2:1347558
Subject / course
Cognitive science
Available from: 2019-09-05. Created: 2019-09-01. Last updated: 2019-09-05. Bibliographically approved.

Open Access in DiVA

File information
File name: FULLTEXT01.pdf
File size: 218 kB
Checksum (SHA-512): 35e5f956d6eebf38b77f7325b5f63a807f54a36bd7e3acd07226a19f77202cffc03899fe51bd1b69b368d890b9f98a3778965527a99c14018dbcde85eb0ca14a
Type: fulltext
Mimetype: application/pdf

Author: Danielsson, Benjamin