With or without context: Automatic text categorization using semantic kernels
University of Borås, Faculty of Librarianship, Information, Education and IT.
2016 (English) Doctoral thesis, monograph (Other academic)
Abstract [en]

In this thesis, text categorization is investigated along four dimensions of analysis: theoretically as well as empirically, and as a manual as well as a machine-based process. In the first four chapters we look at the theoretical foundation of subject classification of text documents, with a particular focus on classification as a procedure for organizing documents in libraries. A working hypothesis used in the theoretical analysis is that classification of documents is a process that involves translations between statements in different languages, both natural and artificial. We further investigate the close relationships between structures in classification languages and the order relations and topological structures that arise from classification.

A classification algorithm that receives special focus in the subsequent chapters is the support vector machine (SVM), which in its original formulation is a binary classifier in linear vector spaces, but has been extended to handle classification problems for which the categories are not linearly separable. To this end the algorithm utilizes a category of functions called kernels, which induce feature spaces by means of high-dimensional and often non-linear maps. For the empirical part of this study we investigate the classification performance of semantic kernels generated by different measures of semantic similarity. One category of such measures is based on the latent semantic analysis and random indexing methods, which generate term vectors by using co-occurrence data from text collections. Another semantic measure used in this study is pointwise mutual information. In addition to the empirical study of semantic kernels, we also investigate the performance of a term weighting scheme called divergence from randomness, which has hitherto received little attention within the area of automatic text categorization.
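
As a rough illustration of the idea of a semantic kernel, the following minimal sketch builds a term-term similarity matrix from pointwise mutual information and uses it inside a kernel of the form k(x, y) = x^T S S^T y. This is an assumption-laden sketch and not the construction used in the thesis: the function names `pmi_matrix` and `semantic_kernel`, the use of positive PMI over document-level co-occurrence, the kernel form, and the toy corpus are all chosen here purely for illustration.

```python
import numpy as np

def pmi_matrix(X):
    """Positive PMI between terms, from a document-term matrix X (docs x terms).

    Assumption: co-occurrence is counted at document level (two terms
    co-occur if they appear in the same document).
    """
    B = (np.asarray(X) > 0).astype(float)   # binary term occurrence
    n_docs = B.shape[0]
    co = B.T @ B                            # term-term co-occurrence counts
    p_ij = co / n_docs                      # joint probabilities
    p_i = np.diag(co) / n_docs              # marginal probabilities
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_ij / np.outer(p_i, p_i))
    pmi[~np.isfinite(pmi)] = 0.0            # log(0) and 0/0 cases
    return np.maximum(pmi, 0.0)             # keep positive associations only

def semantic_kernel(X1, X2, S):
    """Gram matrix of k(x, y) = x^T S S^T y.

    S S^T is positive semidefinite for any S, so this is a valid kernel.
    """
    return (np.asarray(X1) @ S) @ (np.asarray(X2) @ S).T

# Hypothetical toy corpus; columns: car, automobile, engine, banana, apple
X = np.array([
    [1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1],
    [0, 0, 1, 0, 0],
])
S = pmi_matrix(X)

a = np.array([[1, 0, 0, 0, 0]])  # a document containing only "car"
b = np.array([[0, 1, 0, 0, 0]])  # a document containing only "automobile"

linear = (a @ b.T)[0, 0]                  # standard linear kernel
semantic = semantic_kernel(a, b, S)[0, 0]
print(linear, semantic)
```

In this toy example the two documents share no terms, so the linear kernel is zero, whereas the semantic kernel is positive because "car" and "automobile" co-occur in the corpus. A Gram matrix computed this way could be passed to an SVM implementation that accepts precomputed kernels, for example scikit-learn's `SVC(kernel="precomputed")`.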

The results of the empirical part of this study show that the semantic kernels generally outperform the “standard” (non-semantic) linear kernel, especially for small training sets. With respect to the investigated datasets, we can therefore conclude that semantic information in the kernel generally improves its classification performance, and that the difference between the standard kernel and the semantic kernels is particularly large for small training sets. Another clear trend in the results is that the divergence from randomness weighting scheme yields a classification performance surpassing that of the common tf-idf weighting scheme.

Place, publisher, year, edition, pages
Högskolan i Borås, 2016. 300 p.
Series
Skrifter från Valfrid, ISSN 1103-6990 ; 60
Keyword [en]
automatic text categorization, subject classification, machine learning, computational linguistics, support vector machines, semantic kernels, term weighting, divergence from randomness
National Category
Information Studies
Research subject
Library and Information Science
Identifiers
URN: urn:nbn:se:hb:diva-8949
ISBN: 978-91-981654-8-7 (print)
ISBN: 978-91-981654-9-4 (print)
OAI: oai:DiVA.org:hb-8949
DiVA: diva2:906045
Public defence
2016-04-15, C203, Allégatan 1, Borås, 13:00
Available from: 2016-02-24
Created: 2016-02-23
Last updated: 2016-03-23
Bibliographically approved

Open Access in DiVA

cover (4709 kB), 97 downloads
File information
File name: COVER01.pdf
File size: 4709 kB
Checksum (SHA-512): 160e886aa5624f1a63d49a95f9b9689d0bba14dd5ef84bfec6fb954e7e5cf86356663cb1997bbe0ca4366c5e987da625688b7f7f47bda7b71fcbe27f1f6b273f
Type: cover
Mimetype: application/pdf
spikblad (45 kB), 62 downloads
File information
File name: SPIKBLAD01.pdf
File size: 45 kB
Checksum (SHA-512): 178c1f93766219a7641c0a436fad95571bb588f763a46c563fc805ee799a6f6357ddc133134aaff1e29525c64184523d0c8b3decdddacde036894efd323665ea
Type: spikblad
Mimetype: application/pdf
fulltext (1388 kB), 666 downloads
File information
File name: FULLTEXT01.pdf
File size: 1388 kB
Checksum (SHA-512): 6f124006cafb01bbe558c3b5c08f96460c1478e68c19369a49f77c67756b16f7ae5955a3cdb0dd898d82e94a0ff75303562c72edd09bea0e220cdb9636a31a9a
Type: fulltext
Mimetype: application/pdf

Author
Eklund, Johan
