Using wavelet analysis for text categorization in digital libraries: a first experiment with Strathprints
University of Borås, Swedish School of Library and Information Science.
University of Borås, Swedish School of Library and Information Science.
2011 (English). In: International Journal on Digital Libraries, ISSN 1432-5012, E-ISSN 1432-1300. Article in journal (Refereed)
Abstract [en]

Digital libraries increasingly benefit from research on automated text categorization for improved access. Such research is typically carried out by using standard test collections. In this paper we present a pilot experiment of replacing such test collections by a set of 6000 objects from a real-world digital repository, indexed by Library of Congress Subject Headings, and test support vector machines in a supervised learning setting for their ability to reproduce the existing classification. To augment the standard approach, we introduce a combination of two novel elements: using functions for document content representation in Hilbert space, and adding extra semantics from lexical resources to the representation. Results suggest that wavelet-based kernels slightly outperformed traditional kernels on classification reconstruction from abstracts and vice versa from full-text documents, the latter outcome due to word sense ambiguity. The practical implementation of our methodological framework enhances the analysis and representation of specific knowledge relevant to large-scale digital collections, in this case the thematic coverage of the collections. Representation of specific knowledge about digital collections is one of the basic elements of the persistent archives and the less studied one (compared to representations of digital objects and collections). Our research is an initial step in this direction developing further the methodological approach and demonstrating that text categorisation can be applied to analyse the thematic coverage in digital repositories.

Place, publisher, year, edition, pages
2011.
Keyword [en]
kernel methods, text classification, support vector machines, semantic enrichment, hilbert spaces, digital libraries, text categorization, machine learning, analogical information representation, wavelet analysis
National Category
Information Studies; Computer and Information Science; Language Technology (Computational Linguistics)
Research subject
Library and Information Science
Identifiers
URN: urn:nbn:se:hb:diva-3241
Local ID: 2320/9820
OAI: oai:DiVA.org:hb-3241
DiVA: diva2:871338
Available from: 2015-11-13 Created: 2015-11-13 Last updated: 2017-12-01

Open Access in DiVA

fulltext (329 kB), 410 downloads
File information
File name: FULLTEXT01.pdf
File size: 329 kB
Checksum (SHA-512): ab228805fd57ce966874262e266a4a4bcb59dc98a26dacc16e8119942e4ece96357028880e88318bd62c329bc5a7f8f893cc0ebf565678776867ba0b50f47927
Type: fulltext
Mimetype: application/pdf

By author/editor
Darányi, Sándor; Wittek, Peter