Using wavelet analysis for text categorization in digital libraries: a first experiment with Strathprints
University of Borås, Swedish School of Library and Information Science.
2011 (English). In: International Journal on Digital Libraries, ISSN 1432-5012, E-ISSN 1432-1300. Article in journal (Refereed)
Abstract [en]

Digital libraries increasingly benefit from research on automated text categorization for improved access. Such research is typically carried out by using standard test collections. In this paper we present a pilot experiment of replacing such test collections by a set of 6000 objects from a real-world digital repository, indexed by Library of Congress Subject Headings, and test support vector machines in a supervised learning setting for their ability to reproduce the existing classification. To augment the standard approach, we introduce a combination of two novel elements: using functions for document content representation in Hilbert space, and adding extra semantics from lexical resources to the representation. Results suggest that wavelet-based kernels slightly outperformed traditional kernels on classification reconstruction from abstracts and vice versa from full-text documents, the latter outcome due to word sense ambiguity. The practical implementation of our methodological framework enhances the analysis and representation of specific knowledge relevant to large-scale digital collections, in this case the thematic coverage of the collections. Representation of specific knowledge about digital collections is one of the basic elements of the persistent archives and the less studied one (compared to representations of digital objects and collections). Our research is an initial step in this direction, developing further the methodological approach and demonstrating that text categorisation can be applied to analyse the thematic coverage in digital repositories.

Keyword [en]
kernel methods, text classification, support vector machines, semantic enrichment, hilbert spaces, digital libraries, text categorization, machine learning, analogical information representation, wavelet analysis
National Category
Information Studies; Computer and Information Science; Language Technology (Computational Linguistics)
Research subject
Library and Information Science
URN: urn:nbn:se:hb:diva-3241; Local ID: 2320/9820; OAI: diva2:871338
Available from: 2015-11-13 Created: 2015-11-13


Author/editor
Darányi, Sándor; Wittek, Peter