Classifying Amharic Webnews
Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
Norwegian University of Science and Technology, Trondheim, Norway; SICS, Swedish Institute of Computer Science AB, Kista, Sweden.
Addis Ababa University, Addis Ababa, Ethiopia.
2009 (English). In: Information Retrieval (Boston), ISSN 1386-4564, E-ISSN 1573-7659, Vol. 12, no. 3, pp. 416-435. Journal article (peer-reviewed). Published.
Abstract [en]

We present work aimed at compiling an Amharic corpus from the Web and automatically categorizing the texts. Amharic is the second most spoken Semitic language in the world (after Arabic) and is used for countrywide communication in Ethiopia. It is highly inflectional and quite dialectally diversified. We discuss the issues of compiling and annotating a corpus of Amharic news articles from the Web. This corpus was then used in three sets of text classification experiments. Working with a less-researched language highlights a number of practical issues that might otherwise receive less attention or go unnoticed. The purpose of the experiments has not primarily been to develop a cutting-edge text classification system for Amharic, but rather to put the spotlight on some of these issues. The first two sets of experiments investigated the use of Self-Organizing Maps (SOMs) for document classification. Testing on small datasets, we first looked at classifying unseen data into 10 predefined categories of news items, and then at clustering it around query content, taking 16 queries as class labels. The third set of experiments investigated the effect of operations such as stemming and part-of-speech tagging on text classification performance. We compared three representations while constructing classification models based on bagging of decision trees for the 10 predefined news categories. The best accuracy was achieved using the full text as representation. A representation using only the nouns performed almost equally well, confirming the assumption that most of the information required for distinguishing between various categories is actually contained in the nouns, while stemming did not have much effect on the performance of the classifier.

Place, publisher, year, edition, pages
2009. Vol. 12, no. 3, pp. 416-435.
Keywords [en]
Web mining, Text classification, Semitic languages
National subject category
Computer and Information Sciences
Identifiers
URN: urn:nbn:se:su:diva-33390
DOI: 10.1007/s10791-008-9080-x
ISI: 000265040100010
OAI: oai:DiVA.org:su-33390
DiVA: diva2:283073
Note

Special Issue on Non-English Web Retrieval

Available from: 2009-12-23. Created: 2009-12-23. Last updated: 2015-09-18. Bibliographically reviewed.

By author/editor
Asker, Lars