Labeling Clusters of Search Results
KTH, School of Computer Science and Communication (CSC).
2013 (English). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis.
Alternative title
A rubricera kluster av sökresultat (Swedish)
Abstract [en]

This project evaluates algorithms that could be used in a summarizing information retrieval (IR) application for Swedish texts. Instead of the traditional list of search results, the application would generate a summary document covering the various subtopics of an IR query.

First, it is noted that finding the subtopics of a query requires some kind of document clustering. k-means is chosen as a candidate document clustering algorithm and evaluated in the setting of an IR application. It is found to be fast enough and to work better than random clustering, although it is argued that it is not good enough to be used in a summary/labeling context.

Secondly, the project looks into labeling algorithms to be used in the aforementioned IR application. Four algorithms were evaluated: TF centroid, TF-IDF centroid, mutual information and CorePhrase. None were deemed to generate labels of high enough quality to be useful, but it was noted that TF-IDF is a good choice when constructing labels from an IR query.
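
To make the TF-IDF centroid idea concrete, here is a minimal sketch in Python. It is not the thesis's implementation: the abstract does not state the exact weighting or tokenisation used, so the plain `tf * log(N / df)` weighting, the function name `tfidf_centroid_labels`, and the toy English documents are all assumptions chosen for illustration. The sketch takes clusters that have already been formed (for example by k-means), averages the TF-IDF vectors of the documents in each cluster, and uses the highest-weighted centroid terms as the label.

```python
import math
from collections import Counter


def tfidf_centroid_labels(clusters, top_n=2):
    """Return the top_n highest-weighted centroid terms for each cluster.

    clusters: list of clusters, each a list of tokenised documents.
    (Hypothetical helper; the weighting scheme is an assumption.)
    """
    docs = [doc for cluster in clusters for doc in cluster]
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))

    labels = []
    for cluster in clusters:
        centroid = Counter()
        for doc in cluster:
            for term, count in Counter(doc).items():
                # tf * log(N / df), averaged over the documents in the cluster.
                weight = count * math.log(n_docs / df[term])
                centroid[term] += weight / len(cluster)
        # Highest weight first; ties are broken alphabetically.
        ranked = sorted(centroid.items(), key=lambda kv: (-kv[1], kv[0]))
        labels.append([term for term, _ in ranked[:top_n]])
    return labels


# Toy search result for the query "jaguar", already split into two clusters.
clusters = [
    [["jaguar", "car", "engine"], ["jaguar", "car", "dealer"], ["car", "price"]],
    [["jaguar", "cat", "jungle"], ["jaguar", "cat", "prey"], ["cat", "habitat"]],
]
print(tfidf_centroid_labels(clusters, top_n=1))
```

In this toy data the query term "jaguar" occurs in both clusters and is therefore down-weighted by the IDF factor, so the terms that separate the two subtopics rise to the top of each centroid.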

Abstract [sv] (English translation)

This project evaluates different algorithms that could be used in a summarizing information retrieval application for Swedish texts. Instead of the traditional list of search results, the application is to generate a summary of the various subtopics of a search result.

First, it is established that finding the subtopics of a search query requires some form of document clustering. The k-means algorithm is chosen as a suitable candidate and evaluated in the context of a search application. It is considered fast enough and gives better results than chance, although it is emphasised that the quality is not good enough for use in a summarization/labeling context.

Furthermore, the project looks at labeling algorithms to be used in the aforementioned search application. Four algorithms are evaluated: TF centroid, TF-IDF centroid, mutual information and CorePhrase. None of these algorithms is judged to generate labels of high enough quality to be useful, but it is noted that TF-IDF is the better choice when labels are constructed from a search result.

Place, publisher, year, edition, pages
2013.
National Category
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-142444
OAI: oai:DiVA.org:kth-142444
DiVA: diva2:701680
Educational program
Master of Science in Engineering - Computer Science and Technology
Supervisors
Examiners
Available from: 2014-03-11 Created: 2014-03-04 Last updated: 2014-03-11 Bibliographically approved

Open Access in DiVA

fulltext (1082 kB), 177 downloads
File information
File name: FULLTEXT01.pdf
File size: 1082 kB
Checksum (SHA-512): 0ca9dee0bf7835d6f84364c944d9c42f9db7637aa8a65a5bb41d4c3428b788d3f8911b70bf8f84ae56e872b546b8ae0716983167de1b5003fed5652d88fc013f
Type: fulltext
Mimetype: application/pdf
