Digitala Vetenskapliga Arkivet

Word Clustering in an Interactive Text Analysis Tool
Linköping University, Department of Computer and Information Science, Human-Centered systems.
2019 (English). Independent thesis, advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis.
Alternative title
Klustring av ord i ett interaktivt textanalysverktyg (Swedish)
Abstract [en]

A central task for users of the text analysis tool Gavagai Explorer is to look through a list of words and arrange them in groups. This thesis explores the use of word clustering to arrange the words automatically into groups intended to help users. A new word clustering algorithm is introduced, which attempts to produce word clusters small enough for a user to quickly grasp the common theme of the words. The proposed algorithm computes similarities among words using word embeddings, and clusters them using hierarchical graph clustering. Multiple variants of the algorithm are evaluated in an unsupervised manner by analysing the clusters they produce when applied to 110 data sets previously analysed by users of Gavagai Explorer. A supervised evaluation is performed to compare the clusters to the groups of words previously created by users of Gavagai Explorer. Results show that it was possible to choose a set of hyperparameters deemed to perform well across most data sets in the unsupervised evaluation. These hyperparameters also performed among the best in the supervised evaluation. It was concluded that the choice of word embedding and graph clustering algorithm had little impact on the behaviour of the algorithm. Rather, limiting the maximum size of clusters and filtering out similarities between words had a much larger impact on behaviour.
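The pipeline outlined in the abstract (embedding-based similarities, similarity filtering, hierarchical graph clustering, and a cap on cluster size) can be illustrated with a small sketch. This is not the algorithm from the thesis: the toy vectors, the function names `cosine`, `components` and `cluster_words`, and the parameters `min_similarity`, `max_size` and `step` are all assumptions made for illustration, and the splitting strategy (raising the similarity threshold on oversized connected components) is a simple stand-in for the graph clustering methods the thesis evaluates.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def components(words, edges, threshold):
    """Connected components among `words`, keeping only edges with similarity >= threshold."""
    adjacency = {w: set() for w in words}
    for (a, b), sim in edges.items():
        if sim >= threshold and a in adjacency and b in adjacency:
            adjacency[a].add(b)
            adjacency[b].add(a)
    seen, result = set(), []
    for w in words:
        if w in seen:
            continue
        seen.add(w)
        stack, comp = [w], []
        while stack:
            node = stack.pop()
            comp.append(node)
            for nxt in adjacency[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        result.append(sorted(comp))
    return result


def cluster_words(embeddings, min_similarity=0.8, max_size=2, step=0.01):
    """Hypothetical sketch: filter weak similarities, then split oversized groups.

    Similarities below `min_similarity` are dropped. Any group larger than
    `max_size` is split again at a stricter threshold, which yields a
    top-down hierarchy of connected components.
    """
    words = sorted(embeddings)
    edges = {
        (a, b): cosine(embeddings[a], embeddings[b])
        for a, b in combinations(words, 2)
    }

    def split(group, threshold):
        if len(group) <= max_size:
            return [group]
        if threshold > 1.0:
            # No threshold separates the group: fall back to fixed-size chunks.
            return [group[i:i + max_size] for i in range(0, len(group), max_size)]
        out = []
        for comp in components(group, edges, threshold):
            out.extend(split(comp, threshold + step))
        return out

    clusters = []
    for comp in components(words, edges, min_similarity):
        clusters.extend(split(comp, min_similarity + step))
    return sorted(clusters)


# Toy 3-dimensional "embeddings"; real word embeddings have hundreds of dimensions.
toy_embeddings = {
    "cheap": (0.9, 0.1, 0.0),
    "affordable": (0.85, 0.15, 0.0),
    "expensive": (0.8, 0.3, 0.1),
    "friendly": (0.1, 0.9, 0.1),
    "helpful": (0.15, 0.85, 0.1),
    "slow": (0.0, 0.1, 0.9),
}

clusters = cluster_words(toy_embeddings, min_similarity=0.8, max_size=2, step=0.01)
for cluster in clusters:
    print(cluster)
```

In this sketch, `max_size` and `min_similarity` play the role of the two hyperparameters the abstract reports as most influential: the first bounds how many words a user must read per cluster, and the second removes weak links before any clustering happens.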

Place, publisher, year, edition, pages
2019. p. 49
Keywords [en]
word clustering, word embedding, distributional semantics, hierarchical clustering, text analytics, language technology, natural language processing, gavagai
National Category
Natural Language Processing
Identifiers
URN: urn:nbn:se:liu:diva-157497
ISRN: LIU-IDA/LITH-EX-A--19/028--SE
OAI: oai:DiVA.org:liu-157497
DiVA, id: diva2:1324935
External cooperation
Gavagai AB
Subject / course
Computer Engineering
Supervisors
Examiners
Available from: 2019-06-14
Created: 2019-06-14
Last updated: 2025-02-07
Bibliographically approved

Open Access in DiVA

fulltext (670 kB), 569 downloads
File information
File name: FULLTEXT01.pdf
File size: 670 kB
Checksum (SHA-512): cf12cdfde68835cd8a5a16478fe77f67406900b60021a41bb6a91643ce131a61a68f6bd6a6ea1a930f528a14068b0341f8449baffd3904d32a9aac9dff42a1bf
Type: fulltext
Mimetype: application/pdf

Author
Gränsbo, Gustav