Change search
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf
Word2vec2syn: Synonymidentifiering med Word2vec
Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning.
2019 (Swedish)Independent thesis Basic level (degree of Bachelor), 12 credits / 18 HE creditsStudent thesisAlternative title
Word2vec2syn : Synonym Identification using Word2vec (English)
Abstract [sv]

Inom NLP (eng. natural language processing) är synonymidentifiering en av de språkvetenskapliga utmaningarna som många antar. Fodina Language Technology AB är ett företag som skapat ett verktyg, Termograph, ämnad att samla termer inom företag och hålla den interna språkanvändningen konsekvent. En metodkombination bestående av språkteknologiska strategier utgör synonymidentifieringen och Fodina önskar ett större täckningsområde samt mer dynamik i framtagningsprocessen. Därav syftade detta arbete till att ta fram en ny metod, utöver metodkombinationen, för just synonymidentifiering. En färdigtränad Word2vec-modell användes och den inbyggda funktionen för cosinuslikheten användes för att få fram synonymer och skapa kluster. Modellen validerades, testades och utvärderades i förhållande till metodkombinationen. Valideringen visade att modellen skattade inom ett rimligt mänskligt spann i genomsnitt 60,30 % av gångerna och Spearmans korrelation visade på en signifikant stark korrelation. Testningen visade att 32 % av de bearbetade klustren innehöll matchande synonymförslag. Utvärderingen visade att i de fall som förslagen inte matchade så var modellens synonymförslag korrekta i 5,73 % av fallen jämfört med 3,07 % för metodkombinationen. Den interna reliabiliteten för utvärderarna visade på en befintlig men svag enighet, Fleiss Kappa = 0,19, CI(0,06, 0,33). Trots viss osäkerhet i resultaten påvisas ändå möjligheter för vidare användning av word2vec-modeller inom Fodinas synonymidentifiering.

Abstract [en]

One of the main challenges in the field of natural language processing (NLP) is synonym identification. Fodina Language Technology AB is the company behind the tool, Termograph, that aims to collect terms and provide a consistent language within companies. A combination of multiple methods from the field of language technology constitutes the synonym identification and Fodina would like to improve the area of coverage and increase the dynamics of the working process. The focus of this thesis was therefore to evaluate a new method for synonym identification beyond the already used combination. Initially a trained Word2vec model was used and for the synonym identification the built-in-function for cosine similarity was applied in order to create clusters. The model was validated, tested and evaluated relative to the combination. The validation implicated that the model made estimations within a fair human-based range in an average of 60.30% and Spearmans correlation indicated a strong significant correlation. The testing showed that 32% of the processed synonym clusters contained matching synonym suggestions. The evaluation showed that the synonym suggestions from the model was correct in 5.73% of all cases compared to 3.07% for the combination in the cases where the clusters did not match. The interrater reliability indicated a slight agreement, Fleiss’ Kappa = 0.19, CI(0.06, 0.33). Despite uncertainty in the results, opportunities for further use of Word2vec-models within Fodina’s synonym identification are nevertheless demonstrated.

Place, publisher, year, edition, pages
2019. , p. 38
Keywords [en]
Word2vec, synonym identification, vector space model, word vectors, cosine similarity
Keywords [sv]
Word2vec, synonymidentifiering, vektorrymdsmodell, ordvektorer, cosinuslikhet
National Category
Language Technology (Computational Linguistics)
Identifiers
URN: urn:nbn:se:liu:diva-157638ISRN: LIU-IDA/KOGVET-G--19/001--SEOAI: oai:DiVA.org:liu-157638DiVA, id: diva2:1326166
External cooperation
Fodina Language Technology AB
Subject / course
Cognitive science
Supervisors
Examiners
Available from: 2019-06-19 Created: 2019-06-17 Last updated: 2019-06-19Bibliographically approved

Open Access in DiVA

Word2vec2syn(774 kB)34 downloads
File information
File name FULLTEXT01.pdfFile size 774 kBChecksum SHA-512
e846ff062e66e28d85de51750ab1aa646b69ec582477836ce532408d078da08fbac25d96aa912492ab4f7df7bc8ff3b8cc39a8a36b14d96bfcb251198e5bf6d3
Type fulltextMimetype application/pdf

Search in DiVA

By author/editor
Pettersson, Tove
By organisation
The Division of Statistics and Machine Learning
Language Technology (Computational Linguistics)

Search outside of DiVA

GoogleGoogle Scholar
Total: 34 downloads
The number of downloads is the sum of all downloads of full texts. It may include eg previous versions that are now no longer available

urn-nbn

Altmetric score

urn-nbn
Total: 156 hits
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf