Digitala Vetenskapliga Arkivet

Subword segmentation in highly compounding and morphologically complex languages
Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
2024 (English). Independent thesis, Advanced level (degree of Master (One Year)), 10 credits / 15 HE credits. Student thesis.
Abstract [en]

As large language models become ever more prevalent, there is an increased need to evaluate the factors that contribute to their performance. Previous research has established the morphological alignment of subword segmentation techniques as an important indicator of downstream performance. However, there is still a lack of research assessing these techniques in non-English languages with higher morphological complexity. Moreover, there is a research gap regarding how these techniques perform with respect to typological factors such as compounding. To address this gap, this thesis poses the following research question: "How do contemporary subword segmentation techniques perform on highly compounding and morphologically complex languages in terms of morphological alignment?" Using a mixed methods approach, this thesis combines quantitative and qualitative analysis to assess the performance of multiple subword segmentation techniques. The analysis is carried out on German and Swedish data, using English as a benchmark. Additionally, performance is analyzed by typological factors such as inflection, derivation, compounding, and affixes. The parallel CCMatrix dataset was used to train the algorithms, while MorphyNet and CoDeRooMor were used for evaluation. Furthermore, this thesis suggests an alternative approach to calculating metrics for subword segmentation, one that allows the concatenation of multiple subwords. A comparison with previous metrics demonstrates that the previous metrics may favor more aggressive segmentation techniques. Under the alternative metrics, the morphological superiority of the Unigram model is partly reversed across multiple languages and categories. This thesis thereby further contextualizes previous research on morphological alignment and encourages the use of alternative metrics.
Moreover, Byte-Pair Encoding (BPE) and the Unigram model generally showed higher morphological alignment than WordPiece across languages and typological factors. The results suggest that the Unigram model may be the best choice in the Swedish setting, whereas BPE may be the best choice in the German setting. This thesis therefore concludes that Byte-Pair Encoding and the Unigram model are the most applicable algorithms for highly compounding and morphologically complex languages.
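To make the notion of morphological alignment concrete, the following minimal Python sketch computes conventional boundary precision and recall for a single word, together with a second measure that counts a gold morpheme as recovered when it equals the concatenation of one or more consecutive subwords. The second measure is only one possible reading of "allowing the concatenation of multiple subwords"; the abstract does not give the thesis's actual definition. All function names and the example segmentation are hypothetical and are not taken from the thesis or its datasets.

```python
from itertools import accumulate


def boundaries(segments):
    """Return the internal split positions (character offsets) of a segmentation."""
    return set(accumulate(len(s) for s in segments[:-1]))


def boundary_precision_recall(predicted, gold):
    """Conventional boundary metrics: compare predicted and gold split positions."""
    pred_b, gold_b = boundaries(predicted), boundaries(gold)
    hits = len(pred_b & gold_b)
    precision = hits / len(pred_b) if pred_b else 1.0
    recall = hits / len(gold_b) if gold_b else 1.0
    return precision, recall


def morphemes_covered(predicted, gold):
    """Assumed illustration, not the thesis's metric: the share of gold morphemes
    that equal a concatenation of one or more consecutive predicted subwords."""
    cuts = boundaries(predicted) | {0, sum(len(s) for s in predicted)}
    start, covered = 0, 0
    for morpheme in gold:
        end = start + len(morpheme)
        # The morpheme is covered if subword cuts fall exactly on both of its edges.
        if start in cuts and end in cuts:
            covered += 1
        start = end
    return covered / len(gold)


# Hypothetical example: a German compound with an illustrative gold segmentation.
gold = ["haus", "tür", "schlüssel"]
predicted = ["haus", "tür", "sch", "lüssel"]

precision, recall = boundary_precision_recall(predicted, gold)
coverage = morphemes_covered(predicted, gold)
print(precision, recall, coverage)
```

In this example the extra split inside "schlüssel" lowers boundary precision, while the concatenation-based measure still counts all three morphemes as recovered, which shows how the choice of metric can change the ranking of more and less aggressive segmenters.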

Place, publisher, year, edition, pages
2024.
Keywords [en]
Subword segmentation, Byte-Pair Encoding, BPE, Unigram model, WordPiece, morphological alignment, fertility, OOV, LLMs, boundary metrics, inclusive boundary metrics, inclusive boundary recall
National Category
Natural Language Processing
Identifiers
URN: urn:nbn:se:su:diva-242818
OAI: oai:DiVA.org:su-242818
DiVA, id: diva2:1955751
Available from: 2025-04-30 Created: 2025-04-30

Open Access in DiVA

fulltext (547 kB), 17 downloads
File information
File name: FULLTEXT01.pdf
File size: 547 kB
Checksum (SHA-512): 7a085a9f4b1cac13af8d666cb0c22271ba162b2097e38079a6c123b9b5823dae2d391ca2ac096237b896d0a3be54b5f2ea4c5c8376a18c86f9160b217760a67a
Type: fulltext
Mimetype: application/pdf

Author
Törnqvist, Maximilian