Subword segmentation in highly compounding and morphologically complex languages
2024 (English) Independent thesis Advanced level (degree of Master (One Year)), 10 credits / 15 HE credits
Student thesis
Abstract [en]
As large language models become ever more prevalent, there is an increased need to evaluate the factors that contribute to their performance. Previous research has established the morphological alignment of subword segmentation techniques as an important indicator of downstream performance. However, there is still a lack of research assessing these techniques in non-English languages with higher morphological complexity. Moreover, there is a research gap regarding how these techniques perform with respect to typological factors such as compounding. To address this gap, this thesis poses the following research question: How do contemporary subword segmentation techniques perform on highly compounding and morphologically complex languages in terms of morphological alignment? Using a mixed-methods approach, this thesis combines quantitative and qualitative analyses to assess the performance of multiple subword segmentation techniques. The analysis is carried out on German and Swedish data, using English as a benchmark. Additionally, performance is analyzed by typological factors such as inflection, derivation, compounding, and affixes. The language-parallel CCMatrix dataset was used to train the algorithms, while MorphyNet and CoDeRooMor were used for evaluation. Furthermore, this thesis suggests an alternative approach to calculating metrics for subword segmentation, one that allows the concatenation of multiple subwords. By comparison with previous metrics, this thesis demonstrates how those metrics may favor more aggressive segmentation techniques. Consequently, the morphological superiority of the Unigram model is partly reversed across multiple languages and categories. This thesis thereby further contextualizes previous research on morphological alignment and encourages the use of alternative metrics.
Moreover, Byte-Pair Encoding (BPE) and the Unigram model generally showed higher morphological alignment than WordPiece across languages and typological factors. This thesis suggests that the Unigram model may be the best choice in the Swedish setting, whereas BPE may be the best choice in the German setting. Thus, this thesis concludes that Byte-Pair Encoding and the Unigram model are the most applicable algorithms for highly compounding and morphologically complex languages.
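The abstract does not define its metrics, so the following is only an illustrative sketch of how morphological alignment is commonly scored. It assumes a boundary-based precision and recall, plus one possible reading of a metric that "allows the concatenation of multiple subwords": a gold morpheme counts as recovered when it equals one or more consecutive subwords joined together. All function names and the example word are hypothetical and are not taken from the thesis.

```python
def boundaries(segments):
    """Return the set of internal boundary offsets of a segmented word."""
    offsets, pos = set(), 0
    for seg in segments[:-1]:
        pos += len(seg)
        offsets.add(pos)
    return offsets


def boundary_precision_recall(predicted, gold):
    """Boundary precision and recall of a tokenizer output against gold morphemes."""
    pred_b, gold_b = boundaries(predicted), boundaries(gold)
    hits = len(pred_b & gold_b)
    # Convention assumed here: an empty boundary set scores 1.0.
    precision = hits / len(pred_b) if pred_b else 1.0
    recall = hits / len(gold_b) if gold_b else 1.0
    return precision, recall


def inclusive_morpheme_recall(predicted, gold):
    """Share of gold morphemes that equal a concatenation of consecutive subwords.

    Assumed reading of an "inclusive" metric, not the thesis's definition:
    a morpheme is recovered when both of its edges fall on subword edges.
    """
    edges = boundaries(predicted) | {0, sum(len(s) for s in predicted)}
    pos, found = 0, 0
    for morph in gold:
        start, end = pos, pos + len(morph)
        if start in edges and end in edges:
            found += 1
        pos = end
    return found / len(gold) if gold else 1.0


# Hypothetical German example: "handschuhe" = hand + schuh + e
gold = ["hand", "schuh", "e"]
predicted = ["hand", "sch", "uh", "e"]
precision, recall = boundary_precision_recall(predicted, gold)
inclusive = inclusive_morpheme_recall(predicted, gold)
```

In this example the extra split inside "schuh" lowers boundary precision while leaving boundary recall untouched, which illustrates how a recall-oriented boundary metric can reward more aggressive segmentation.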
Place, publisher, year, edition, pages
2024.
Keywords [en]
Subword segmentation, Byte-Pair Encoding, BPE, Unigram model, WordPiece, morphological alignment, fertility, OOV, LLMs, boundary metrics, inclusive boundary metrics, inclusive boundary recall
National Category
Natural Language Processing
Identifiers
URN: urn:nbn:se:su:diva-242818 OAI: oai:DiVA.org:su-242818 DiVA, id: diva2:1955751
2025-04-30 2025-04-30