Low Supervision, Low Corpus size, Low Similarity! Challenges in cross-lingual alignment of word embeddings: An exploration of the limitations of cross-lingual word embedding alignment in truly low resource scenarios
Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology (Language Technology)
2019 (English). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis.
Abstract [en]

Cross-lingual word embeddings are an increasingly important resource in cross-lingual methods for NLP, particularly for their role in transfer learning and unsupervised machine translation, purportedly opening up the opportunity for NLP applications for low-resource languages. However, most research in this area implicitly expects the availability of vast monolingual corpora for training embeddings, a scenario which is not realistic for many of the world's languages. Moreover, much of the reporting of the performance of cross-lingual word embeddings is based on a fairly narrow set of mostly European language pairs. Our study examines the performance of cross-lingual alignment across a more diverse set of language pairs; controls for the effect of the corpus size on which the monolingual embedding spaces are trained; and studies the impact of spectral graph properties of the embedding space on alignment. Through our experiments on a more diverse set of language pairs, we find that performance in bilingual lexicon induction is generally poor for heterogeneous pairs, and that even using a gold or heuristically derived dictionary has little impact on performance for these language pairs. We also find that performance for these languages increases only slowly with corpus size. Finally, we find a moderate correlation between the isospectral difference of the source and target embeddings and the performance of bilingual lexicon induction. We infer that methods other than cross-lingual alignment may be more appropriate for both low-resource languages and heterogeneous language pairs.
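For readers unfamiliar with the pipeline the abstract refers to, the sketch below illustrates the general idea rather than the thesis's exact Vecmap/FastText setup: a supervised orthogonal (Procrustes) mapping learned from a seed dictionary, nearest-neighbour bilingual lexicon induction, and a simple eigenvalue-based isospectral difference between two embedding spaces. The function names, the k-NN graph construction, and the number of eigenvalues compared are illustrative assumptions, not the configuration used in the thesis.

import numpy as np

def orthogonal_procrustes(X_src, Y_tgt):
    """Learn an orthogonal map W minimising ||X_src @ W - Y_tgt||_F.
    X_src, Y_tgt: (n, d) embeddings of the n seed-dictionary word pairs."""
    U, _, Vt = np.linalg.svd(X_src.T @ Y_tgt)
    return U @ Vt  # (d, d) rotation

def induce_lexicon(W, src_emb, tgt_emb, k=1):
    """Map all source vectors into the target space and return, for each,
    the indices of its k nearest target neighbours by cosine similarity."""
    mapped = src_emb @ W
    mapped /= np.linalg.norm(mapped, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sims = mapped @ tgt.T
    return np.argsort(-sims, axis=1)[:, :k]

def isospectral_difference(emb_a, emb_b, n_neighbors=10, n_eigs=20):
    """Rough spectral comparison (assumed setup): build a symmetric k-NN
    cosine-similarity graph for each space, take the smallest eigenvalues of
    the graph Laplacian, and sum the squared differences. Larger values
    suggest the two spaces are less nearly isomorphic."""
    def laplacian_eigs(emb):
        X = emb / np.linalg.norm(emb, axis=1, keepdims=True)
        sims = X @ X.T
        np.fill_diagonal(sims, 0.0)
        adj = np.zeros_like(sims)
        idx = np.argsort(-sims, axis=1)[:, :n_neighbors]  # strongest edges per node
        rows = np.arange(sims.shape[0])[:, None]
        adj[rows, idx] = sims[rows, idx]
        adj = np.maximum(adj, adj.T)          # symmetrise
        lap = np.diag(adj.sum(axis=1)) - adj  # unnormalised graph Laplacian
        return np.sort(np.linalg.eigvalsh(lap))[:n_eigs]
    ea, eb = laplacian_eigs(emb_a), laplacian_eigs(emb_b)
    return float(np.sum((ea - eb) ** 2))

The Procrustes step is the standard closed-form solution for an orthogonal mapping between two embedding spaces; the spectral function is only one way of operationalising "isospectral difference" and is included to make the abstract's final finding concrete.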

Place, publisher, year, edition, pages
2019, p. 53.
Keywords [en]
word embeddings, cross-lingual, multilingual, low-resource, corpus size, Vecmap, FastText, alignment, orthogonal, eigenvalues, Laplacian, isospectral, isomorphic, bilingual lexicon induction
National Category
Language Technology (Computational Linguistics)
Identifiers
URN: urn:nbn:se:uu:diva-395946
OAI: oai:DiVA.org:uu-395946
DiVA id: diva2:1365879
Subject / course
Language Technology
Educational program
Master Programme in Language Technology
Presentation
2019-10-07, 2-0023, Engelska Parken, Thunbergsvägen 3H, 751 20 UPPSALA, Uppsala, 16:31 (English)
Supervisors
Examiners
Available from: 2019-10-28. Created: 2019-10-25. Last updated: 2019-10-28. Bibliographically approved.

Open Access in DiVA

ADmastersthesis (2112 kB), 48 downloads
File information
File name: FULLTEXT01.pdf
File size: 2112 kB
Checksum: SHA-512
5b6be85bdca7abaa352574b03478d08a17e32a492e33f5699dd40ec0e8e6b25f0986a85c90b44b0dc884b07ba915b83fe921077ffa727c14467297d04d7fe592
Type: fulltext
Mimetype: application/pdf

By organisation
Department of Linguistics and Philology
Language Technology (Computational Linguistics)

Total: 48 downloads
The number of downloads is the sum of all downloads of full texts. It may include, for example, previous versions that are no longer available.

Total: 111 hits