Digitala Vetenskapliga Arkivet

The Challenge of Diacritics in Yorùbá Embeddings
Luleå University of Technology, Department of Computer Science, Electrical and Space Engineering, EISLAB. ORCID iD: 0000-0002-5582-2031
Luleå University of Technology, Department of Computer Science, Electrical and Space Engineering, EISLAB. ORCID iD: 0000-0002-6756-0147
Luleå University of Technology, Department of Computer Science, Electrical and Space Engineering, EISLAB. ORCID iD: 0000-0003-4029-6574
2020 (English). In: ML4D 2020 Proceedings / [ed] Tejumade Afonja; Konstantin Klemmer; Aya Salama; Paula Rodriguez Diaz; Niveditha Kalavakonda; Oluwafemi Azeez, Neural Information Processing Systems Foundation, 2020, article id 2011.07605. Conference paper, Published paper (Refereed)
Abstract [en]

The major contributions of this work are empirical evidence that Yoruba embeddings perform better when trained on an undiacritized (normalized) dataset, and the provision of new analogy sets for evaluation. The Yoruba language, being a tonal language, uses diacritics (tonal marks) in its written form. We show that this affects embedding performance by creating embeddings from exactly the same Wikipedia dataset, with the second copy normalized to be undiacritized. We further compare average intrinsic performance with two other works (using an analogy test set and WordSim), and we obtain the best performance on WordSim and the corresponding Spearman correlation.
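
The normalization step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the authors removed diacritics, so the approach here (Unicode decomposition followed by dropping combining marks) and the function name `strip_diacritics` are assumptions for illustration only. Note that this sketch also removes the underdots of letters such as ẹ, ọ and ṣ, which may or may not match the normalization used in the paper.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Return text with all combining marks removed (hypothetical helper)."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" (nonspacing mark) covers tone marks and underdots.
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Recompose so the result is in the common NFC form.
    return unicodedata.normalize("NFC", stripped)


if __name__ == "__main__":
    print(strip_diacritics("Yorùbá"))
```

Applying such a function to every line of a corpus before training would yield the undiacritized variant of the same dataset, so that the two sets of embeddings differ only in this preprocessing step.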

Place, publisher, year, edition, pages
Neural Information Processing Systems Foundation, 2020. article id 2011.07605
Keywords [en]
Yoruba, NLP, Diacritics, Embeddings
National category
Computer Sciences
Research subject
Machine Learning
Identifiers
URN: urn:nbn:se:ltu:diva-81569
OAI: oai:DiVA.org:ltu-81569
DiVA, id: diva2:1503262
Conference
Workshop: Machine Learning for the Developing World (ML4D) at 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Online, December 12, 2020
Funder
Vinnova, 2019-02996
Available from: 2020-11-23 Created: 2020-11-23 Last updated: 2025-10-22 Bibliographically approved
In thesis
1. Word Vector Representations using Shallow Neural Networks
2021 (English). Licentiate thesis, comprehensive summary (Other academic)
Abstract [en]

This work highlights some important factors for consideration when developing word vector representations and data-driven conversational systems. The neural network methods for creating word embeddings have gained more prominence than their older, count-based counterparts. However, there are still challenges, such as prolonged training time and the need for more data, especially with deep neural networks. Shallow neural networks, having less depth, appear to have the advantage of less complexity. However, they also face challenges, such as sub-optimal combinations of hyper-parameters, which produce sub-optimal models. This work, therefore, investigates the following research questions: "How importantly do hyper-parameters influence word embeddings’ performance?" and "What factors are important for developing ethical and robust conversational systems?" In answering the questions, various experiments were conducted using different datasets in different studies. The first study investigates, empirically, various hyper-parameter combinations for creating word vectors and their impact on two downstream natural language processing (NLP) tasks: named entity recognition (NER) and sentiment analysis (SA). The study shows that optimal performance of embeddings for downstream NLP tasks depends on the task at hand. It also shows that certain combinations give strong performance across the tasks chosen for the study. Furthermore, it shows that reasonably smaller corpora are sufficient, or even produce better models in some cases, and take less time to train and load. This is important, especially now that environmental considerations play a prominent role in ethical research. Subsequent studies build on the findings of the first and explore the hyper-parameter combinations for Swedish and English embeddings for the downstream NER task. The second study presents the new Swedish analogy test set for evaluation of Swedish embeddings. Furthermore, it shows that character n-grams are useful for Swedish, a morphologically rich language. The third study shows that broad coverage of topics in a corpus appears to be important to produce better embeddings and that noise may be helpful in certain instances, though it is generally harmful. Hence, a relatively smaller corpus can show better performance than a larger one, as demonstrated in the work with the smaller Swedish Wikipedia corpus against the Swedish Gigaword. The argument is made in the final study (in answering the second question), from the point of view of the philosophy of science, that the near-elimination of unwanted bias in training data and the use of fora like peer review, conferences, and journals to provide the necessary avenues for criticism and feedback are instrumental for the development of ethical and robust conversational systems.
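
To make the remark about character n-grams concrete, the following is a minimal sketch of subword extraction in the style popularized by fastText, where a word is wrapped in boundary markers and split into overlapping character sequences. The function name `char_ngrams`, the example word and the chosen n-gram range are assumptions for illustration; the abstract does not state which scheme or settings the thesis used.

```python
from typing import List


def char_ngrams(word: str, n_min: int = 3, n_max: int = 6) -> List[str]:
    """Return all character n-grams of a word, including boundary markers."""
    # "<" and ">" mark the start and end of the word, so prefixes and
    # suffixes are distinguished from word-internal sequences.
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(marked) - n + 1):
            grams.append(marked[start:start + n])
    return grams


if __name__ == "__main__":
    # Swedish definite plural "katterna" shares many n-grams with "katt".
    print(char_ngrams("katt", 3, 3))
    print(char_ngrams("katterna", 3, 4))
```

Because inflected forms of a word share many of these n-grams, a model that represents a word by its n-grams can relate forms that a purely word-level model would treat as unrelated, which is one plausible reason the technique helps for morphologically rich languages.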

Place, publisher, year, edition, pages
Luleå: Luleå University of Technology, 2021. p. 93
Keywords
Word vectors, NLP, Neural networks, Embeddings
National category
Language Processing and Computational Linguistics
Research subject
Machine Learning
Identifiers
urn:nbn:se:ltu:diva-83578 (URN)
978-91-7790-810-4 (ISBN)
978-91-7790-811-1 (ISBN)
Presentation
2021-05-26, A109, LTU, Luleå, 09:00 (English)
Opponent
Supervisors
Available from: 2021-04-12 Created: 2021-04-10 Last updated: 2025-10-21 Bibliographically approved

Open Access in DiVA

fulltext (98 kB), 640 downloads
File information
File name: FULLTEXT01.pdf
File size: 98 kB
Checksum (SHA-512):
4b9d6c895c2d08e5b922382e7fc8d05116dfe4776872374facb5151707952a457de2f101d0848144c885542a7cb209bffd06c79fba28596fe6e6253eab63540e
Type: fulltext
Mimetype: application/pdf

Other links

https://arxiv.org/pdf/2011.07605.pdf

Search in DiVA

By author/editor
Adewumi, Tosin P.; Liwicki, Foteini; Liwicki, Marcus
By organisation
EISLAB
Computer Sciences

Total: 642 downloads
The number of downloads is the sum of downloads for all full texts. It may include, for example, earlier versions that are no longer available.

Total: 398 hits