Word Vector Representations using Shallow Neural Networks
Adewumi, Oluwatosin. Luleå University of Technology, Department of Computer Science, Electrical and Space Engineering, EISLAB (Machine Learning). ORCID iD: 0000-0002-5582-2031
2021 (English). Licentiate thesis, comprehensive summary (Other academic)
Abstract [en]

This work highlights some important factors to consider when developing word vector representations and data-driven conversational systems. Neural network methods for creating word embeddings have gained more prominence than their older, count-based counterparts. However, challenges remain, such as prolonged training time and the need for more data, especially with deep neural networks. Shallow neural networks have the advantage of lower complexity; however, they also face challenges, such as sub-optimal combinations of hyper-parameters, which produce sub-optimal models. This work therefore investigates the following research questions: "How much do hyper-parameters influence the performance of word embeddings?" and "What factors are important for developing ethical and robust conversational systems?" To answer these questions, various experiments were conducted on different datasets across several studies. The first study empirically investigates various hyper-parameter combinations for creating word vectors and their impact on two natural language processing (NLP) downstream tasks: named entity recognition (NER) and sentiment analysis (SA). It shows that the optimal performance of embeddings for downstream NLP tasks depends on the task at hand, and that certain combinations give strong performance across the tasks chosen for the study. Furthermore, it shows that reasonably smaller corpora are sufficient, or in some cases even produce better models, while taking less time to train and load. This matters especially now that environmental considerations play a prominent role in ethical research. Subsequent studies build on the findings of the first and explore hyper-parameter combinations for Swedish and English embeddings on the downstream NER task. The second study presents a new Swedish analogy test set for the evaluation of Swedish embeddings, and shows that character n-grams are useful for Swedish, a morphologically rich language. The third study shows that broad topic coverage in a corpus appears to be important for producing better embeddings, and that noise may be helpful in certain instances, though it is generally harmful. Hence, a relatively smaller corpus can outperform a larger one, as demonstrated with the smaller Swedish Wikipedia corpus against the Swedish Gigaword corpus. The final study, answering the second question from the point of view of the philosophy of science, argues that the near-elimination of unwanted bias in training data, and the use of fora such as peer review, conferences, and journals to provide the necessary avenues for criticism and feedback, are instrumental to the development of ethical and robust conversational systems.
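
As a concrete illustration of the hyper-parameter search described above, the following sketch trains gensim Word2Vec models over a small grid of combinations. It is a minimal sketch under assumed settings: the toy corpus, grid values, and file names are illustrative, not the thesis's actual configuration.

    from itertools import product

    from gensim.models import Word2Vec

    # Toy corpus standing in for the much larger corpora used in the studies.
    sentences = [["the", "cat", "sat", "on", "the", "mat"],
                 ["the", "dog", "barked", "at", "the", "cat"]] * 200

    # Grid over architecture (CBOW vs skip-gram), training algorithm
    # (negative sampling vs hierarchical softmax), context window and
    # dimensionality: the kinds of combinations investigated empirically.
    for sg, hs, window, dim in product([0, 1], [0, 1], [4, 8], [100, 300]):
        model = Word2Vec(sentences,
                         sg=sg,                    # 0 = CBOW, 1 = skip-gram
                         hs=hs,                    # 1 = hierarchical softmax
                         negative=0 if hs else 5,  # disable one algorithm to isolate the other
                         window=window,
                         vector_size=dim,
                         min_count=1,
                         epochs=5,
                         workers=4)
        model.save(f"w2v_sg{sg}_hs{hs}_w{window}_d{dim}.model")

Each saved model can then be scored on intrinsic and downstream tasks to identify the combinations that perform well across tasks.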

Place, publisher, year, edition, pages
Luleå: Luleå University of Technology, 2021. p. 93
Keywords [en]
Word vectors, NLP, Neural networks, Embeddings
National subject category
Natural Language Processing and Computational Linguistics
Research subject
Machine Learning
Identifiers
URN: urn:nbn:se:ltu:diva-83578
ISBN: 978-91-7790-810-4 (print)
ISBN: 978-91-7790-811-1 (digital)
OAI: oai:DiVA.org:ltu-83578
DiVA id: diva2:1543295
Presentation
2021-05-26, A109, LTU, Luleå, 09:00 (English)
Available from: 2021-04-12. Created: 2021-04-10. Last updated: 2025-02-07. Bibliographically approved.
List of papers
1. Word2Vec: Optimal Hyper-Parameters and Their Impact on NLP Downstream Tasks
(English). Manuscript (preprint) (Other academic)
Abstract [en]

Word2Vec is a prominent model for natural language processing (NLP) tasks, and similar inspiration is found in the distributed embeddings of new state-of-the-art (SotA) deep neural networks. However, the wrong combination of hyper-parameters can produce poor-quality vectors. The objective of this work is to show empirically that an optimal combination of hyper-parameters exists, and to evaluate various combinations, comparing them with the released, pre-trained original word2vec model. Both intrinsic and extrinsic (downstream) evaluations were carried out, including named entity recognition (NER) and sentiment analysis (SA). The downstream tasks reveal that the best model is usually task-specific, that high analogy scores do not necessarily correlate positively with F1 scores, and that the same applies to focusing on data alone. Increasing vector dimension size beyond a point leads to poor quality or performance. If ethical considerations to save time, energy and the environment are made, then reasonably smaller corpora may do just as well, or even better in some cases. Moreover, using a small corpus, we obtain better human-assigned WordSim scores, corresponding Spearman correlation and better downstream performance (with significance tests) compared to the original model, which was trained on a 100-billion-word corpus.
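
Below is a minimal sketch of the intrinsic evaluations mentioned above (analogy scores and human-assigned WordSim similarity with Spearman correlation), using gensim's built-in scorers and the small test files it ships with. The model file name is a hypothetical placeholder.

    from gensim.models import Word2Vec
    from gensim.test.utils import datapath

    # Hypothetical path to a trained model.
    model = Word2Vec.load("w2v_sg1_hs0_w8_d300.model")

    # Analogy accuracy on the Google questions-words set bundled with gensim.
    analogy_score, _sections = model.wv.evaluate_word_analogies(
        datapath("questions-words.txt"))

    # WordSim-353: Pearson, Spearman and the out-of-vocabulary ratio.
    pearson, spearman, oov_ratio = model.wv.evaluate_word_pairs(
        datapath("wordsim353.tsv"))

    print(f"analogy accuracy:     {analogy_score:.3f}")
    print(f"WordSim-353 Spearman: {spearman[0]:.3f} (p = {spearman[1]:.2g})")
    print(f"OOV ratio:            {oov_ratio:.1f}%")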

Keywords
Word2Vec, NLP, Named Entity Recognition, Sentiment Analysis, Hyperparameters
National subject category
Natural Language Processing and Computational Linguistics
Research subject
Machine Learning
Identifiers
urn:nbn:se:ltu:diva-80620 (URN)
Research funder
Vinnova, 2019-02996
Available from: 2020-08-31. Created: 2020-08-31. Last updated: 2025-02-07.
2. Exploring Swedish & English fastText Embeddings for NER with the Transformer
(English). Manuscript (preprint) (Other academic)
Abstract [en]

In this paper, our main contributions are to show that embeddings from relatively smaller corpora can outperform those from far larger corpora, and to present a new Swedish analogy test set. To achieve good network performance in natural language processing (NLP) downstream tasks, several factors play important roles: dataset size, the right hyper-parameters, and well-trained embeddings. We show that, with the right set of hyper-parameters, good network performance can be reached even on smaller datasets. We evaluate the embeddings at both the intrinsic and extrinsic levels, deploying them with the Transformer on a named entity recognition (NER) task, and conduct significance tests. This is done for both Swedish and English. We obtain better performance in both languages on the downstream task with far smaller training data, compared to recently released Common Crawl versions, and character n-grams appear useful for Swedish, a morphologically rich language.
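
The character n-gram finding can be made concrete with gensim's fastText implementation, which builds word vectors from subword n-grams. This is a minimal sketch under assumed settings; the tiny Swedish corpus and the n-gram range are illustrative only.

    from gensim.models import FastText

    # Tiny illustrative corpus; the paper trains on far larger Swedish data.
    sentences = [["kungen", "bor", "i", "stockholm"],
                 ["drottningen", "bor", "i", "slottet"]] * 100

    model = FastText(sentences,
                     sg=1,           # skip-gram
                     vector_size=100,
                     window=5,
                     min_count=1,
                     min_n=3,        # shortest character n-gram
                     max_n=6,        # longest character n-gram
                     epochs=10)

    # Subword n-grams yield vectors even for unseen inflected forms,
    # which helps for a morphologically rich language like Swedish.
    print(model.wv.most_similar("kungens", topn=3))  # form never seen in training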

Keywords
Embeddings, Transformer, Analogy, Dataset, NER, Swedish
National subject category
Natural Language Processing and Computational Linguistics
Research subject
Machine Learning
Identifiers
urn:nbn:se:ltu:diva-80622 (URN)
Research funder
Vinnova, 2019-02996
Available from: 2020-08-31. Created: 2020-08-31. Last updated: 2025-02-07.
3. Corpora Compared: The Case of the Swedish Gigaword & Wikipedia Corpora
2020 (English). Conference paper, Published paper (Refereed)
Abstract [en]

In this work, we show that the difference in performance between embeddings from differently sourced data for a given language can be due to factors other than data size. Natural language processing (NLP) tasks usually perform better with embeddings from bigger corpora; however, the broadness of the domains covered and the amount of noise can also play important roles. We evaluate embeddings based on two Swedish corpora, the Gigaword and Wikipedia corpora, in (intrinsic) analogy tests, and discover that the embeddings from the Wikipedia corpus generally outperform those from the Gigaword corpus, even though the latter is bigger. Downstream tests will be required for a definitive evaluation.
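
A sketch of this comparison, assuming two gensim models trained on the respective corpora and an analogy file in gensim's questions-words format; all file names here are hypothetical placeholders.

    from gensim.models import Word2Vec

    # Hypothetical model files, one per source corpus.
    models = {"Gigaword":  Word2Vec.load("sv_gigaword.model"),
              "Wikipedia": Word2Vec.load("sv_wikipedia.model")}

    for name, model in models.items():
        # The same Swedish analogy file for both, so the scores are comparable.
        score, _ = model.wv.evaluate_word_analogies("swedish_analogies.txt")
        print(f"{name}: analogy accuracy {score:.3f}")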

Place, publisher, year, edition, pages
Göteborgs universitet, 2020
Keywords
Swedish, Wikipedia, Gigaword, NLP
National subject category
Computer Sciences
Research subject
Machine Learning
Identifiers
urn:nbn:se:ltu:diva-81567 (URN)
Conference
The Eighth Swedish Language Technology Conference (SLTC), 25-27 November, 2020, Online
Research funder
Vinnova, 2019-02996
Available from: 2020-11-23. Created: 2020-11-23. Last updated: 2022-10-28. Bibliographically approved.
4. The Challenge of Diacritics in Yorùbá Embeddings
2020 (English). In: ML4D 2020 Proceedings / [ed] Tejumade Afonja; Konstantin Klemmer; Aya Salama; Paula Rodriguez Diaz; Niveditha Kalavakonda; Oluwafemi Azeez. Neural Information Processing Systems Foundation, 2020, article id 2011.07605. Conference paper, Published paper (Refereed)
Abstract [en]

The major contributions of this work are the empirical establishment of better performance for Yoruba embeddings trained on an undiacritized (normalized) dataset, and the provision of new analogy sets for evaluation. The Yoruba language, being tonal, uses diacritics (tonal marks) in written form. We show that this affects embedding performance by creating embeddings from exactly the same Wikipedia dataset twice, with the second copy normalized to be undiacritized. We further compare average intrinsic performance with two other works (using an analogy test set and WordSim), and we obtain the best performance on WordSim and the corresponding Spearman correlation.
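
The normalization referred to above can be done with Unicode decomposition; the sketch below is one plausible implementation, since the abstract does not specify the exact procedure. Note that it removes all combining marks, including the Yoruba under-dots that distinguish letters such as ẹ from e.

    import unicodedata

    def strip_diacritics(text: str) -> str:
        """Remove combining marks (tone marks, under-dots) from text."""
        decomposed = unicodedata.normalize("NFD", text)  # split base chars and marks
        stripped = "".join(ch for ch in decomposed
                           if not unicodedata.combining(ch))
        return unicodedata.normalize("NFC", stripped)    # recompose what remains

    print(strip_diacritics("Yorùbá èdè"))  # prints "Yoruba ede"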

Place, publisher, year, edition, pages
Neural Information Processing Systems Foundation, 2020
Keywords
Yoruba, NLP, Diacritics, Embeddings
National subject category
Computer Sciences
Research subject
Machine Learning
Identifiers
urn:nbn:se:ltu:diva-81569 (URN)
Conference
Workshop: Machine Learning for the Developing World (ML4D) at the 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Online, December 12, 2020
Research funder
Vinnova, 2019-02996
Available from: 2020-11-23. Created: 2020-11-23. Last updated: 2022-10-28. Bibliographically approved.
5. Conversational Systems in Machine Learning from the Point of View of the Philosophy of Science—Using Alime Chat and Related Studies
2019 (English). In: Philosophies, ISSN 2409-9287, Vol. 4, no. 3, article id 41. Article in journal (Refereed). Published
Abstract [en]

This essay discusses current research efforts in conversational systems from the point of view of the philosophy of science, and evaluates some conversational systems research activities from the standpoint of the philosophical theory of naturalism. Conversational systems, or chatbots, have advanced over the decades and have now become mainstream applications: software that users can communicate with using natural language. Particular attention is given to the Alime Chat conversational system, already in industrial use, and the related research. The competitive nature of systems in production is a result of different researchers and developers trying to produce new conversational systems that can outperform previous or state-of-the-art systems. Different factors affect the quality of the conversational systems produced, and how one system is assessed as being better than another is a function of objectivity and of the relevant experimental results. This essay examines the research practices from, among others, Longino's view on objectivity and Popper's stand on falsification. Furthermore, the need for qualitative and large datasets is emphasized, in addition to the importance of the peer-review process in scientific publishing as a means of developing, validating, or rejecting theories, claims, or methodologies in the research community. In conclusion, open data and open scientific discussion fora should become more prominent than the mere publication-focused trend.

Place, publisher, year, edition, pages
Switzerland: MDPI, 2019
Keywords
conversational systems, chatbots, philosophy of science, objectivity, verification, falsification
National subject category
Computer Sciences
Research subject
Machine Learning
Identifiers
urn:nbn:se:ltu:diva-75430 (URN)
10.3390/philosophies4030041 (DOI)
000613786500007 ()
2-s2.0-85094795286 (Scopus ID)
Note

Validated; 2019; Level 1; 2019-09-18 (marisr)

Available from: 2019-08-08. Created: 2019-08-08. Last updated: 2022-10-28. Bibliographically approved.

Open Access in DiVA

fulltext (PDF, 1588 kB)