Digitala Vetenskapliga Arkivet

Principal Word Vectors
Uppsala universitet, Humanistisk-samhällsvetenskapliga vetenskapsområdet, Språkvetenskapliga fakulteten, Institutionen för lingvistik och filologi. (Computational Linguistics)
2018 (English) Doctoral thesis, monograph (Other academic)
Abstract [en]

Word embedding is a technique for associating the words of a language with real-valued vectors, enabling us to use algebraic methods to reason about their semantic and grammatical properties. This thesis introduces a word embedding method called principal word embedding, which makes use of principal component analysis (PCA) to train a set of word embeddings for the words of a language. The principal word embedding method involves performing a PCA on a data matrix whose elements are the frequencies with which words occur in different contexts. We address two challenges that arise in the application of PCA to create word embeddings. The first challenge is related to the size of the data matrix on which PCA is performed and affects the efficiency of the word embedding method. The data matrix is usually a large matrix that requires a very large amount of memory and CPU time to be processed. The second challenge is related to the distribution of word frequencies in the data matrix and affects the quality of the word embeddings. We provide an extensive study of the distribution of the elements of the data matrix and show that it is unsuitable for PCA in its unmodified form.
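The basic pipeline described above can be sketched in a few lines: count word-context co-occurrences into a data matrix, centre it, and project it onto its top principal directions. This is an illustrative sketch only, not the thesis implementation; the toy corpus, the window size of 1, and the 2-dimensional output are all assumptions made for the example.

```python
import numpy as np

# Toy corpus; the actual method is trained on large corpora with
# richer context definitions (this sketch is illustrative only).
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
]

vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# Data matrix C: C[i, j] = frequency of word i occurring next to
# word j within a symmetric window of size 1.
C = np.zeros((len(vocab), len(vocab)))
for sent in corpus:
    for k, w in enumerate(sent):
        for off in (-1, 1):
            if 0 <= k + off < len(sent):
                C[idx[w], idx[sent[k + off]]] += 1

# PCA: centre the columns of the data matrix, then project each word's
# row onto the top principal directions from the SVD of the centred matrix.
C_centred = C - C.mean(axis=0)
U, S, Vt = np.linalg.svd(C_centred, full_matrices=False)
dim = 2  # embedding dimensionality, chosen arbitrarily here
embeddings = C_centred @ Vt[:dim].T  # one 2-d vector per vocabulary word

print(embeddings.shape)
```

On realistic corpora the matrix `C` is huge and very sparse, which is exactly why the plain SVD call above becomes the bottleneck the thesis addresses.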

We overcome the two challenges in principal word embedding by using a generalized PCA method. The problem with the size of the data matrix is mitigated by a randomized singular value decomposition (SVD) procedure, which improves the performance of PCA on the data matrix. The data distribution is reshaped by an adaptive transformation function, which makes it more suitable for PCA. These techniques, together with a weighting mechanism that generalizes many different weighting and transformation approaches used in the literature, enable principal word embedding to train high-quality word embeddings in an efficient way.
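Randomized SVD of the kind mentioned above replaces a full decomposition with an SVD in a small random subspace. The following is a generic sketch of the technique (in the style of Halko et al.'s randomized range finder), not the author's implementation; the parameter names and defaults are assumptions.

```python
import numpy as np

def randomized_svd(A, k, n_oversamples=10, n_iter=4, seed=0):
    """Approximate rank-k SVD via random projection.

    Generic sketch of the randomized SVD technique; not the
    thesis implementation."""
    rng = np.random.default_rng(seed)
    # Sketch the range of A with a random Gaussian test matrix.
    Omega = rng.standard_normal((A.shape[1], k + n_oversamples))
    Y = A @ Omega
    # Power iterations sharpen the subspace when the spectrum decays slowly.
    for _ in range(n_iter):
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)
    # Solve a small SVD in the subspace, then lift back to the full space.
    B = Q.T @ A
    U_small, S, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small
    return U[:, :k], S[:k], Vt[:k]

A = np.random.default_rng(1).standard_normal((200, 50))
U, S, Vt = randomized_svd(A, k=5)
print(U.shape, S.shape, Vt.shape)
```

The cost is dominated by a handful of matrix-vector-style products with `A`, so for a sparse word-context matrix the method scales with the number of nonzeros rather than the full matrix size.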

We also provide a study on how principal word embedding is connected to other word embedding methods. We compare it to a number of word embedding methods and study how the two challenges in principal word embedding are addressed in those methods. We show that the other word embedding methods are closely related to principal word embedding and, in many instances, they can be seen as special cases of it.

The principal word embeddings are evaluated in both intrinsic and extrinsic ways. The intrinsic evaluations are directed towards the study of the distribution of word vectors. The extrinsic evaluations measure the contribution of principal word embeddings to some standard NLP tasks. The experimental results confirm that the newly proposed features of principal word embedding (i.e., the randomized SVD algorithm, the adaptive transformation function, and the weighting mechanism) are beneficial to the method and lead to significant improvements in the results. A comparison between principal word embedding and other popular word embedding methods shows that, in many instances, the proposed method is able to generate word embeddings that are better than or as good as other word embeddings while being faster than several popular word embedding methods.
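A common form of intrinsic evaluation compares cosine similarities between word vectors: related words should lie closer in the embedding space than unrelated ones. The vectors below are made up purely for illustration; real evaluations use embeddings trained on large corpora and benchmark similarity datasets.

```python
import numpy as np

# Hypothetical 2-d embeddings, invented for this example.
vectors = {
    "cat": np.array([0.9, 0.1]),
    "dog": np.array([0.8, 0.2]),
    "car": np.array([0.1, 0.9]),
}

def cosine(u, v):
    # Cosine similarity: 1 for parallel vectors, 0 for orthogonal ones.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# A sanity check: semantically related words should score higher
# than unrelated ones.
print(cosine(vectors["cat"], vectors["dog"]),
      cosine(vectors["cat"], vectors["car"]))
```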

Place, publisher, year, edition, pages
Uppsala: Acta Universitatis Upsaliensis, 2018. p. 159
Series
Studia Linguistica Upsaliensia, ISSN 1652-1366 ; 22
Keywords [en]
word, context, word embedding, principal component analysis, PCA, sparse matrix, singular value decomposition, SVD, entropy
National subject category
General Language Studies and Linguistics; Language Technology (Computational Linguistics); Computer Systems
Identifiers
URN: urn:nbn:se:uu:diva-353866
ISBN: 978-91-513-0365-9 (print)
OAI: oai:DiVA.org:uu-353866
DiVA, id: diva2:1219609
Public defence
2018-09-08, Room 22-0008, Humanistiska teatern, 752 38 Uppsala, 09:00 (English)
Available from: 2018-08-14 Created: 2018-06-17 Last updated: 2023-03-13

Open Access in DiVA

Full text (1305 kB), 637 downloads
File information
File name: FULLTEXT01.pdf
File size: 1305 kB
Checksum (SHA-512): 070a2edac73998c276edbb7ee199420e7162ba4fd0e18c0023e874c62af7e3c55becb973b59a1a1d3c300dc78a46452c802d3c04879a969066245aaa1feca8a5
Type: fulltext
MIME type: application/pdf

Author: Basirat, Ali
Organisation: Institutionen för lingvistik och filologi
