From raw text to Universal Dependencies: look, no tags!
Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology. (Computational Linguistics) ORCID iD: 0000-0001-8844-2126
Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology. (Computational Linguistics)
Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology. (Computational Linguistics)
Bar-Ilan University.
2017 (English). In: Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Vancouver, Canada: Association for Computational Linguistics, 2017, p. 207-217. Conference paper, Published paper (Refereed)
Abstract [en]

We present the Uppsala submission to the CoNLL 2017 shared task on parsing from raw text to universal dependencies. Our system is a simple pipeline consisting of two components. The first performs joint word and sentence segmentation on raw text; the second predicts dependency trees from raw words. The parser bypasses the need for part-of-speech tagging, but uses word embeddings based on universal tag distributions. We achieved a macroaveraged LAS F1 of 65.11 in the official test run and obtained the 2nd best result for sentence segmentation with a score of 89.03. After fixing two bugs, we obtained an unofficial LAS F1 of 70.49.
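As a rough illustration of the headline metric, the sketch below computes a per-treebank LAS F1 and then macro-averages it, so every treebank counts equally regardless of size. It assumes the usual shared-task definition, in which a system word is correct when it is aligned to a gold word and has the same head and relation. The names `TreebankCounts`, `las_f1`, and `macro_las_f1` are hypothetical, and the counts in the usage note are toy values, not results from the paper.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class TreebankCounts:
    """Hypothetical container for one treebank's evaluation counts."""
    correct: int       # system words with the correct head and relation
    system_words: int  # words produced by the system's segmentation
    gold_words: int    # words in the gold-standard treebank


def las_f1(c: TreebankCounts) -> float:
    """LAS F1 for one treebank: harmonic mean of precision and recall."""
    if c.correct == 0:
        return 0.0
    precision = c.correct / c.system_words
    recall = c.correct / c.gold_words
    return 2 * precision * recall / (precision + recall)


def macro_las_f1(counts: Iterable[TreebankCounts]) -> float:
    """Unweighted mean of per-treebank LAS F1 scores."""
    scores = [las_f1(c) for c in counts]
    if not scores:
        raise ValueError("need at least one treebank")
    return sum(scores) / len(scores)
```

Calling `macro_las_f1([TreebankCounts(8, 10, 10), TreebankCounts(0, 5, 5)])` averages a treebank scored at about 0.8 with one scored at 0.0. Because segmentation is predicted from raw text, `system_words` and `gold_words` can differ, which is why the metric is an F1 rather than a plain accuracy.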

Place, publisher, year, edition, pages
Vancouver, Canada: Association for Computational Linguistics, 2017. p. 207-217
Keyword [en]
dependency, parsing, multilingual, segmentation
National Category
Language Technology (Computational Linguistics)
Research subject
Computational Linguistics
Identifiers
URN: urn:nbn:se:uu:diva-333439
ISBN: 978-1-945626-70-8 (electronic)
OAI: oai:DiVA.org:uu-333439
DiVA, id: diva2:1156633
Conference
CoNLL 2017, August 3-4, 2017, Vancouver, Canada
Available from: 2017-11-13 Created: 2017-11-13 Last updated: 2018-04-10. Bibliographically approved
In thesis
1. Segmenting and Tagging Text with Neural Networks
2018 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Segmentation and tagging of text are important preprocessing steps for higher-level natural language processing tasks. In this thesis, we apply a sequence labelling framework based on neural networks to various segmentation and tagging tasks, including sentence segmentation, word segmentation, morpheme segmentation, joint word segmentation and part-of-speech tagging, and named entity transliteration. We apply a general neural CRF model to different tasks by designing specific tag sets. In addition, we explore effective ways of representing input characters, such as utilising concatenated n-grams and sub-character features, and use ensemble decoding to mitigate the effects of random parameter initialisation.

The segmentation and tagging models are evaluated in a truly multilingual setup with more than 70 datasets. The experimental results indicate that the proposed neural CRF model is effective for segmentation and tagging in general as state-of-the-art accuracies are achieved on datasets in different languages, genres, and annotation schemes for various tasks. For word segmentation, we propose several typological factors to statistically characterise the difficulties posed by different languages and writing systems. Based on this analysis, we apply language-specific settings to the segmentation system for higher accuracy. Our system achieves substantially better results on languages that are more difficult to segment when compared to previous work. Moreover, we investigate conventionally adopted evaluation metrics for segmentation tasks. We propose that precision should be excluded and using recall alone is more adequate for sentence segmentation and word segmentation. The segmentation and tagging tools implemented along with this thesis are publicly available as experimental frameworks for future development as well as preprocessing tools for higher-level NLP tasks.
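To make the precision and recall discussion concrete, here is a minimal sketch of span-based scoring for word segmentation. The record does not give the thesis's exact metric definitions, so this assumes the common convention that a predicted word is correct when its character span matches a gold span exactly. The function names `spans` and `segmentation_scores` are hypothetical.

```python
from typing import List, Set, Tuple


def spans(tokens: List[str]) -> Set[Tuple[int, int]]:
    """Map a token sequence to (start, end) character offsets."""
    out = set()
    pos = 0
    for tok in tokens:
        out.add((pos, pos + len(tok)))
        pos += len(tok)
    return out


def segmentation_scores(gold: List[str], pred: List[str]) -> Tuple[float, float]:
    """Return (precision, recall) over exactly matching character spans."""
    if "".join(gold) != "".join(pred):
        raise ValueError("gold and predicted tokens must cover the same characters")
    gold_spans, pred_spans = spans(gold), spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    return precision, recall
```

For gold tokens `["we", "like", "parsing"]` and a prediction `["we", "likeparsing"]`, only the span of "we" matches, so one of two predicted words and one of three gold words are counted as correct.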

Abstract [sv] (translated into English)

Segmentation and tagging of text are basic analysis steps that enable more advanced applications in language technology. In this thesis, a sequence analysis framework based on neural networks is applied to a number of segmentation and tagging problems, including sentence segmentation, word segmentation, morpheme segmentation, joint word segmentation and part-of-speech tagging, and transliteration of names. We apply a general neural CRF model to different problems by defining different tag sets. We explore different ways of representing the characters that form the input to the process, such as concatenated character sequences and graphical features, and we use ensemble decoding to mitigate the effect of random parameter initialisation.

The segmentation and tagging models are evaluated on several languages using more than 70 different datasets. The experimental results show that the proposed neural CRF model is effective for segmentation and tagging in general, with strong results across tasks, languages, genres, and annotation schemes. For word segmentation, we propose a number of typological factors that can be used to statistically analyse the challenges posed by different languages and writing systems. This analysis can then serve as the basis for language-specific settings that improve segmentation quality. Our system achieves substantially better results than previous methods on languages that are difficult to segment. Finally, we discuss evaluation methods for segmentation problems and propose that precision be excluded in favour of recall alone when evaluating sentence and word segmentation. The segmentation and tagging tools developed in connection with the thesis work are publicly available for further research and as basic analysis tools for more advanced language technology applications.

Place, publisher, year, edition, pages
Uppsala: Acta Universitatis Upsaliensis, 2018. p. 76
Series
Studia Linguistica Upsaliensia, ISSN 1652-1366 ; 21
Keyword
neural networks, sequence labelling, multilinguality, word segmentation, sentence segmentation, morpheme segmentation, transliteration, joint word segmentation and POS tagging
National Category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:uu:diva-348129 (URN)
978-91-513-0340-6 (ISBN)
Public defence
2018-06-09, Humanistiska teatern, Thunbergsvägen 3, Uppsala, 10:15 (English)
Available from: 2018-05-16 Created: 2018-04-10 Last updated: 2018-05-16

Open Access in DiVA

fulltext (184 kB), 11 downloads
File information
File name: FULLTEXT01.pdf
File size: 184 kB
Checksum (SHA-512): 7c7756b475ff1cc8b86486b16a7dbf9e7c371038302f3d280033598c642d7fe674e1e0a26d4d1ca3e90510050e9afdb4f9570399c27b0ce189b71a2ad1086bf0
Type: fulltext
Mimetype: application/pdf

Other links

http://universaldependencies.org/conll17/proceedings/pdf/K17-3022.pdf

Search in DiVA

By author/editor
de Lhoneux, Miryam; Yan, Shao; Basirat, Ali; Stymne, Sara; Nivre, Joakim
By organisation
Department of Linguistics and Philology
Language Technology (Computational Linguistics)
