Digitala Vetenskapliga Arkivet

Discourse in Statistical Machine Translation
Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology. (Datorlingvistik)
2014 (English) Doctoral thesis, monograph (Other academic)
Abstract [en]

This thesis addresses the technical and linguistic aspects of discourse-level processing in phrase-based statistical machine translation (SMT). Connected texts can have complex text-level linguistic dependencies across sentences that must be preserved in translation. However, the models and algorithms of SMT are pervaded by locality assumptions. In a standard SMT setup, no model has more complex dependencies than an n-gram model. The popular stack decoding algorithm exploits this fact to implement efficient search with a dynamic programming technique. This is a serious technical obstacle to discourse-level modelling in SMT.
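
The locality assumption can be made concrete with a small sketch. The following Python fragment (illustrative only, not taken from the thesis) shows why an n-gram language model cannot react to anything outside a window of the n-1 preceding words: the score of each word is computed from that window alone, so a dependency on an earlier sentence never enters the computation. It is exactly this property that lets a stack decoder recombine partial hypotheses ending in the same n-1 words.

    import math

    class NgramLM:
        """Toy n-gram language model with fixed log-probabilities."""

        def __init__(self, n, logprobs, floor=math.log(1e-6)):
            self.n = n                  # order of the model
            self.logprobs = logprobs    # dict: (context_tuple, word) -> log prob
            self.floor = floor          # fallback score for unseen events

        def score(self, words):
            total = 0.0
            for i, w in enumerate(words):
                # Only the previous n-1 words are visible; anything earlier,
                # such as the antecedent of a pronoun in a previous sentence,
                # cannot influence the score.
                context = tuple(words[max(0, i - self.n + 1):i])
                total += self.logprobs.get((context, w), self.floor)
            return total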

From a technical viewpoint, the main contribution of our work is the development of a document-level decoder based on stochastic local search that translates a complete document as a single unit. The decoder starts with an initial translation of the document, created randomly or by running a stack decoder, and refines it with a sequence of elementary operations. After each step, the current translation is scored by a set of feature models with access to the full document context and its translation. We demonstrate the viability of this decoding approach for different document-level models.
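
A minimal sketch of this decoding loop is given below (names, operations, and the acceptance rule are illustrative assumptions, not the implementation described in the thesis). The essential point is that each candidate translation of the whole document is scored by feature models that may inspect any part of the document and its translation, so discourse-level features fit naturally into the search.

    import random

    def local_search_decode(document, initial_translation, feature_models,
                            operations, max_steps=100_000, seed=0):
        """Hill-climbing over a translation of an entire document."""
        rng = random.Random(seed)
        current = initial_translation
        current_score = sum(m.score(document, current) for m in feature_models)
        for _ in range(max_steps):
            op = rng.choice(operations)                # e.g. re-translate one phrase
            candidate = op.apply(document, current)    # small change to one sentence
            score = sum(m.score(document, candidate) for m in feature_models)
            if score > current_score:                  # keep only improvements
                current, current_score = candidate, score
        return current

A practical search of this kind might also use an acceptance rule that occasionally admits worse candidates (for example simulated annealing) to escape local optima; such refinements are omitted from the sketch.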

From a linguistic viewpoint, we focus on the problem of translating pronominal anaphora. After investigating the properties and challenges of the pronoun translation task both theoretically and by studying corpus data, a neural network model for cross-lingual pronoun prediction is presented. This network jointly performs anaphora resolution and pronoun prediction and is trained on bilingual corpus data only, with no need for manual coreference annotations. The network is then integrated as a feature model in the document-level SMT decoder and tested in an English–French SMT system. We show that the pronoun prediction network model more adequately represents discourse-level dependencies for less frequent pronouns than a simpler maximum entropy baseline with separate coreference resolution.
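
To illustrate how such a prediction model can act as a document-level feature, the sketch below (again illustrative; the accessors and the prediction function are hypothetical) rewards candidate translations whose pronouns agree with the distribution predicted from the source context and the translations of potential antecedents.

    import math

    class PronounPredictionFeature:
        """Document-level feature wrapping a cross-lingual pronoun predictor."""

        def __init__(self, predict_logprobs, floor=math.log(1e-6)):
            # predict_logprobs(document, translation, position) returns a dict:
            #   target pronoun form -> log P(form | source context, antecedents)
            self.predict_logprobs = predict_logprobs
            self.floor = floor

        def score(self, document, translation):
            total = 0.0
            for pos in document.source_pronoun_positions():     # hypothetical accessor
                dist = self.predict_logprobs(document, translation, pos)
                produced = translation.aligned_target_word(pos)  # hypothetical accessor
                total += dist.get(produced, self.floor)
            return total

In this illustration, an instance of the class could be passed directly in the feature_models list of the decoding sketch above, which is how a discourse-level model would be hooked into document-level search.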

By creating a framework for experimenting with discourse-level features in SMT, this work contributes to a long-term perspective that strives for more thorough modelling of complex linguistic phenomena in translation. Our results on pronoun translation shed new light on a challenging, but essential problem in machine translation that is as yet unsolved.

Place, publisher, year, edition, pages
Uppsala: Acta Universitatis Upsaliensis, 2014. p. 185
Series
Studia Linguistica Upsaliensia, ISSN 1652-1366 ; 15
Keywords [en]
Statistical machine translation, Discourse-level machine translation, Document decoding, Local search, Pronominal anaphora, Pronoun translation, Neural networks
National Category
Language Technology (Computational Linguistics)
Research subject
Computational Linguistics
Identifiers
URN: urn:nbn:se:uu:diva-223798
ISBN: 978-91-554-8963-2 (print)
OAI: oai:DiVA.org:uu-223798
DiVA, id: diva2:714202
Public defence
2014-06-14, Universitetshuset, Sal X, Uppsala, 10:15 (English)
Opponent
Supervisors
Available from: 2014-05-23 Created: 2014-04-25 Last updated: 2023-03-13 Bibliographically approved

Open Access in DiVA

fulltext (1638 kB), 4135 downloads
File name: FULLTEXT01.pdf
File size: 1638 kB
Checksum (SHA-512): fd3fdd224bb3de8894e164971a87b11fee370ce60c3ce89100df85dbe87d83d17cf5eaf89c40d451366985efce284c4327fd2d8a0903fde292ce5892cba0892f
Type: fulltext
Mimetype: application/pdf
errata (259 kB), 167 downloads
File name: ERRATA01.pdf
File size: 259 kB
Checksum (SHA-512): c4d53624bdbe2b564dda6cbed27af78bd2d6db5d34a1c7d8fb7a3dd117a50d2d6b6763f6a188393c1f23cd126209fea27c8157fedd412244f8595e1a468dc478
Type: errata
Mimetype: application/pdf

Search in DiVA

By author/editor
Hardmeier, Christian
By organisation
Department of Linguistics and Philology
Language Technology (Computational Linguistics)

Total: 4138 downloads
The number of downloads is the sum of all downloads of full texts. It may include, for example, previous versions that are no longer available.
