Rule-Based Normalisation of Historical Text – a Diachronic Study
Pettersson, Eva; Megyesi, Beata; Nivre, Joakim
Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology. (Computational linguistics)
2012 (English). In: Empirical Methods in Natural Language Processing: Proceedings of the Conference on Natural Language Processing 2012 (KONVENS 2012), Vienna, Austria, September 19-21, 2012 / [ed] Jeremy Jancsary, Wien: Österreichische Gesellschaft für Artificial Intelligence (ÖGAI), 2012, p. 333-341. Conference paper, Oral presentation only (Refereed)
Abstract [en]

Language technology tools can be very useful for making information concealed in historical documents more easily accessible to historians, linguists and other researchers in humanities. For many languages, there is however a lack of linguistically annotated historical data that could be used for training NLP tools adapted to historical text. One way of avoiding the data sparseness problem in this context is to normalise the input text to a more modern spelling, before applying NLP tools trained on contemporary corpora. In this paper, we explore the impact of a set of hand-crafted normalisation rules on Swedish texts ranging from 1527 to 1812. Normalisation accuracy as well as tagging and parsing performance are evaluated. We show that, even though the rules were generated on the basis of one 17th century text sample, the rules are applicable to all texts, regardless of time period and text genre. This clearly indicates that spelling correction is a useful strategy for applying contemporary NLP tools to historical text.

Place, publisher, year, edition, pages
Wien: Österreichische Gesellschaft für Artificial Intelligence (ÖGAI), 2012. p. 333-341
Series
Schriftenreihe der Österreichischen Gesellschaft für Artificial Intelligence, 5
Keyword [en]
automatic processing of historical texts
National Category
Humanities
Research subject
Computational Linguistics
Identifiers
URN: urn:nbn:se:uu:diva-189424
OAI: oai:DiVA.org:uu-189424
DiVA: diva2:581395
Conference
LThist 2012 – First International Workshop on Language Technology for Historical Text(s), 11th Conference on Natural Language Processing (KONVENS 2012), September 19-21, 2012, Vienna, Austria
Available from: 2013-01-10. Created: 2013-01-01. Last updated: 2017-01-25. Bibliographically approved.

Open Access in DiVA

fulltext (304 kB), 144 downloads
File information
File name: FULLTEXT02.pdf
File size: 304 kB
Checksum (SHA-512): d1deb851e9daf74eacc183d626c6c9034990d3b08870a876379746157135ff230399aacd74ea7541cfd51584ceac0f8edb2d58a90390b877cdd7fbf996ef2dde
Type: fulltext
Mimetype: application/pdf

Other links

http://www.oegai.at/konvens2012/proceedings/50_pettersson12w/

Search in DiVA

By author/editor
Pettersson, Eva; Megyesi, Beata; Nivre, Joakim
By organisation
Department of Linguistics and Philology
Humanities
