Rule-Based Normalisation of Historical Text – a Diachronic Study
2012 (English). In: Empirical Methods in Natural Language Processing: Proceedings of the Conference on Natural Language Processing 2012 (KONVENS 2012), Vienna, Austria, September 19-21, 2012 / [ed] Jeremy Jancsary, Wien: Österreichische Gesellschaft für Artificial Intelligence (ÖGAI), 2012, 333-341 p.
Conference paper, Presentation (Refereed)
Language technology tools can be very useful for making information concealed in historical documents more easily accessible to historians, linguists and other researchers in humanities. For many languages, there is however a lack of linguistically annotated historical data that could be used for training NLP tools adapted to historical text. One way of avoiding the data sparseness problem in this context is to normalise the input text to a more modern spelling, before applying NLP tools trained on contemporary corpora. In this paper, we explore the impact of a set of hand-crafted normalisation rules on Swedish texts ranging from 1527 to 1812. Normalisation accuracy as well as tagging and parsing performance are evaluated. We show that, even though the rules were generated on the basis of one 17th century text sample, the rules are applicable to all texts, regardless of time period and text genre. This clearly indicates that spelling correction is a useful strategy for applying contemporary NLP tools to historical text.
Place, publisher, year, edition, pages
Wien: Österreichische Gesellschaft für Artificial Intelligence (ÖGAI), 2012. 333-341 p.
Schriftenreihe der Österreichischen Gesellschaft für Artificial Intelligence, 5
automatic processing of historical texts
Research subject: Computational Linguistics
Identifiers
URN: urn:nbn:se:uu:diva-189424
OAI: oai:DiVA.org:uu-189424
DiVA: diva2:581395
LThist 2012 – First International Workshop on Language Technology for Historical Text(s), 11th Conference on Natural Language Processing (KONVENS 2012), September 19-21, 2012, Vienna, Austria