Named-entity recognition in Czech historical texts: Using a CNN-BiLSTM neural network model
Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
2019 (English). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis.
Abstract [en]

The thesis presents named-entity recognition in Czech historical newspapers from the Modern Access to Historical Sources Project. Our goal was to create a specific corpus and an annotation manual for the project and to evaluate neural network methods for named-entity recognition within this task. We created the corpus from scanned Czech historical newspapers. The scanned pages were converted to digitized text using optical character recognition (OCR). The data were preprocessed by removing some OCR errors. We also defined specific named entity types for our task and created an annotation manual with examples for the project. Based on that, we annotated the final corpus. To find the most suitable neural network model for our task, we experimented with different neural network architectures, namely long short-term memory (LSTM), bidirectional LSTM (BiLSTM) and CNN-BiLSTM models. Moreover, we experimented with randomly initialized word embeddings trained during the training process and with pretrained word embeddings for contemporary Czech published as open source by fastText. We achieved the best result, an F1 score of 0.444, using the CNN-BiLSTM model and the pretrained fastText word embeddings. We found that we do not need to normalize the spelling of our historical texts to bring it closer to contemporary language when using the neural network model. We also provide a qualitative analysis of the observed linguistic phenomena. We found that some word forms and pairs of words that were infrequent in our training data set were mistagged or not tagged at all. Based on that, we conclude that larger data sets could improve the results.
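The CNN-BiLSTM tagger described in the abstract combines character-level convolutional features with word embeddings and a bidirectional LSTM that predicts one NER tag per token. The following is a minimal illustrative sketch of such an architecture in Keras; it is not the thesis's implementation, and all layer sizes, vocabulary sizes and the tag count are assumptions chosen for the example. In the pretrained setting, the fastText vectors would be loaded into the word embedding layer instead of the random initialization shown here.

```python
# Minimal sketch of a CNN-BiLSTM sequence tagger (illustrative only, not the thesis code).
# A character-level CNN builds sub-word features per token; these are concatenated with
# word embeddings and fed to a BiLSTM with a per-token softmax over NER tags.
from tensorflow.keras import layers, models

MAX_SENT_LEN = 50    # tokens per sentence (assumption)
MAX_WORD_LEN = 20    # characters per token (assumption)
VOCAB_SIZE = 20000   # word vocabulary size (assumption)
CHAR_VOCAB = 100     # character vocabulary size (assumption)
NUM_TAGS = 9         # e.g. BIO tags for the defined entity types (assumption)

# Word-level input and embedding (could be initialized with pretrained fastText vectors).
word_in = layers.Input(shape=(MAX_SENT_LEN,), name="words")
word_emb = layers.Embedding(VOCAB_SIZE, 300)(word_in)

# Character-level input: one character sequence per token.
char_in = layers.Input(shape=(MAX_SENT_LEN, MAX_WORD_LEN), name="chars")
char_emb = layers.Embedding(CHAR_VOCAB, 30)(char_in)          # (batch, sent, word, 30)
char_cnn = layers.TimeDistributed(
    layers.Conv1D(filters=30, kernel_size=3, padding="same", activation="relu")
)(char_emb)                                                   # convolve over characters
char_feat = layers.TimeDistributed(layers.GlobalMaxPooling1D())(char_cnn)  # (batch, sent, 30)

# Concatenate word and character features, run a BiLSTM, predict a tag per token.
x = layers.Concatenate()([word_emb, char_feat])
x = layers.Bidirectional(layers.LSTM(100, return_sequences=True))(x)
out = layers.TimeDistributed(layers.Dense(NUM_TAGS, activation="softmax"))(x)

model = models.Model(inputs=[word_in, char_in], outputs=out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```

One motivation for the character-level CNN in this kind of setup is that it can capture sub-word patterns such as the spelling variation found in historical text, which is consistent with the finding that spelling normalization was not needed for the neural model.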

Place, publisher, year, edition, pages
2019, p. 37
Keywords [en]
named entity recognition, Czech historical texts, historical newspapers, CNN-BiLSTM neural network
National Category
Language Technology (Computational Linguistics)
Identifiers
URN: urn:nbn:se:uu:diva-385682
OAI: oai:DiVA.org:uu-385682
DiVA id: diva2:1325355
External cooperation
University of West Bohemia
Educational program
Master Programme in Language Technology
Available from: 2019-06-17. Created: 2019-06-15. Last updated: 2019-06-17. Bibliographically approved.

Open Access in DiVA

fulltext (883 kB), 55 downloads
File information
File name: FULLTEXT01.pdf. File size: 883 kB. Checksum (SHA-512):
4debed2004c45fbfe3ecf8fd884ad483fdd47fb365eb44be640df0ac332f261d6a58b558a2e59da699919f65e278a76124c07300acbc4493ebdd3ceb284029ab
Type: fulltext. Mimetype: application/pdf

Total: 55 downloads
The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.