Digitala Vetenskapliga Arkivet

Lost in Transcription: Evaluating Clustering and Few-Shot learning for transcription of historical ciphers
Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology.
2021 (English). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis
Abstract [en]

While Optical Character Recognition (OCR) techniques for printed documents have developed steadily, the tools that provide good-quality transcriptions of handwritten manuscripts through Handwritten Text Recognition (HTR) methods are still some steps behind. Focusing on historical ciphers (i.e. encrypted documents from the past with various types of symbol sets), this thesis examines the performance of two machine learning architectures developed within the DECRYPT project framework: a clustering-based unsupervised algorithm and a semi-supervised few-shot deep-learning model. Both models are tested on seen and unseen scribes to evaluate the difference in performance and the shortcomings of the two architectures, with the secondary goal of determining the influence of the datasets on performance. An in-depth analysis of the transcription results is performed, with particular focus on the Alchemic and Zodiac symbol sets and on model performance relative to character shape and size. The results show the promising performance of the few-shot architecture compared to the clustering algorithm, with average SER of 0.336 (0.15 and 0.104 on seen data / 0.754 on unseen data) and 0.596 (0.638 and 0.350 on seen data / 0.8 on unseen data), respectively.
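The averages quoted in the abstract can be checked with a few lines of arithmetic. The sketch below assumes that each reported average is the unweighted mean of the three per-dataset SER values (two seen, one unseen). The abstract does not state how the average was computed, so this is an inference from the numbers, and the names `reported` and `unweighted_average` are illustrative only.

```python
from statistics import mean

# SER values copied from the abstract: two seen-scribe datasets, one unseen.
# The dictionary layout and key names are assumptions made for this sketch.
reported = {
    "few_shot": {"seen": [0.15, 0.104], "unseen": [0.754], "average": 0.336},
    "clustering": {"seen": [0.638, 0.350], "unseen": [0.8], "average": 0.596},
}


def unweighted_average(scores):
    """Mean over all per-dataset SER values, seen and unseen weighted equally."""
    return round(mean(scores["seen"] + scores["unseen"]), 3)


for model, scores in reported.items():
    # Print the recomputed mean next to the average stated in the abstract.
    print(model, unweighted_average(scores), scores["average"])
```

Under this assumption both recomputed means agree with the stated averages. Note that the single unseen-scribe score carries a third of the weight in each average, so the gap between the models is wider on seen data than the averages alone suggest.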

Place, publisher, year, edition, pages
2021. p. 54
Keywords [en]
Image Recognition, Handwritten Text Recognition, HTR, Deep-learning, K-means clustering, NN, Neural Network, Few-Shot
National Category
Language Technology (Computational Linguistics)
Identifiers
URN: urn:nbn:se:uu:diva-460248
OAI: oai:DiVA.org:uu-460248
DiVA, id: diva2:1616686
Subject / course
Language Technology
Educational program
Master Programme in Language Technology
Supervisors
Examiners
Available from: 2021-12-04 Created: 2021-12-03 Last updated: 2021-12-04 Bibliographically approved

Open Access in DiVA

fulltext (12257 kB), 170 downloads
File information
File name: FULLTEXT01.pdf
File size: 12257 kB
Checksum (SHA-512): 9caaf98e9e536f50e6f57d0a708468e228fc981f239cde493fd25e0a885fb238b9f7366621a668d07334e7d44498538901584ed90ce5d2c3a2ce5fb0d3fa055b
Type: fulltext
Mimetype: application/pdf
