Interpreting the Script: Image Analysis and Machine Learning for Quantitative Studies of Pre-modern Manuscripts
Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Division of Visual Information and Interaction. Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computerized Image Analysis and Human-Computer Interaction. (q2b)
2017 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

The humanities have for a long time been a collection of fields that have not gained from the advancements in computational power predicted by Moore's law. Fields like medicine, biology, physics, chemistry, geology and economics have all developed quantitative tools that take advantage of the exponential increase of processing power over time. Recent advances in computerized pattern recognition, in combination with a rapid digitization of historical document collections around the world, are about to change this.

The first part of this dissertation focuses on constructing a full system for finding handwritten words in historical manuscripts. A novel segmentation algorithm is presented, capable of finding and separating text lines in pre-modern manuscripts.  Text recognition is performed by translating the image data of the text lines into sequences of numbers, called features. Commonly used features are analysed and evaluated on manuscript sources from the Uppsala University library Carolina Rediviva and the US Library of Congress.  Decoding the text in the vast number of photographed manuscripts from our libraries makes computational linguistics and social network analysis directly applicable to historical sources. Hence, text recognition is considered a key technology for the future of computerized research methods in the humanities.

The second part of this thesis addresses digital palaeography, using a computer's superior capacity for endlessly performing measurements on ink stroke shapes. Objective criteria for character shapes only partly capture what a palaeographer uses when assessing similarity. The palaeographer often gets a feel for the scribe's style. This is, however, hard to quantify. A method for identifying the scribal hands of a pre-modern copy of the Revelations of Saint Bridget of Sweden, using semi-supervised learning, is presented. Methods for production year estimation are presented and evaluated on a collection of close to 11000 medieval charters. The production dates are estimated using a Gaussian process, where the uncertainty is inferred together with the most likely production year.

In summary, this dissertation presents several novel methods related to image analysis and machine learning. In combination with recent advances in the field, they enable efficient computational analysis of very large collections of historical documents.

Place, publisher, year, edition, pages
Uppsala: Acta Universitatis Upsaliensis, 2017, p. 95
Series
Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology, ISSN 1651-6214 ; 1475
Keyword [en]
document analysis, machine learning, image analysis, digital humanities, document dating, writer identification, text recognition
National Category
Computer Vision and Robotics (Autonomous Systems)
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:uu:diva-314211; ISBN: 978-91-554-9814-6 (print); OAI: oai:DiVA.org:uu-314211; DiVA, id: diva2:1071556
Public defence
2017-03-24, Tidskriftläsesalen, Carolina rediviva, Dag Hammarskjölds väg 1, Uppsala, 10:15 (English)
Projects
q2b
Available from: 2017-03-02 Created: 2017-01-31 Last updated: 2018-01-13
List of papers
1. Data Mining Medieval Documents by Word Spotting
2011 (English) In: Proceedings of the 2011 Workshop on Historical Document Imaging and Processing, New York: ACM, 2011, p. 75-82. Conference paper, Published paper (Refereed)
Abstract [en]

This paper presents novel results for word spotting based on dynamic time warping applied to medieval manuscripts in Latin and Old Swedish. A target word is marked by a user, and the method automatically finds similar word forms in the document by matching them against the target. The method automatically identifies pages and lines. We show that our method improves accuracy compared to earlier proposals for this kind of handwriting. An advantage of the new method is that it performs matching within a text line without presupposing that the difficult problem of segmenting the text line into individual words has been solved. We evaluate our word spotting implementation on two medieval manuscripts representing two script types. We also show that it can be useful by helping a user find words in a manuscript and present graphs of word statistics as a function of page number.
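As an illustration of the matching step described in this abstract, the following is a minimal sketch of a dynamic time warping distance between two sequences of feature vectors. The function name, the squared Euclidean local cost and the toy data are assumptions made for the example; the abstract does not specify the paper's cost function, and the sketch compares two whole sequences rather than matching within an unsegmented text line as the paper does.

```python
from math import inf


def dtw_distance(a, b):
    """Dynamic time warping cost between two sequences of feature vectors."""

    def local_cost(u, v):
        # Squared Euclidean distance between two feature vectors (an assumption).
        return sum((x - y) ** 2 for x, y in zip(u, v))

    n, m = len(a), len(b)
    # cost[i][j] is the minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = local_cost(a[i - 1], b[j - 1])
            # Allowed moves: advance in a, advance in b, or advance in both.
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Hypothetical one-dimensional feature sequences: a query and two candidates.
query = [[0.0], [1.0], [2.0]]
stretched = [[0.0], [1.0], [1.0], [2.0]]  # same shape, written wider
different = [[2.0], [2.0], [0.0]]
ranked = sorted([stretched, different], key=lambda c: dtw_distance(query, c))
```

Because the warping path may repeat elements, the stretched candidate aligns with the query at zero cost and is ranked first, which is the property that makes this kind of matching tolerant to variation in letter width.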

Place, publisher, year, edition, pages
New York: ACM, 2011
National Category
Humanities and the Arts Natural Sciences Language Technology (Computational Linguistics)
Research subject
Computational Linguistics; Computerized Image Processing
Identifiers
urn:nbn:se:uu:diva-162428 (URN) 10.1145/2037342.2037355 (DOI) 978-1-4503-0916-5 (ISBN)
Conference
Workshop on Historical Document Imaging and Processing, 16-17 Sep 2011, Beijing, China
Available from: 2011-11-30 Created: 2011-11-30 Last updated: 2018-01-12 Bibliographically approved
2. Graph Based Line Segmentation on Cluttered Handwritten Manuscripts
2012 (English) In: Proceedings of the 21st International Conference on Pattern Recognition, 2012, IEEE, 2012, p. 1570-1573. Conference paper, Published paper (Refereed)
Abstract [en]

We propose a two-phase line segmentation method for handwritten, densely written pre-modern manuscripts. The proposed method combines the robustness of projection-based methods with the flexibility of graph-based methods. The result is a set of cut-outs of the image, each containing one text line. Overlapping characters, help lines and degradation can create foreground elements spanning several lines that are hard to separate. We treat the problem of finding a cut through the text line separation as a graph optimization problem, which allows for flexible separation of entangled components. The proposed method has been tested on two medieval sources with satisfying results. A comparison to similar methods, using standard metrics, is presented.

Place, publisher, year, edition, pages
IEEE, 2012
National Category
Computer Vision and Robotics (Autonomous Systems)
Identifiers
urn:nbn:se:uu:diva-188588 (URN) 978-1-4673-2216-4 (ISBN)
Conference
21st International Conference on Pattern Recognition (ICPR), 2012
Available from: 2012-12-17 Created: 2012-12-17 Last updated: 2018-01-11 Bibliographically approved
3. Feature Weight Optimization and Pruning in Historical Text Recognition
2013 (English) In: Advances of Visual Computing: 9th International Symposium, ISVC 2013, Rethymnon, Crete, Greece, July 29-31, 2013. Proceedings, Part II / [ed] George Bebis, Springer Berlin/Heidelberg, 2013, p. 98-107. Conference paper, Published paper (Refereed)
Abstract [en]

In handwritten text recognition, "sliding window" feature extraction represents the visual information contained in written text as feature vector sequences. In this paper, we explore the parameter space of feature weights in search of optimal weights and feature selection using the coordinate descent method. We report a gain of about 5% AUC performance. We use a public dataset for evaluation and also discuss the effects and limitations of "word pruning," a technique in word spotting that is commonly used to boost performance and save computational time.
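The weight search mentioned in this abstract can be sketched as a simple coordinate-wise search over a fixed set of candidate values. Everything in the block is an assumption for illustration: the function name, the candidate grid, and in particular the toy `score` function, which stands in for an evaluation measure such as AUC on a validation set and is not taken from the paper.

```python
def coordinate_descent(score, weights, candidates, sweeps=5):
    """Maximise score(weights) by changing one weight at a time."""
    best = list(weights)
    best_score = score(best)
    for _ in range(sweeps):
        improved = False
        for i in range(len(best)):
            # Try every candidate value for coordinate i, holding the others fixed.
            for value in candidates:
                trial = best[:i] + [value] + best[i + 1:]
                trial_score = score(trial)
                if trial_score > best_score:
                    best, best_score = trial, trial_score
                    improved = True
        if not improved:
            break  # a full sweep without improvement: stop early
    return best, best_score


def toy_score(w):
    # Hypothetical stand-in for a validation score; its maximum is at [1.0, 0.0, 0.5].
    target = [1.0, 0.0, 0.5]
    return -sum((x - t) ** 2 for x, t in zip(w, target))


weights, value = coordinate_descent(toy_score, [0.5, 0.5, 0.5], [0.0, 0.5, 1.0])
```

In this toy run the second weight ends at zero, which corresponds to dropping that feature altogether; this is one way a weight search can double as feature selection.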

Place, publisher, year, edition, pages
Springer Berlin/Heidelberg, 2013
Series
Lecture Notes in Computer Science, ISSN 0302-9743 ; 8034
Keyword
handwritten text recognition
National Category
Computer Vision and Robotics (Autonomous Systems)
Research subject
Computerized Image Analysis; Computerized Image Processing
Identifiers
urn:nbn:se:uu:diva-212536 (URN) 10.1007/978-3-642-41939-3_10 (DOI) 000335169000010 () 978-3-642-41939-3 (ISBN) 978-3-642-41938-6 (ISBN)
Conference
9th International Symposium, ISVC 2013, July 29-31, 2013, Rethymnon, Crete, Greece
Projects
From Quill to Bytes; q2b; q2b_vr2012
Funder
Swedish Research Council, 2012-5743
Available from: 2013-12-11 Created: 2013-12-11 Last updated: 2018-01-11 Bibliographically approved
4. Feature space denoising improves word spotting
2013 (English) In: Proc. 2nd International Workshop on Historical Document Imaging and Processing, New York: ACM Press, 2013, p. 59-66. Conference paper, Published paper (Refereed)
Abstract [en]

Some of the sliding window features commonly used in off-line handwritten text recognition are inherently noisy or sensitive to image noise. In this paper, we investigate the effects of several de-noising filters applied in the feature space and not in the image domain. The purpose is to target the intrinsic noise of these features, stemming from the complex shapes of handwritten characters. This noise is present even if the image has been captured without any kind of artefacts or noise. An evaluation, using a public database, is presented showing that the recognition of word-spotting can be improved considerably by using de-noising filters in the feature space.
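To make the idea of filtering in the feature space concrete, here is a minimal sketch that smooths each feature channel along the sliding-window axis. The abstract does not name the filters that were evaluated, so the choice of a median filter, the function name and the `radius` parameter are assumptions for illustration only.

```python
from statistics import median


def median_filter_features(sequence, radius=1):
    """Median-filter each feature channel along the window (horizontal) axis."""
    n = len(sequence)
    dims = len(sequence[0]) if n else 0
    filtered = []
    for t in range(n):
        # Neighbourhood of window positions around t, clipped at the sequence ends.
        lo, hi = max(0, t - radius), min(n, t + radius + 1)
        filtered.append(
            [median(sequence[k][d] for k in range(lo, hi)) for d in range(dims)]
        )
    return filtered


# Hypothetical single-channel feature sequence with one spurious spike.
noisy = [[1.0], [1.0], [9.0], [1.0], [1.0]]
smoothed = median_filter_features(noisy, radius=1)
```

The pixel data is never touched: the filter operates on the extracted feature sequence, so it can suppress spikes that come from the feature definition itself rather than from image noise.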

Place, publisher, year, edition, pages
New York: ACM Press, 2013
Keyword
OCR, handwritten text recognition, filtering
National Category
Computer Vision and Robotics (Autonomous Systems)
Research subject
Computerized Image Processing
Identifiers
urn:nbn:se:uu:diva-206930 (URN) 10.1145/2501115.2501118 (DOI) 978-1-4503-2115-0 (ISBN)
Conference
2nd International Workshop on Historical Document Imaging and Processing
Projects
q2b; q2b_vr2012
Funder
Swedish Research Council, 2012-5743
Available from: 2013-09-06 Created: 2013-09-06 Last updated: 2018-01-11 Bibliographically approved
5. Spotting words in medieval manuscripts
2014 (English) In: Studia Neophilologica, ISSN 0039-3274, E-ISSN 1651-2308, Vol. 86, p. 171-186. Article in journal (Refereed) Published
Abstract [en]

This article discusses the technology of handwritten text recognition (HTR) as a tool for the analysis of historical handwritten documents. We give a broad overview of this field of research, but the focus is on the use of a method called 'word spotting' for finding words directly and automatically in scanned images of manuscript pages. We illustrate and evaluate this method by applying it to a medieval manuscript. Word spotting uses digital image analysis to represent stretches of writing as sequences of numerical features. These are intended to capture the linguistically significant aspects of the visual shape of the writing. Two potential words can then be compared mathematically and their degree of similarity assigned a value. Our version of this method gives a false positive rate of about 30% when the true positive rate is close to 100%, for an application where we search for very frequent short words in a 16th-century Old Swedish cursiva recentior manuscript. Word spotting would be of use, e.g., to researchers who want to explore the content of manuscripts when editions or other transcriptions are unavailable.

National Category
Computer and Information Sciences General Language Studies and Linguistics Language Technology (Computational Linguistics)
Research subject
Computational Linguistics
Identifiers
urn:nbn:se:uu:diva-227725 (URN) 10.1080/00393274.2013.871975 (DOI) 000335850200012 ()
Available from: 2014-01-20 Created: 2014-06-30 Last updated: 2018-01-11 Bibliographically approved
6. Scribal Attribution using a Novel 3-D Quill-Curvature Feature Histogram
2014 (English) In: Proceedings International Conference on Frontiers in Handwriting Recognition (ICFHR), 2014, 2014. Conference paper, Published paper (Refereed)
Abstract [en]

In this paper, we propose a novel pipeline for automated scribal attribution based on the Quill feature: 1) We compensate the Quill feature histogram for pen changes and page warping. 2) We add curvature as a third dimension in the feature histogram, to better separate characteristics like loops and lines. 3) We also investigate the use of several dissimilarity measures between the feature histograms. 4) We propose and evaluate semi-supervised learning for classification, to reduce the need of labeled samples. Our evaluation is performed on 1104 pages from a 15th century Swedish manuscript. It was chosen because it represents a significant part of Swedish manuscripts of said period. Our results show that only a few percent of the material need labelling for average precisions above 95%. Our novel curvature and registration extensions, together with semi-supervised learning, outperformed the current Quill feature.
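Point 3 of this abstract, comparing feature histograms with a dissimilarity measure, can be illustrated with a small sketch. The chi-square distance is one common choice and is an assumption here, since the abstract does not list the measures that were investigated. The scribe labels and histograms are invented toy data, and the nearest-neighbour assignment is a plain supervised stand-in, not the semi-supervised scheme proposed in the paper.

```python
def normalise(hist):
    """Scale a histogram so that its bins sum to one."""
    total = sum(hist)
    return [h / total for h in hist] if total else list(hist)


def chi_square(p, q):
    """Chi-square dissimilarity between two normalised histograms."""
    # Bins that are empty in both histograms contribute nothing and are skipped.
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(p, q) if a + b > 0)


def nearest_scribe(page_hist, labelled):
    """Return the label whose reference histogram is closest to the page."""
    page = normalise(page_hist)
    return min(labelled, key=lambda name: chi_square(page, normalise(labelled[name])))


# Hypothetical flattened feature histograms for two labelled scribal hands.
references = {"hand A": [4, 1, 0], "hand B": [0, 1, 4]}
guess = nearest_scribe([8, 2, 0], references)
```

Normalising first makes the comparison independent of how much ink a page contains, so a densely written page and a sparse one by the same hand can still be close.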

Keyword
writer identification; semi-supervised learning; classification; historical manuscripts
National Category
Computer Sciences
Research subject
Computer Science; Computerized Image Processing
Identifiers
urn:nbn:se:uu:diva-238270 (URN)
Conference
The International Conference on Frontiers in Handwriting Recognition (ICFHR), September 1-4, 2014, Crete, Greece
Projects
q2b; q2b_vr2012
Funder
Swedish Research Council, 2012-5743
Available from: 2014-12-11 Created: 2014-12-11 Last updated: 2018-05-03 Bibliographically approved
7. Large scale style based dating of medieval manuscripts
2015 (English) In: Proc. 3rd International Workshop on Historical Document Imaging and Processing, New York: ACM Press, 2015, p. 107-114. Conference paper, Published paper (Refereed)
Place, publisher, year, edition, pages
New York: ACM Press, 2015
National Category
Computer Vision and Robotics (Autonomous Systems)
Research subject
Computerized Image Processing
Identifiers
urn:nbn:se:uu:diva-261747 (URN) 10.1145/2809544.2809560 (DOI) 978-1-4503-3602-4 (ISBN)
Conference
HIP 2015, August 22, Nancy, France
Available from: 2015-08-22 Created: 2015-09-03 Last updated: 2018-01-11 Bibliographically approved
8. Large scale continuous dating of medieval scribes using a combined image and language model
2016 (English) Conference paper, Published paper (Refereed)
Abstract [en]

Finding the production date of a pre-modern manuscript is commonly a long process in historical research, requiring days of work from highly specialised experts. In this paper, we present an automatic dating method based on modelling both the language and the image data. By creating a statistical model over the changes in the pen strokes and short character sequences in the transcribed text, a combination of multiple estimators gives a distribution over the time line for each manuscript. We have evaluated our estimation scheme on the medieval charter collection "Svenskt Diplomatariums huvudkartotek" (SDHK), including more than 5300 transcribed charters from the period 1135-1509. Our system is capable of achieving a median absolute error of 12 years, where the only human input is a transcription of the charter text. Since reading and transcribing the text is a skill that many researchers and students have, compared to the more specialized skill of dating medieval manuscripts based on palaeographical expertise, we find our novel approach suitable for helping individual researchers to date collections of manuscript pages. For larger collections, transcriptions could also be collected using crowd sourcing.

National Category
Computer Vision and Robotics (Autonomous Systems)
Research subject
Computerized Image Processing
Identifiers
urn:nbn:se:uu:diva-294882 (URN) 10.1109/DAS.2016.71 (DOI) 000390411200009 ()
Conference
12th IAPR International Workshop on Document Analysis Systems (DAS), APR 11-14, 2016, Greece
Projects
q2b; q2b_vr2012
Funder
Swedish Research Council, 2012-5743
Available from: 2016-05-30 Created: 2016-05-30 Last updated: 2018-05-04 Bibliographically approved
9. Historical Manuscript Production Date Estimation using Deep Convolutional Neural Networks
2016 (English) Conference paper, Published paper (Refereed)
Abstract [en]

Deep learning has thus far not been used for dating of pre-modern handwritten documents. In this paper, we propose ways of using deep convolutional neural networks (CNNs) to estimate production dates for such manuscripts. In our approach, a CNN can either be used directly for estimating the production date or as a feature learning framework for other regression techniques. We explore the feature learning approach using Gaussian Processes regression and Support Vector Regression. The evaluation is performed on a unique large dataset of over 10000 medieval charters from the Swedish collection Svenskt Diplomatariums huvudkartotek (SDHK). We show that deep learning is applicable to the task of dating documents and that the performance is on average comparable to that of a human expert.

Place, publisher, year, edition, pages
IEEE, 2016
Series
International Conference on Handwriting Recognition, ISSN 2167-6445
Keyword
Document analysis, Manuscripts, Document dating, Digital Humanities
National Category
Computer Vision and Robotics (Autonomous Systems)
Research subject
Computerized Image Processing
Identifiers
urn:nbn:se:uu:diva-306685 (URN) 10.1109/ICFHR.2016.114 (DOI) 000400052400039 () 978-1-5090-0981-7 (ISBN)
Conference
International Conference on Frontiers in Handwriting Recognition (ICFHR), October 23-26, 2016, Shenzhen, China.
Projects
q2b; q2b_vr2012
Funder
Swedish Research Council, 2012-5743; Riksbankens Jubileumsfond, NHS14-2068:1
Available from: 2016-11-01 Created: 2016-11-01 Last updated: 2018-04-04

By author/editor
Wahlberg, Fredrik