A graphotactic language metric
KTH, School of Engineering Sciences (SCI), Mathematics (Dept.).
2013 (English). Independent thesis, Basic level (degree of Bachelor), 10 credits / 15 HE credits. Student thesis.
Abstract [en]

In this bachelor’s thesis, we try to classify and identify written human languages by studying the ordering of letters in text. Automatic language identification is of interest in areas such as text indexing, machine translation and natural language parsing.

Eleven written languages which use the Latin alphabet are considered and modelled with a Markov chain on the letter level. Texts from the New Testament and Wikipedia are used as training data. The distances between the languages are then measured by using a matrix-based metric on the transition matrices, and visualized in a dendrogram. A probability-based distance measure is also used.
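
As a rough illustration of the modelling step, the sketch below estimates a first-order letter transition matrix from a text and measures the Frobenius distance between two such matrices. It is not the thesis's implementation: the reduced alphabet, the handling of unseen letters, the function names `transition_matrix` and `frobenius_distance`, and the toy sentences are all assumptions made for this example.

```python
import math

# Assumption: a reduced alphabet of 26 lowercase letters plus space.
ALPHABET = "abcdefghijklmnopqrstuvwxyz "
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}

def transition_matrix(text):
    """Estimate first-order letter transition probabilities from a text."""
    n = len(ALPHABET)
    counts = [[0] * n for _ in range(n)]
    # Characters outside the alphabet are simply dropped (a simplification).
    codes = [INDEX[ch] for ch in text.lower() if ch in INDEX]
    for a, b in zip(codes, codes[1:]):
        counts[a][b] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Rows for letters that never occur are left as all zeros.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix

def frobenius_distance(p, q):
    """Frobenius norm of the difference between two matrices."""
    return math.sqrt(sum((a - b) ** 2
                         for row_p, row_q in zip(p, q)
                         for a, b in zip(row_p, row_q)))

# Toy texts only; real training data would be far larger.
english = transition_matrix("the quick brown fox jumps over the lazy dog")
swedish = transition_matrix("flygande beckasiner soka hwila pa mjuka tuvor")
print(frobenius_distance(english, swedish))
```

Computing this distance for every pair of languages gives a distance matrix, which is the usual input to a hierarchical clustering routine that draws a dendrogram.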

The matrix-based metric is then applied to language identification by creating a transition matrix for the text whose language is to be identified, and comparing the distances from this matrix to those of the known languages; the shortest distance indicates the language of the text. This is compared with maximum-likelihood classification.
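
The two identification strategies can be sketched side by side. The code below is a minimal illustration under stated assumptions, not the method as implemented in the thesis: the alphabet, the add-one smoothing (used here so that no log-probability is infinite), the helper names, and the tiny training strings are all hypothetical.

```python
import math

# Assumption: a reduced alphabet of 26 lowercase letters plus space.
ALPHABET = "abcdefghijklmnopqrstuvwxyz "
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}

def encode(text):
    """Map a text to alphabet indices, dropping unknown characters."""
    return [INDEX[ch] for ch in text.lower() if ch in INDEX]

def transition_matrix(text, smoothing=1.0):
    """Smoothed first-order transition matrix (smoothing is an assumption)."""
    n = len(ALPHABET)
    counts = [[smoothing] * n for _ in range(n)]
    codes = encode(text)
    for a, b in zip(codes, codes[1:]):
        counts[a][b] += 1
    return [[c / sum(row) for c in row] for row in counts]

def frobenius_distance(p, q):
    return math.sqrt(sum((a - b) ** 2
                         for row_p, row_q in zip(p, q)
                         for a, b in zip(row_p, row_q)))

def identify_by_distance(sample, models):
    """Pick the language whose matrix is closest to the sample's matrix."""
    m = transition_matrix(sample)
    return min(models, key=lambda lang: frobenius_distance(m, models[lang]))

def log_likelihood(sample, model):
    """Log-probability of the sample's transitions under a language model."""
    codes = encode(sample)
    return sum(math.log(model[a][b]) for a, b in zip(codes, codes[1:]))

def identify_by_likelihood(sample, models):
    """Pick the language under which the sample is most probable."""
    return max(models, key=lambda lang: log_likelihood(sample, models[lang]))

# Toy training texts; far too short for reliable results.
models = {
    "english": transition_matrix("in the beginning was the word and the word was with god"),
    "swedish": transition_matrix("i begynnelsen var ordet och ordet var hos gud"),
}
sample = "and the word was god"
print(identify_by_distance(sample, models), identify_by_likelihood(sample, models))
```

The distance-based rule compares two estimated matrices, so it needs a sample long enough to estimate one; the likelihood rule scores the sample's transitions directly against each trained model.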

We compare metrics based on different matrix norms, and also study how the order of the Markov chains and the size of the training data and sample texts for language identification influence the results.
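
To make the notion of chain order concrete, the sketch below estimates a chain of arbitrary order by treating each run of `order` consecutive letters as a state. The function name `higher_order_transitions` and the dictionary representation are assumptions chosen for brevity; the thesis may represent higher-order chains differently.

```python
from collections import Counter, defaultdict

def higher_order_transitions(text, order=2):
    """Estimate P(next letter | previous `order` letters) from a text."""
    counts = defaultdict(Counter)
    for i in range(len(text) - order):
        state = text[i:i + order]          # the preceding `order` letters
        counts[state][text[i + order]] += 1
    # Normalise each state's counts into a probability distribution.
    return {
        state: {ch: c / sum(nxt.values()) for ch, c in nxt.items()}
        for state, nxt in counts.items()
    }

probs = higher_order_transitions("abcabcabd", order=2)
print(probs["ab"])  # distribution over the letters that follow "ab"
```

The number of possible states grows as the alphabet size raised to the power of the order, so a higher order generally needs more training text before the estimates become stable.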


The results indicate that the choice of matrix norm is important and that the Frobenius norm and the 1-norm are the best norms for language classification and language identification. Using these, it is possible to generate satisfactory dendrograms, and accurately identify the language of reasonably large texts. On the other hand, the ∞-norm cannot be recommended in this context; an explanation is given for its bad performance.
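
For reference, the standard definitions of the Frobenius norm and of the induced 1- and ∞-norms can be sketched as follows. The function names and the small example matrix are assumptions for illustration, and the example says nothing about which matrix orientation (rows or columns summing to one) the thesis uses.

```python
import math

def norm_1(d):
    """Induced 1-norm: the largest absolute column sum."""
    return max(sum(abs(row[j]) for row in d) for j in range(len(d[0])))

def norm_inf(d):
    """Induced infinity-norm: the largest absolute row sum."""
    return max(sum(abs(x) for x in row) for row in d)

def norm_frobenius(d):
    """Frobenius norm: square root of the sum of all squared entries."""
    return math.sqrt(sum(x * x for row in d for x in row))

# Hypothetical difference between two 2x2 transition matrices.
difference = [[0.5, -0.5],
              [0.1, -0.1]]
print(norm_1(difference), norm_inf(difference), norm_frobenius(difference))
```

The induced 1- and ∞-norms are each determined by a single column or a single row of the difference matrix, whereas the Frobenius norm aggregates every entry; this is a general property of the definitions, not a summary of the explanation given in the thesis.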

Some languages are easier to classify correctly than others; the Scandinavian languages are easy to group together, as are Spanish, Portuguese and Italian. However, English, French, German and Finnish are harder to classify correctly.


Keywords [en]

Written human languages, Language classification, Language identification, Markov chain model, Matrix norms, Statistical analysis of text.

Place, publisher, year, edition, pages
2013. 70 p.
National Category
Engineering and Technology
URN: urn:nbn:se:kth:diva-128781
OAI: diva2:648569
Educational program
Master of Science in Engineering - Engineering Physics
Available from: 2013-09-16 Created: 2013-09-16 Last updated: 2013-09-16. Bibliographically approved

Open Access in DiVA

Joar Bagge kandidatex (995 kB), 121 downloads
File information
File name: FULLTEXT01.pdf. File size: 995 kB. Checksum: SHA-512
Type: fulltext. Mimetype: application/pdf
