Morphosyntactic Corpora and Tools for Persian
Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology. (Computational Linguistics)
2015 (English). Doctoral thesis, monograph (Other academic)
Abstract [en]

This thesis presents open source resources in the form of annotated corpora and modules for automatic morphosyntactic processing and analysis of Persian texts. More specifically, the resources consist of an improved part-of-speech tagged corpus and a dependency treebank, as well as tools for text normalization, sentence segmentation, tokenization, part-of-speech tagging, and dependency parsing for Persian.

In developing these resources and tools, two key requirements are observed: compatibility and reuse. The compatibility requirement encompasses two parts. First, the tools in the pipeline should be compatible with each other in such a way that the output of one tool is compatible with the input requirements of the next. Second, the tools should be compatible with the annotated corpora and deliver the same analysis that is found in these. The reuse requirement means that all the components in the pipeline are developed by reusing resources, standard methods, and open source state-of-the-art tools. This is necessary to make the project feasible.
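The first compatibility requirement can be illustrated with a minimal sketch in which each stage returns exactly the type the next stage accepts. The function names and their placeholder logic below are assumptions made for illustration; they are not the thesis tools and do not reproduce their behavior.

```python
import re
from typing import List

def normalize(text: str) -> str:
    # Placeholder normalizer: collapses runs of whitespace only.
    return " ".join(text.split())

def segment(text: str) -> List[str]:
    # Placeholder segmenter: splits after '.', '!', '?' or the Persian '؟'.
    parts = re.split(r"(?<=[.!?؟])\s+", text)
    return [p for p in parts if p]

def tokenize(sentence: str) -> List[str]:
    # Placeholder tokenizer: whitespace splitting, punctuation left attached.
    return sentence.split()

def run_pipeline(text: str) -> List[List[str]]:
    # Each stage consumes the output type of the previous one:
    # str -> str -> list of sentences -> list of token lists.
    return [tokenize(s) for s in segment(normalize(text))]

result = run_pipeline("A  b.  C d?")
```

A tagger and a parser would extend the chain in the same way, each taking the token lists produced here as input.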

Given these requirements, the thesis investigates two main research questions. The first is how to develop morphologically and syntactically annotated corpora and tools while satisfying the requirements of compatibility and reuse. The approach taken is to accept the tokenization variations in the corpora in order to achieve robustness. These variations stem from orthographic variation in the writing of fixed expressions and of various types of affixes and clitics. Since they are inherent properties of Persian texts, the tools in the pipeline must be able to handle them and should therefore not be trained on idealized data.

The second question concerns how accurately we can perform morphological and syntactic analysis for Persian by adapting and applying existing tools to the annotated corpora. The experimental evaluation of the tools shows that the sentence segmenter and tokenizer achieve an F-score close to 100%, the tagger has an accuracy of nearly 97.5%, and the parser achieves a best labeled accuracy of over 82% (with unlabeled accuracy close to 87%).
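As a reading aid for the parser figures, the sketch below shows how labeled and unlabeled accuracy are commonly computed: unlabeled accuracy counts tokens whose head is correct, while labeled accuracy also requires the dependency label to match. The function name, the arc representation, and the toy arcs and labels are assumptions for illustration and are not taken from the thesis or its treebank.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (head index, dependency label) for one token

def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (unlabeled, labeled) accuracy over aligned tokens."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted analyses must cover the same tokens")
    if not gold:
        raise ValueError("cannot score an empty analysis")
    # Unlabeled: only the head index has to match.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # Labeled: head index and label both have to match.
    full_hits = sum(g == p for g, p in zip(gold, pred))
    return head_hits / len(gold), full_hits / len(gold)

# Invented four-token example: one wrong label, one wrong head.
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
pred = [(2, "nsubj"), (0, "root"), (2, "obl"), (3, "punct")]
uas, las = attachment_scores(gold, pred)
```

Because a labeled match requires a correct head, labeled accuracy can never exceed unlabeled accuracy, which is consistent with the two parser figures reported above.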

Place, publisher, year, edition, pages
Uppsala: Acta Universitatis Upsaliensis, 2015. 191 p.
Studia Linguistica Upsaliensia, ISSN 1652-1366 ; 16
Keyword [en]
Persian, language technology, corpus, treebank, preprocessing, segmentation, part-of-speech tagging, dependency parsing
National Category
Language Technology (Computational Linguistics) Engineering and Technology
Research subject
Computational Linguistics
URN: urn:nbn:se:uu:diva-248780
ISBN: 978-91-554-9229-8
OAI: diva2:800998
Public defence
2015-05-27, Universitetshuset / IX, Uppsala, 10:15 (English)
Available from: 2015-05-06. Created: 2015-04-08. Last updated: 2015-07-07. Bibliographically approved.

By author/editor
Seraji, Mojgan