A Basic Language Resource Kit for Persian
2012 (English) In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12) / [ed] Nicoletta Calzolari et al., European Language Resources Association, 2012, 2245-2252 p. Conference paper (Refereed)
Persian, with about 100 million speakers worldwide, belongs to the group of languages with less developed linguistically annotated resources and tools. The few resources and tools that exist are neither open source nor freely available. Our goal is therefore to develop open source resources, such as corpora and treebanks, and tools for data-driven linguistic analysis of Persian. We do this by exploring the reusability of existing resources and adapting state-of-the-art methods for linguistic annotation. We present fully functional tools for text normalization, sentence segmentation, tokenization, part-of-speech tagging, and parsing. As for resources, we describe the Uppsala PErsian Corpus (UPEC), a modified version of the Bijankhan corpus with added sentence segmentation and a consistent tokenization adapted for more appropriate syntactic annotation. The corpus consists of 2,782,109 tokens and is annotated with parts of speech and morphological features. A treebank is being derived from UPEC with an annotation scheme based on Stanford Typed Dependencies; it is planned to consist of 10,000 sentences, of which 215 have already been annotated.
Place, publisher, year, edition, pages
European Language Resources Association, 2012. 2245-2252 p.
Keywords
BLARK, tagging, parsing, Farsi, Persian corpus, Persian treebank
Research subject
Computational Linguistics
Identifiers
URN: urn:nbn:se:uu:diva-189428
ISI: 000323927702052
ISBN: 978-2-9517408-7-7
OAI: oai:DiVA.org:uu-189428
DiVA: diva2:581398
Eighth International Conference on Language Resources and Evaluation (LREC 2012), 23-25 May 2012, Istanbul, Turkey