A Basic Language Resource Kit for Persian
Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology. (computational linguistics)
Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology. (computational linguistics)
Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Languages, Department of Linguistics and Philology. (computational linguistics)
2012 (English). In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12) / [ed] Nicoletta Calzolari et al., European Language Resources Association, 2012, p. 2245-2252. Conference paper, Published paper (Refereed)
Abstract [en]

Persian, with about 100,000,000 speakers worldwide, belongs to the group of languages with less developed linguistically annotated resources and tools. The few existing resources and tools are neither open source nor freely available. Thus, our goal is to develop open source resources such as corpora and treebanks, and tools for data-driven linguistic analysis of Persian. We do this by exploring the reusability of existing resources and adapting state-of-the-art methods for the linguistic annotation. We present fully functional tools for text normalization, sentence segmentation, tokenization, part-of-speech tagging, and parsing. As for resources, we describe the Uppsala PErsian Corpus (UPEC), which is a modified version of the Bijankhan corpus with additional sentence segmentation and consistent tokenization modified for more appropriate syntactic annotation. The corpus consists of 2,782,109 tokens and is annotated with parts of speech and morphological features. A treebank is derived from UPEC with an annotation scheme based on Stanford Typed Dependencies and is planned to consist of 10,000 sentences, of which 215 have already been annotated.

Place, publisher, year, edition, pages
European Language Resources Association, 2012. p. 2245-2252
Keyword [en]
BLARK, tagging, parsing, Farsi, Persian corpus, Persian treebank
National Category
Humanities
Research subject
Computational Linguistics
Identifiers
URN: urn:nbn:se:uu:diva-189428
ISI: 000323927702052
ISBN: 978-2-9517408-7-7 (print)
OAI: oai:DiVA.org:uu-189428
DiVA, id: diva2:581398
Conference
Eighth International Conference on Language Resources and Evaluation (LREC 2012), 23-25 May 2012, Istanbul, Turkey
Available from: 2013-01-09 Created: 2013-01-01 Last updated: 2017-01-25 Bibliographically approved

Open Access in DiVA

fulltext (347 kB), 448 downloads
File information
File name: FULLTEXT02.pdf
File size: 347 kB
Checksum (SHA-512):
6eddf9f6783f07da002cb7cdda523e33d1145cdd9d21776c6f4066d7e1326a25fe8e8a0a2ea388a6d2e49aa4a861f35c57e2680ea263ab8b8b29bfa18f64abb1
Type: fulltext
Mimetype: application/pdf

Other links

http://www.lrec-conf.org/proceedings/lrec2012/pdf/338_Paper.pdf

Search in DiVA

By author/editor
Seraji, Mojgan; Megyesi, Beata