Computational Terminology: Exploring Bilingual and Monolingual Term Extraction
Linköping University, Department of Computer and Information Science, NLPLAB - Natural Language Processing Laboratory. Linköping University, The Institute of Technology.
2012 (English). Licentiate thesis, comprehensive summary (Other academic)
Abstract [en]

Terminologies are becoming more important to modern-day society as technology and science continue to grow at an accelerating rate in a globalized environment. Agreeing on which terms should represent which concepts, and on how those terms should be translated into different languages, is important if we wish to communicate with as little confusion and misunderstanding as possible.

Since the 1990s, an increasing amount of terminology research has been devoted to facilitating and augmenting terminology-related tasks by using computers and computational methods. One focus for this research is Automatic Term Extraction (ATE).

In this compilation thesis, studies on both bilingual and monolingual ATE are presented. The first two publications report on how bilingual ATE using the align-extract approach can be used to extract patent terms. The result in this case was 181,000 manually validated English-Swedish patent terms, which were to be used in a machine translation system for patent documents. A critical component of the method is the Q-value metric, presented in the third paper, which can be used to rank extracted term candidates (TCs) in an order that correlates with TC precision. The use of machine learning (ML) in monolingual ATE is the topic of the two final contributions. The first ML-related publication shows that rule-induction-based ML can be used to generate linguistic term selection patterns; the second uses contrastive n-gram language models in conjunction with SVM-based ML to improve the precision of term candidates selected using linguistic patterns.

Place, publisher, year, edition, pages
Linköping: Linköping University Electronic Press, 2012. 68 p.
Series
Linköping Studies in Science and Technology. Thesis, ISSN 0280-7971 ; 1523
Keyword [en]
terminology, automatic term extraction, automatic term recognition, computational terminology, terminology management
National Category
Language Technology (Computational Linguistics)
Identifiers
URN: urn:nbn:se:liu:diva-75243
ISBN: 978-91-7519-944-3 (print)
OAI: oai:DiVA.org:liu-75243
DiVA: diva2:508245
Presentation
2012-04-04, Alan Turing, Hus E, Campus Valla, Linköpings universitet, Linköping, 13:15 (English)
Available from: 2012-03-07 Created: 2012-02-23 Last updated: 2012-03-07 Bibliographically approved
List of papers
1. Computer aided term bank creation and standardization: Building standardized term banks through automated term extraction and advanced editing tools
2010 (English). In: Terminology in Everyday Life / [ed] Marcel Thelen and Frieda Steurs, John Benjamins Publishing Company, 2010, 163-180 p. Chapter in book (Other academic)
Abstract [en]

Using a standardized term bank in both authoring and translation processes can facilitate the use of consistent terminology, which in turn minimizes confusion and frustration among readers. One of the problems of creating a standardized term bank is the time and effort required. Recent developments in term extraction techniques based on word alignment can improve extraction of term candidates when parallel texts are available. The aligned units are processed automatically, but a large quantity of term candidates will still have to be processed by a terminologist to select which candidates should be promoted to standardized terms. To minimize the work needed to process the extracted term candidates, we propose a method based on using efficient editing tools, as well as ranking the extracted set of term candidates by quality. This sorted set of term candidates can then be edited, categorized and filtered more effectively. In this paper, the process and methods used to arrive at a standardized term bank are presented and discussed.

Place, publisher, year, edition, pages
John Benjamins Publishing Company, 2010
Series
Terminology and Lexicography Research and Practice, ISSN 1388-8455 ; 13
Keyword
terminology, extraction, term bank, automation
National Category
Language Technology (Computational Linguistics) Computer Science
Identifiers
urn:nbn:se:liu:diva-59842 (URN)
978 90 272 2337 1 (ISBN)
Available from: 2010-09-27 Created: 2010-09-27 Last updated: 2013-04-12 Bibliographically approved
2. Automatic Extraction and Manual Validation of Hierarchical Patent Terminology
2009 (English). In: NORDTERM 16. Ontologier og taksonomier: Rapport fra NORDTERM 2009 / [ed] B. Nistrup Madsen & H. Erdman Thomsen, Copenhagen, Denmark: Copenhagen Business School Press, 2009, 249-262 p. Conference paper, Published paper (Refereed)
Abstract [en]

Several methods can be applied to create a set of validated terms from existing documents. In this paper we describe an automatic bilingual term candidate extraction method, and the validation process used to create a hierarchical patent terminology. The process described was used to extract terms from patent texts, commissioned by the Swedish Patent Office with the purpose of using the terms for machine translation. Information on the correct linguistic inflection patterns and hierarchical partitioning of terms based on their use are of utmost importance.

The process contains six phases: 1) Analysis of the source material and system configuration; 2) Term candidate extraction; 3) Term candidate filtering and initial linguistic validation; 4) Manual validation by domain experts; 5) Final linguistic validation; and 6) Publishing the validated terms.

Input to the extraction process consisted of more than 91 000 patent document pairs in English and Swedish, 565 million words in English and 450 million words in Swedish. The English documents were supplied in EBD SGML format and the Swedish documents were supplied in OCR processed scans of patent documents. After grammatical and statistical analysis, the documents were word-aligned. Using the word-aligned material, candidate terms were extracted based on linguistic patterns. 750 000 term candidates were extracted and stored in a relational database. The term candidates were processed in 8 months resulting in 181 000 unique validated term pairs that were exported into several hierarchically organized OLIF files.

Place, publisher, year, edition, pages
Copenhagen, Denmark: Copenhagen Business School Press, 2009
Keyword
automatic term extraction, computational terminology, patent terminology
National Category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:liu:diva-75236 (URN)
978-87-994577-0-0 (ISBN)
Conference
NORDTERM 2009, Copenhagen, Denmark, 9-12 June 2009
Available from: 2012-02-23 Created: 2012-02-22 Last updated: 2012-03-07 Bibliographically approved
3. Terminology extraction and term ranking for standardizing term banks
2007 (English). In: Proceedings of 16th Nordic Conference of Computational Linguistics Nodalida, 2007 / [ed] Joakim Nivre, Heiki-Jaan Kaalep, Kadri Muischnek and Mare Koit, Tartu, Estonia: University of Tartu, 2007, 349-354 p. Conference paper, Published paper (Refereed)
Abstract [en]

This paper presents how word alignment techniques could be used for building standardized term banks. It is shown that time and effort could be saved by a relatively simple evaluation metric based on frequency data from term pairs, and on source and target distributions inside the alignment results. The proposed Q-value metric is shown to outperform other tested metrics such as Dice's coefficient and simple pair frequency.
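
The two baseline metrics named in this abstract can be illustrated with a short sketch. This record does not give the Q-value formula, so it is not reproduced here. The example assumes a hypothetical list of aligned (source, target) pairs and hypothetical helper names (`dice`, `rank_candidates`); it shows only Dice's coefficient and raw pair frequency, not the authors' implementation.

```python
from collections import Counter


def dice(pair_freq, src_freq, tgt_freq):
    """Dice's coefficient: 2 * f(s, t) / (f(s) + f(t))."""
    total = src_freq + tgt_freq
    return 2 * pair_freq / total if total else 0.0


def rank_candidates(aligned_pairs):
    """Rank (source, target) candidates by Dice, breaking ties by pair frequency."""
    pair_counts = Counter(aligned_pairs)
    src_counts = Counter(s for s, _ in aligned_pairs)
    tgt_counts = Counter(t for _, t in aligned_pairs)
    scored = [
        (s, t, n, dice(n, src_counts[s], tgt_counts[t]))
        for (s, t), n in pair_counts.items()
    ]
    # Highest Dice score first; higher pair frequency wins ties.
    return sorted(scored, key=lambda row: (-row[3], -row[2]))


# Hypothetical alignment output (illustrative data only).
pairs = [
    ("valve", "ventil"),
    ("valve", "ventil"),
    ("valve", "klaff"),
    ("shaft", "axel"),
]
ranked = rank_candidates(pairs)
for source, target, freq, score in ranked:
    print(f"{source} -> {target}: freq={freq}, dice={score:.2f}")
```

Note that the two baselines can disagree: here the most frequent pair is not the one with the highest Dice score, because its source word also aligns to a second target.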

 

Place, publisher, year, edition, pages
Tartu, Estonia: University of Tartu, 2007
Keyword
terminology extraction, metric, word alignment
National Category
Computer Science
Identifiers
urn:nbn:se:liu:diva-41011 (URN)
54924 (Local ID)
978-9985-4-0513-0 (ISBN)
54924 (Archive number)
54924 (OAI)
Conference
NODALIDA 2007, 16th Nordic Conference of Computational Linguistics, 24-26 May 2007, University of Tartu, Estonia
Available from: 2010-09-29 Created: 2009-10-10 Last updated: 2012-03-07 Bibliographically approved
4. Using machine learning to perform automatic term recognition
2010 (English). In: Proceedings of the LREC 2010 Workshop on Methods for automatic acquisition of Language Resources and their evaluation methods / [ed] Núria Bel, Béatrice Daille, Andrejs Vasiljevs, European Language Resources Association, 2010, 49-54 p. Conference paper, Published paper (Refereed)
Abstract [en]

In this paper, a machine learning approach is applied to Automatic Term Recognition (ATR). Similar approaches have been successfully used in Automatic Keyword Extraction (AKE). Using a dataset consisting of Swedish patent texts and validated terms belonging to these texts, unigrams and bigrams are extracted and annotated with linguistic and statistical feature values. Experiments using a varying ratio between positive and negative examples in the training data are conducted using the annotated n-grams. The results indicate that a machine learning approach is viable for ATR. Furthermore, a machine learning approach for bilingual ATR is discussed. Preliminary analysis, however, indicates that some modifications have to be made to apply the monolingual machine learning approach to a bilingual context.

Place, publisher, year, edition, pages
European Language Resources Association, 2010
National Category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:liu:diva-75237 (URN)
000356879501100 ()
978-2-9517408-6-0 (ISBN)
Conference
LREC 2010 Workshop on Methods for automatic acquisition of Language Resources and their evaluation methods, 23 May 2010, Valletta, Malta
Available from: 2012-03-01 Created: 2012-02-22 Last updated: 2017-01-23 Bibliographically approved
5. Exploring termhood using language models
2011 (English). In: Proceedings of the Workshop CHAT 2011: Creation, Harmonization and Application of Terminology Resources / [ed] Tatiana Gornostay, Andrejs Vasiljevs, Tartu University Library (Estonia): Northern European Association for Language Technology (NEALT), 2011, 32-35 p. Conference paper, Published paper (Refereed)
Abstract [en]

Term extraction metrics are mostly based on frequency counts. This can be a problem when trying to extract previously unseen multi-word terms. This paper explores whether smoothed language models can be used instead. Although a simplistic use of language models is examined in this paper, the results indicate that with more refinement, smoothed language models may be used instead of unsmoothed frequency-count based termhood metrics.
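
To illustrate why raw frequency counts struggle with previously unseen multi-word candidates, the sketch below compares a raw bigram count with an add-one (Laplace) smoothed bigram probability. The record does not say which smoothing method or toolkit the paper used, so add-one smoothing, the toy corpus, and the function names are all assumptions made for illustration.

```python
from collections import Counter


def train_bigram_counts(tokens):
    """Count bigrams and the contexts (first words) they start from."""
    contexts = Counter(tokens[:-1])
    bigrams = Counter(zip(tokens, tokens[1:]))
    vocab = set(tokens)
    return contexts, bigrams, vocab


def add_one_bigram_prob(w1, w2, contexts, bigrams, vocab):
    """P(w2 | w1) with add-one smoothing (an assumed, simplistic choice)."""
    return (bigrams[(w1, w2)] + 1) / (contexts[w1] + len(vocab))


# Toy corpus (hypothetical data).
tokens = "the control valve regulates the flow of the control fluid".split()
contexts, bigrams, vocab = train_bigram_counts(tokens)

seen = add_one_bigram_prob("control", "valve", contexts, bigrams, vocab)
unseen = add_one_bigram_prob("control", "flow", contexts, bigrams, vocab)
raw_unseen_count = bigrams[("control", "flow")]

print(f"raw count of unseen bigram: {raw_unseen_count}")
print(f"smoothed P(valve | control) = {seen:.3f}")
print(f"smoothed P(flow | control)  = {unseen:.3f}")
```

A count-based metric gives the unseen candidate a score of zero, while the smoothed model still assigns it a small non-zero probability that can be compared against other candidates.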

Place, publisher, year, edition, pages
Tartu University Library (Estonia): Northern European Association for Language Technology (NEALT), 2011
Series
NEALT Proceedings Series, ISSN 1736-8197, E-ISSN 1736-6305 ; Vol. 12
Keyword
automatic term extraction, computational terminology, machine learning
National Category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:liu:diva-75238 (URN)
Conference
NODALIDA 2011 Workshop Creation, Harmonization and Application of Terminology Resources, May 11, 2011, Riga, Latvia
Available from: 2012-02-23 Created: 2012-02-22 Last updated: 2017-02-21 Bibliographically approved

Open Access in DiVA

Computational Terminology: Exploring Bilingual and Monolingual Term Extraction (794 kB), 738 downloads
File information
File name: FULLTEXT01.pdf
File size: 794 kB
Checksum (SHA-512):
f96f59ff271a433bb9e0524e8eaeb439a2bcdcbf98d6023a85635db2ac89c00bd0f7564b1cbf8bd20f169a8bc40283ce74094af2c862212e5e941ab5a79f1f61
Type: fulltext
Mimetype: application/pdf
Cover (164 kB), 97 downloads
File information
File name: COVER01.pdf
File size: 164 kB
Checksum (SHA-512):
a25bd09c9f1f8cc7d2f8d58274a8790af3bfe2657b9271b6793707c63f02c7bb9df3451d3081089da2e45725c368b99a5efab06a369fa20401eee98281364a78
Type: cover
Mimetype: application/pdf

Search in DiVA

By author/editor
Foo, Jody
By organisation
NLPLAB - Natural Language Processing Laboratory
The Institute of Technology
Language Technology (Computational Linguistics)

Total: 738 downloads
The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.

Total: 1160 hits