Digitala Vetenskapliga Arkivet

Duplicate Detection and Text Classification on Simplified Technical English
Linköping University, Department of Computer and Information Science.
2019 (English)
Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits
Student thesis
Alternative title
Dublettdetektion och textklassificering på Förenklad Teknisk Engelska (Swedish)
Abstract [en]

This thesis investigates the most effective way of classifying text labels and clustering duplicate texts in technical documentation written in Simplified Technical English. On the classification task, pre-trained transformer language models (BERT) were tested against traditional methods such as k-nearest neighbours (kNN) over tf-idf vectors with cosine similarity, and support vector machines (SVMs). For detecting duplicate texts, vector representations from pre-trained transformer and LSTM models were tested against tf-idf vectors, using the density-based clustering algorithms DBSCAN and HDBSCAN. The results show that traditional methods are comparable to pre-trained models for classification, and that using tf-idf vectors with a low distance threshold in DBSCAN is preferable for duplicate detection.
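The duplicate-detection setup named in the abstract (tf-idf vectors clustered by DBSCAN with a low distance threshold) can be illustrated with a minimal sketch. This is not the thesis implementation: the tokenizer, the smoothed idf weighting, the example sentences, and the values `eps=0.1` and `min_samples=2` are all assumptions chosen for illustration, and the code uses only the Python standard library rather than any particular toolkit.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase alphanumeric runs, punctuation dropped.
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(texts):
    """Return one L2-normalised sparse tf-idf vector (a dict) per text."""
    docs = [Counter(tokenize(t)) for t in texts]
    n = len(docs)
    df = Counter(term for doc in docs for term in doc)
    vectors = []
    for doc in docs:
        # Smoothed idf (an assumption), so terms found in every text keep a positive weight.
        vec = {t: tf * (math.log((1 + n) / (1 + df[t])) + 1) for t, tf in doc.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine_distance(a, b):
    # Vectors are unit length, so cosine distance is 1 minus the dot product.
    if len(a) > len(b):
        a, b = b, a
    return 1.0 - sum(w * b.get(t, 0.0) for t, w in a.items())


def dbscan(vectors, eps, min_samples):
    """Plain DBSCAN; returns one cluster id per vector, with -1 meaning noise."""
    n = len(vectors)
    neighbours = [
        [j for j in range(n) if cosine_distance(vectors[i], vectors[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point is a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # j is a core point, so keep expanding
    return labels


# Hypothetical sentences in the style of technical documentation.
texts = [
    "Remove the access panel before you start the inspection.",
    "Remove the access panel before you start the inspection",
    "Make sure that the hydraulic pressure is released.",
    "Make sure that the hydraulic pressure is released.",
    "Install the new filter and tighten the bolts.",
]
labels = dbscan(tfidf_vectors(texts), eps=0.1, min_samples=2)
print(labels)  # [0, 0, 1, 1, -1]
```

With a small `eps`, only texts whose tf-idf vectors are nearly identical end up in the same cluster, and everything else is labelled as noise (`-1`). Raising `eps` would merge texts that merely share vocabulary, which is why a low threshold suits duplicate detection.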

Place, publisher, year, edition, pages
2019. p. 62
Keywords [en]
NLP, CNL, transformer models, LSTM, BERT, document embeddings, word embeddings, text classification, text clustering, transfer learning, machine learning
National Category
Natural Language Processing; Computer Sciences
Identifiers
URN: urn:nbn:se:liu:diva-158714
ISRN: LIU-IDA/LITH-EX-A--19/033--SE
OAI: oai:DiVA.org:liu-158714
DiVA, id: diva2:1337383
External cooperation
Etteplan
Subject / course
Computer science
Presentation
2019-06-12, Alan Turing, Linköpings Universitet, Linköping, 10:00 (English)
Supervisors
Examiners
Available from: 2019-08-13
Created: 2019-07-14
Last updated: 2025-02-01
Bibliographically approved

Open Access in DiVA

fulltext (1862 kB), 1851 downloads
File information
File name: FULLTEXT01.pdf
File size: 1862 kB
Checksum (SHA-512): d8bc13003822669c5f75c4a06e527957cd2bb907748f6c173256cccb0e7e718ff8a0710c6f4e7c13a6d5de8cad975620682840cfebd2ef2cc657ba788c8299e8
Type: fulltext
Mimetype: application/pdf

Total: 1851 downloads
The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.

Total: 2989 hits