Duplicate Detection and Text Classification on Simplified Technical English
2019 (English). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits
Student thesis
Alternative title
Dublettdetektion och textklassificering på Förenklad Teknisk Engelska (Swedish)
Abstract [en]
This thesis investigates the most effective way to classify text labels and to cluster duplicate texts in technical documentation written in Simplified Technical English. For the classification task, pre-trained transformer language models (BERT) were tested against traditional methods such as kNN over tf-idf vectors with cosine similarity and SVMs. For detecting duplicate texts, vector representations from pre-trained transformer and LSTM models were tested against tf-idf vectors, using the density-based clustering algorithms DBSCAN and HDBSCAN. The results show that traditional methods perform comparably to pre-trained models for classification, and that tf-idf vectors with a low distance threshold in DBSCAN are preferable for duplicate detection.
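The duplicate-detection approach named in the abstract (tf-idf vectors clustered by DBSCAN with a low distance threshold) can be illustrated with a minimal, standard-library sketch. This is not the thesis implementation: the tokenizer, the smoothed idf formula, the values `eps=0.2` and `min_samples=2`, and the example sentences are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tfidf_vectors(texts):
    """Return one L2-normalised tf-idf vector (dict: term -> weight) per text."""
    # Assumed tokenizer: lowercase alphanumeric runs, punctuation dropped.
    docs = [Counter(re.findall(r"[a-z0-9]+", t.lower())) for t in texts]
    n = len(docs)
    df = Counter(term for doc in docs for term in doc)  # document frequency
    vectors = []
    for doc in docs:
        # Smoothed idf, so terms found in every document keep a non-zero weight.
        vec = {t: tf * (math.log((1 + n) / (1 + df[t])) + 1) for t, tf in doc.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine_distance(a, b):
    """Cosine distance between two L2-normalised sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return 1.0 - sum(w * b.get(t, 0.0) for t, w in a.items())


def dbscan(vectors, eps, min_samples):
    """Minimal DBSCAN; returns one label per point, with -1 meaning noise."""
    n = len(vectors)
    neighbours = [
        [j for j in range(n) if cosine_distance(vectors[i], vectors[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # j is a core point, keep expanding
    return labels


# Hypothetical sentences in the style of technical documentation.
texts = [
    "Remove the four bolts from the access panel.",
    "Remove the four bolts from the access panel.",
    "Remove the four bolts from the access panel",
    "Make sure that the hydraulic pressure is released.",
    "Install the new filter in the housing.",
]
labels = dbscan(tfidf_vectors(texts), eps=0.2, min_samples=2)
print(labels)  # [0, 0, 0, -1, -1]
```

With a low `eps`, only texts whose tf-idf vectors are nearly identical share a cluster; the remaining texts are labelled as noise (`-1`), that is, as having no duplicate.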
Place, publisher, year, edition, pages
2019. p. 62
Keywords [en]
NLP, CNL, transformer models, LSTM, BERT, document embeddings, word embeddings, text classification, text clustering, transfer learning, machine learning
National Category
Natural Language Processing; Computer Sciences
Identifiers
URN: urn:nbn:se:liu:diva-158714
ISRN: LIU-IDA/LITH-EX-A--19/033--SE
OAI: oai:DiVA.org:liu-158714
DiVA, id: diva2:1337383
External cooperation
Etteplan
Subject / course
Computer science
Presentation
2019-06-12, Alan Turing, Linköpings Universitet, Linköping, 10:00 (English)
Supervisors
Examiners
2019-08-13, 2019-07-14, 2025-02-01. Bibliographically approved