Clustering unstructured life sciences experiments with unsupervised machine learning: Natural language processing for unstructured life sciences texts
KTH, School of Electrical Engineering and Computer Science (EECS).
2019 (English)
Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits
Student thesis
Alternative title
Klustrering av naturvetenskapliga experiment med oövervakad maskininlärning. (Swedish)
Abstract [en]

The purpose of this master’s thesis is to analyse different types of document representations with the aim of improving, in an unsupervised manner, the searchability of unstructured textual life sciences experiments by clustering similar experiments together. The challenge is to produce, analyse and compare different representations of the life sciences data using traditional and advanced unsupervised machine learning models. The text data analysed in this work is noisy and very heterogeneous, as it comes from a real-world Electronic Lab Notebook.

Clustering unstructured and unlabeled text experiments is challenging. It requires creating representations based only on the relevant information in an experiment. This work studies statistical and generative techniques, word embeddings and some of the most recent deep learning models in Natural Language Processing to create various representations of the studied data. It explores the possibility of combining multiple techniques and using external life sciences knowledge bases to create richer representations before applying clustering algorithms. Different types of analysis are performed, including an assessment by experts, to evaluate and compare the scientific relevance of the clusters of experiments created from the different data representations. The results show that traditional statistical techniques can still produce good baselines. Modern deep learning techniques have been shown to model the studied data well and create rich representations. Combining multiple techniques with external knowledge (biomedical and life-science-related ontologies) has been shown to produce the best results in grouping similar relevant experiments together. The different techniques studied model different and complementary aspects of a text, so combining them is key to significantly improving the clustering of unstructured data.
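The idea of combining a statistical text representation with external ontology knowledge before clustering can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: the four toy experiment descriptions, the hypothetical `ontology` mapping and its concept IDs, the weighting parameter `alpha`, the similarity threshold, and the greedy single-pass clustering, which stands in for whatever clustering algorithms the thesis actually applies.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1)  # smoothed IDF
            for term, count in tf.items()
        })
    return vectors


def concept_vector(doc, ontology):
    """Count the (hypothetical) ontology concepts mentioned in a document."""
    counts = Counter(ontology[t] for t in doc.lower().split() if t in ontology)
    return {concept: float(n) for concept, n in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def greedy_clusters(n_items, similarity, threshold):
    """Assign each item to the first cluster whose seed is similar enough."""
    clusters = []
    for i in range(n_items):
        for members in clusters:
            if similarity(i, members[0]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Toy experiment descriptions (invented for this sketch).
docs = [
    "pcr amplification of gene fragment with taq polymerase",
    "polymerase chain reaction to amplify dna fragment",
    "western blot of protein lysate with antibody",
    "immunoblot detection of protein using antibody",
]
# Hypothetical ontology: surface term -> concept ID.
ontology = {
    "pcr": "C:pcr", "polymerase": "C:pcr",
    "western": "C:immunoblot", "blot": "C:immunoblot",
    "immunoblot": "C:immunoblot", "antibody": "C:antibody",
}

text_vecs = tfidf_vectors(docs)
concept_vecs = [concept_vector(doc, ontology) for doc in docs]
alpha = 0.5  # assumed weight of the text view versus the ontology view


def combined_similarity(i, j):
    """Weighted mix of text similarity and ontology-concept similarity."""
    return (alpha * cosine(text_vecs[i], text_vecs[j])
            + (1 - alpha) * cosine(concept_vecs[i], concept_vecs[j]))


clusters = greedy_clusters(len(docs), combined_similarity, threshold=0.5)
print(clusters)
```

In this toy setting the ontology view links "western blot" and "immunoblot" even though the two descriptions share no method term, which is the kind of complementary signal the abstract attributes to external knowledge.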

Abstract [sv] (English translation)

The aim of this thesis is to analyse different types of document representations in order to improve, in an unsupervised manner, the searchability of unstructured biomedical experiments by clustering similar experiments together. The work involves producing, analysing and comparing text representations using various traditional and modern machine learning methods. The data analysed is noisy and heterogeneous because it comes from manually written experiments in an electronic lab notebook system.

Clustering unstructured and unannotated experiments is a challenge. It requires a representation of the text that is based solely on essential information. In this thesis, statistical and generative techniques such as word embeddings, together with the latest advances in deep learning for natural language processing, have been used to create different text representations. The possibility of creating a better representation by combining different techniques and by drawing on external biomedical knowledge sources has been investigated. Several analyses have been carried out, complemented by a manual evaluation performed by experts in the biomedical field.

The results show that traditional statistical methods can provide a reasonable baseline. Modern deep learning algorithms have also proved to work very well and to create rich representations of the content. Combinations of several techniques, together with the use of external biomedical knowledge sources and ontologies, have been shown to give the best results. The different techniques appear to model different and complementary aspects of a text, and combining them may be a key to significantly improving the searchability of unstructured text.

Place, publisher, year, edition, pages
2019. p. 90
Series
TRITA-EECS-EX ; 2019:684
National Category
Computer and Information Sciences
Identifiers
URN: urn:nbn:se:kth:diva-265549
OAI: oai:DiVA.org:kth-265549
DiVA, id: diva2:1378586
Available from: 2020-01-29
Created: 2019-12-13
Last updated: 2020-01-29
Bibliographically approved

Open Access in DiVA

fulltext (12630 kB), 12 downloads
File information
File name: FULLTEXT01.pdf
File size: 12630 kB
Checksum (SHA-512): 7097a350dc07834b0b8cd6002ed136ad5ec675aef699a00273c58a73696d979bf96e2c35b871e9ed54a1d43e66edb4e011c820506ff84c4e5cb58d03fcf3f2e8
Type: fulltext
Mimetype: application/pdf
