Clustering metagenome contigs using coverage with CONCOCT
KTH, School of Computer Science and Communication (CSC).
2017 (English)
Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits
Student thesis
Alternative title
Klustring av metagenom-kontiger baserat på abundans-profiler med CONCOCT (Swedish)
Abstract [en]

Metagenomics makes it possible to study the genetic potential of microorganisms without prior cultivation. Since metagenome assembly results in fragmented genomes, a key challenge is to cluster the genome fragments (contigs) into more or less complete genomes.

The goal of this project was to investigate how well CONCOCT bins assembled contigs into taxonomically relevant clusters using the abundance profiles of the contigs over multiple samples. This was done by studying how different CONCOCT parameter settings affect the clustering of metagenome contigs from in silico model communities generated by mixing data from isolate genomes. These parameters control how the model that CONCOCT trains is tuned and how the model then assigns contigs to clusters.

Each parameter was tested in isolation while the others were kept at their default values. For each data set used, the number of clusters was kept constant at the known number of species and strains in that data set.

The resulting configuration was a tied covariance model, principal components explaining 90% of the variance, and a filter removing contigs shorter than 3000 bp. The results also suggested that all available samples should be used for the abundance profiles.
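
The two preprocessing thresholds named above can be illustrated with a minimal Python sketch. It is not CONCOCT's code: the function names `filter_contigs` and `components_needed` are hypothetical, the contig lengths and eigenvalues are made-up inputs, and only the 3000 bp and 90% figures come from the abstract.

```python
from itertools import accumulate

MIN_CONTIG_LENGTH = 3000    # bp, threshold reported in the abstract
VARIANCE_TO_EXPLAIN = 0.90  # fraction of variance, reported in the abstract


def filter_contigs(contig_lengths, min_length=MIN_CONTIG_LENGTH):
    """Keep only contigs that are at least min_length bp long.

    contig_lengths is a hypothetical mapping of contig name to length in bp.
    """
    return {name: length
            for name, length in contig_lengths.items()
            if length >= min_length}


def components_needed(eigenvalues, target=VARIANCE_TO_EXPLAIN):
    """Return the smallest number of leading principal components whose
    cumulative share of the total variance reaches target.

    eigenvalues are the variances along each principal component.
    """
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    for count, cumulative in enumerate(accumulate(ordered), start=1):
        if cumulative / total >= target:
            return count
    return len(ordered)


if __name__ == "__main__":
    lengths = {"contig_1": 2999, "contig_2": 3000, "contig_3": 12000}
    print(filter_contigs(lengths))
    print(components_needed([6, 2.5, 1, 0.5]))
```

In this toy input the first two components explain 85% of the variance and the first three explain 95%, so three components are kept.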

With these parameters, CONCOCT was then run so that it estimated the number of clusters automatically. This gave poor results, which led to the conclusion that the method implemented in CONCOCT for selecting the number of clusters, the “Bayesian Information Criterion”, was not good enough. Another, similar mathematical model was therefore tested, the “Dirichlet Process Gaussian Mixture Model”, which uses a different algorithm to estimate the number of clusters. This new model gave much better results, and CONCOCT has adopted a similar model in later versions.
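
For readers unfamiliar with the criterion, the sketch below shows how choosing a cluster count with the Bayesian Information Criterion works in general, using the standard definition BIC = k ln(n) - 2 ln(L), where lower is better. It is an illustration under stated assumptions, not the thesis's or CONCOCT's implementation: the function names are hypothetical, the log-likelihood values are invented, and in practice they would come from fitting one Gaussian mixture model per candidate cluster count.

```python
import math


def tied_gmm_parameter_count(n_clusters, n_dims):
    """Free parameters of a Gaussian mixture with one shared (tied)
    covariance matrix: cluster means, the symmetric covariance matrix,
    and the mixing weights (which sum to one)."""
    means = n_clusters * n_dims
    shared_covariance = n_dims * (n_dims + 1) // 2
    mixing_weights = n_clusters - 1
    return means + shared_covariance + mixing_weights


def bic(log_likelihood, n_params, n_points):
    """Standard BIC: n_params * ln(n_points) - 2 * log_likelihood."""
    return n_params * math.log(n_points) - 2.0 * log_likelihood


def select_cluster_count(log_likelihoods, n_dims, n_points):
    """Pick the candidate cluster count with the lowest BIC.

    log_likelihoods maps each candidate cluster count to the total
    log-likelihood of a model fitted with that many clusters.
    """
    scores = {
        k: bic(ll, tied_gmm_parameter_count(k, n_dims), n_points)
        for k, ll in log_likelihoods.items()
    }
    best = min(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    # Invented log-likelihoods for 1, 2 and 3 clusters (illustration only).
    candidates = {1: -500.0, 2: -300.0, 3: -295.0}
    best, scores = select_cluster_count(candidates, n_dims=2, n_points=100)
    print(best, scores)
```

Going from two to three clusters improves the fit only slightly here, so the penalty term outweighs the gain and two clusters are selected. A Dirichlet process mixture takes a different route: it does not compare separately fitted models but lets the weights of unneeded components shrink during a single fit.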

Abstract [sv]

Metagenomics makes it possible to analyse the genomes of microbial communities without first having to cultivate the microorganisms. The method involves reading short DNA fragments that are then pieced together into longer genome fragments (contigs). By grouping contigs that originate from the same organism, more or less complete genomes can then be reconstructed, but this is a difficult bioinformatic challenge.

The aim of this project was to evaluate the precision with which the software CONCOCT, which we recently developed, groups contigs originating from the same organism based on information about the contigs' sequence composition and abundance profile across different samples. We tested how different parameters affected the clustering of contigs in artificial metagenome data sets of varying complexity, which we created in silico by mixing data from previously sequenced genomes. The parameters tested concerned the input data as well as the statistical model that CONCOCT uses to perform the clustering. The parameters were varied one at a time while the other parameters were held constant. The number of clusters was also held constant and corresponded to the number of different organisms in the communities. The best results were obtained when we used a tied covariance model and principal components explaining 90% of the variance, and filtered out contigs shorter than 3000 base pairs. We also obtained the best results when we used all available samples.

We then used these parameter settings and let CONCOCT itself determine a suitable number of clusters in the data sets with the “Bayesian Information Criterion”, the method that was implemented in CONCOCT at the time. This gave unsatisfactory results, as a rule with too few and too large clusters. We therefore tested an alternative method, “Dirichlet Process Gaussian Mixture Model”, for estimating the number of clusters. This method gave considerably better results, and a similar method has been implemented in later versions of CONCOCT.

Place, publisher, year, edition, pages
2017. p. 39
Keywords [en]
metagenomics, coverage, composition, k-mer, clustering, mixture model, gaussian, dirichlet process, machine learning
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:kth:diva-208944
OAI: oai:DiVA.org:kth-208944
DiVA, id: diva2:1109071
External cooperation
SciLifeLab
Subject / course
Computer Science
Educational program
Master of Science - Computational and Systems Biology
Supervisors
Examiners
Available from: 2017-10-02 Created: 2017-06-13 Last updated: 2018-01-13 Bibliographically approved

Open Access in DiVA

fulltext (2188 kB), 51 downloads
File information
File name: FULLTEXT01.pdf
File size: 2188 kB
Checksum SHA-512: a9692beec02b9973625ab4114ab464445ca9e87bfbb7df6ac0c08286b100b9bc7395fd3f7ea1f9f5319df3c7ff7e65112e516506e10a615187bd93b202f92e8b
Type: fulltext
Mimetype: application/pdf

By organisation
School of Computer Science and Communication (CSC)
Computer Sciences

Total: 51 downloads
The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.

Total: 151 hits