Accurate and fast taxonomic profiling of microbial communities
KTH, School of Electrical Engineering (EES), Communication Theory.
2015 (English). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis.
Abstract [en]

With the advent of next-generation sequencing there has been an explosion in the amount of data that needs to be processed, since next-generation sequencing yields base pairs of DNA in the millions. The rate at which the data grows outpaces Moore's law, so there is a huge demand for methods that find meaningful labels for sequenced data. Studying the microbial diversity of a sample is one such challenge in the field of metagenomics. Finding the distribution of a bacterial community has many uses, for example in obesity control. Existing methods often resort to read-by-read classification, which can take several days of computing time in a regular desktop environment and thereby excludes genomic scientists without access to large computing clusters.

By using sparsity-enforcing methods from the general field of sparse signal processing (such as compressed sensing), solutions to the bacterial community composition estimation problem have been found by simultaneously assigning all sample reads to a pre-processed reference database. The inference task is reduced to a general statistical model based on kernel density estimation techniques, which is solved with existing convex optimization tools. The objective is to offer a reasonably fast community composition estimation method. This report proposes clustering as a means of aggregating data to improve the run-time and biological fidelity of existing techniques. The use of convex optimization tools to increase the accuracy of mixture model parameters is also explored and tested. The work concludes with experiments on the proposed improvements, with satisfactory results.
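
The idea of assigning a whole sample to reference profiles at once can be illustrated with a small sketch. It is not the thesis's kernel-density formulation: as a simplifying assumption it fits non-negative mixture weights that sum to one by least squares, using exponentiated-gradient updates in place of a general convex solver. The function name, the reference profiles and the sample vector are all hypothetical.

```python
import math


def estimate_mixture(references, sample, steps=5000, eta=1.0):
    """Estimate mixture weights on the probability simplex.

    Minimises 0.5 * ||sum_i w_i * references[i] - sample||^2 subject to
    w_i >= 0 and sum(w) == 1, using exponentiated-gradient updates.
    """
    n = len(references)
    dim = len(sample)
    weights = [1.0 / n] * n  # start from the uniform mixture
    for _ in range(steps):
        # Feature vector predicted by the current mixture.
        predicted = [sum(w * ref[j] for w, ref in zip(weights, references))
                     for j in range(dim)]
        residual = [p - s for p, s in zip(predicted, sample)]
        # Gradient of the least-squares objective with respect to each weight.
        gradient = [sum(r * e for r, e in zip(ref, residual))
                    for ref in references]
        # The multiplicative update keeps weights positive; renormalising
        # keeps them summing to one.
        weights = [w * math.exp(-eta * g) for w, g in zip(weights, gradient)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return weights


# Hypothetical reference profiles, one row per reference, three features each.
references = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
]
# Sample constructed as 60% of the first profile plus 40% of the second.
sample = [0.46, 0.44, 0.10]
weights = estimate_mixture(references, sample)
```

In this toy case the weight of the third reference shrinks towards zero, which is the kind of sparse solution the abstract refers to: most references in a database are absent from any given sample.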

The use of Dirichlet mixtures is explored as a parametric model of the sample distribution. The Dirichlet is deemed a good choice for aggregating k-mer feature vectors, but Expectation Maximization is found unfit for parameter estimation on bacterial 16S rRNA samples.
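
As an illustration of what a k-mer feature vector is, the sketch below counts overlapping k-mers in a DNA string and normalises the counts. The function name, the lexicographic ordering over A, C, G, T and the choice to skip windows with ambiguous bases are assumptions made for this example, not details taken from the thesis.

```python
from itertools import product


def kmer_frequencies(sequence, k):
    """Return the normalised k-mer frequency vector of a DNA sequence.

    The vector has 4**k entries, ordered lexicographically over A, C, G, T.
    Windows containing any other character (for example N) are skipped.
    """
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = [0] * len(index)
    sequence = sequence.upper()
    for start in range(len(sequence) - k + 1):
        position = index.get(sequence[start:start + k])
        if position is not None:
            counts[position] += 1
    total = sum(counts)
    if total == 0:
        # No valid k-mer found (sequence too short or fully ambiguous).
        return [0.0] * len(counts)
    return [c / total for c in counts]


# "ACGTAC" contains the 2-mers AC, CG, GT, TA, AC.
features = kmer_frequencies("ACGTAC", 2)
```

Because the entries are non-negative and sum to one, such a vector lies on the probability simplex, which is the support of a Dirichlet distribution.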

Finally, a semi-supervised learning method founded on distance-based classification of taxa has been implemented and tested on real biological data with high biological fidelity.
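
Distance-based classification can be sketched in its simplest form as a nearest-centroid rule. The abstract does not state which distance is used or how unlabelled data enter the semi-supervised method, so the Euclidean distance, the function name and the toy centroids below are assumptions chosen only for illustration.

```python
import math


def nearest_taxon(features, centroids):
    """Return the taxon whose centroid is closest in Euclidean distance."""
    def distance(taxon):
        centre = centroids[taxon]
        return math.sqrt(sum((f - c) ** 2 for f, c in zip(features, centre)))
    return min(centroids, key=distance)


# Hypothetical two-dimensional centroids for two taxa.
centroids = {"taxon_a": [0.9, 0.1], "taxon_b": [0.2, 0.8]}
label = nearest_taxon([0.7, 0.3], centroids)
```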

Abstract [sv]

New techniques in DNA sequencing have given rise to an explosion in the amount of available data. Next-generation DNA sequencing generates base pairs numbering in the millions, and the amount of data grows at an exponential rate, which is why there is a great need for new, scalable methods that can analyse quantitative data and extract relevant information. The bacterial species distribution of a sample is one such problem in metagenomics, with several areas of application, for example studies of obesity. At present, the most common way to obtain the species distribution is to classify the bacterial DNA strands, a time-consuming solution that can take up to a day to process high-resolution data. A fast and reliable solution would therefore allow more researchers to make use of next-generation sequencing and analyse its data, which in turn would lead to more innovation in the field.

Alternative solutions inspired by signal processing have been found that exploit the sparse nature of the problem through the use of compressed sensing. Answers are found by simultaneously assigning strands to a pre-processed reference database. The problem has been simplified to a statistical model of the sample with nonparametric estimation, in order to obtain the distribution of bacterial species implicitly by means of convex optimization.

This report proposes the use of clustering to aggregate data, in order to improve the reliability of the answers and reduce the time needed to compute them. The use of parametric models, namely the Dirichlet distribution, has been explored. The report concludes that the assumptions about its suitability as a means of aggregating k-mer vectors are reasonable, but that parameter estimation with Expectation Maximization does not work well in conjunction with the Dirichlet, and that a reformulation of the parameter would be needed in the vector space spanned by the 16S rRNA gene.

Finally, distance-based assignment of bacteria has been tested on data from a real biological context with very high accuracy.


Place, publisher, year, edition, pages
2015. 50 p.
Series
EES Examensarbete / Master Thesis, XR-EE-KT 2015:001
National Category
Other Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
URN: urn:nbn:se:kth:diva-162919
OAI: oai:DiVA.org:kth-162919
DiVA: diva2:798057
Educational program
Master of Science - Wireless Systems
Presentation
2015-02-23, SIP conference room, Osquldas väg 10, Floor 3, Stockholm, 15:25 (English)
Supervisors
Examiners
Available from: 2015-03-31. Created: 2015-03-25. Last updated: 2015-03-31. Bibliographically approved.

Open Access in DiVA

Full text (1064 kB), 194 downloads
File information
File name: FULLTEXT01.pdf
File size: 1064 kB
Checksum (SHA-512): 67f60b108e84494f3d3e6f6d2ead92fffd0de92a9d6800323b63bc9abf2c861affe3e622f31ab4582306c5887cceb24fd5ac7876278c97f80f8ab047a54abc54
Type: fulltext
Mimetype: application/pdf
