Algorithms and statistical models for scaffolding contig assemblies and detecting structural variants using read pair data
KTH, School of Computer Science and Communication (CSC), Computational Biology, CB. (Arvestad) ORCID iD: 0000-0001-7378-2320
2015 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Advances in throughput from Next Generation Sequencing (NGS) methods have provided new ways to study molecular biology. The increased amount of data enables genome-wide studies of structural variation, transcription, translation and genome composition. Not only is the scale of each experiment large; lower cost and faster turnaround have also increased the frequency with which new experiments are conducted. With the data growth comes an increased demand for efficient and robust algorithms, which is a great computational challenge. The design of computationally efficient algorithms is crucial to cope with the amount of data, and it is relatively easy to verify an algorithm's efficiency by measuring runtime and memory consumption. However, because NGS data is large and comes with several artifacts, the difficulty lies in verifying that an algorithm gives accurate results and is robust across different data sets.

This thesis focuses on modeling assumptions of mate-pair and paired-end reads when scaffolding contig assemblies or detecting variants. Both genome assembly and structural variation detection are difficult problems, partly because of their computational complexity, but also because of noise and artifacts in the input data. Constructing methods that address all artifacts and parameters in data is difficult, if not impossible, and end-to-end pipelines often come with several simplifications. Instead of tackling these difficult problems all at once, a large part of this thesis concentrates on smaller problems around scaffolding and structural variation detection. By identifying and modeling parts of the problem where simplifications have been made in other algorithms, we obtain an improved solution to the corresponding full problem.

The first paper presents an improved model for estimating gap sizes, and hence contig placement, in the scaffolding problem. The second paper introduces a new scaffolder for large, complex genomes, and the third paper extends the scaffolding method to account for paired-end contamination in mate-pair libraries. The fourth paper investigates detection of structural variants using fragment length information and corrects a commonly assumed null-hypothesis distribution used to detect structural variants.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2015, x, 59 p.
Series
TRITA-CSC-A, ISSN 1653-5723 ; 2015:14
National Category
Bioinformatics (Computational Biology)
Identifiers
URN: urn:nbn:se:kth:diva-173580
ISBN: 978-91-7595-677-0 (print)
OAI: oai:DiVA.org:kth-173580
DiVA: diva2:853717
Public defence
2015-10-01, Atrium, Nobels väg 12B, Stockholm, 10:00 (English)
Opponent
Supervisors
Note

QC 20150915

Available from: 2015-09-15 Created: 2015-09-14 Last updated: 2015-09-15 Bibliographically approved
List of papers
1. Improved gap size estimation for scaffolding algorithms
2012 (English) In: Bioinformatics, ISSN 1367-4803, E-ISSN 1367-4811, Vol. 28, no 17, 2215-2222 p.
Article in journal (Refereed) Published
Abstract [en]

Motivation: One of the important steps of genome assembly is scaffolding, in which contigs are linked using information from read pairs. Scaffolding provides estimates of the order, relative orientation and distance between contigs. We have found that contig distance estimates are generally strongly biased and based on false assumptions. Since erroneous distance estimates can mislead subsequent analysis, it is important to provide unbiased estimation of contig distance.

Results: In this article, we show that state-of-the-art programs for scaffolding use an incorrect model of gap size estimation. We discuss why current maximum likelihood estimators are biased and describe the different cases of bias we face. Furthermore, we provide a model for the distribution of reads that span a gap and derive the maximum likelihood equation for the gap length. We motivate why this estimate is sound and show empirically that it outperforms gap estimators in popular scaffolding programs. Our results have consequences for scaffolding software, for structural variation detection and for library insert-size estimation as commonly performed by read aligners.
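
The bias described in this abstract can be illustrated with a small simulation. The following is a minimal sketch under simplifying assumptions that are not taken from the paper: insert sizes are normal with known mean mu and standard deviation sigma, fragment start positions are uniform, both contigs are long compared with the insert size, and a pair counts as spanning only when each read lies fully on a contig. Under these assumptions the density of a spanning insert size x for gap d and read length r is proportional to f(x)(x - d - 2r), so long fragments are over-represented among spanning pairs and the naive estimate (library mean minus mean observation) comes out too small. All function names are hypothetical, and the grid search stands in for solving a likelihood equation analytically.

```python
import math
import random


def simulate_observations(n, gap, mu, sigma, read_len, contig_len, seed=0):
    """Draw n read pairs that span a gap between two contigs of equal length.

    Fragments get a normally distributed insert size and a uniform start
    position on contig 1 + gap + contig 2; only pairs with one read fully on
    each contig are kept. Returns the observed distances o = insert - gap.
    """
    rng = random.Random(seed)
    total = 2 * contig_len + gap
    gap_start, gap_end = contig_len, contig_len + gap
    observations = []
    while len(observations) < n:
        insert = rng.gauss(mu, sigma)
        start = rng.uniform(0, total)
        end = start + insert
        left_read_ok = start + read_len <= gap_start
        right_read_ok = end - read_len >= gap_end and end <= total
        if left_read_ok and right_read_ok:
            # Part of the fragment on contig 1 plus the part on contig 2.
            observations.append((gap_start - start) + (end - gap_end))
    return observations


def naive_gap(observations, mu):
    """Library mean minus mean observation; ignores the spanning condition."""
    return mu - sum(observations) / len(observations)


def ml_gap(observations, mu, sigma, read_len, max_gap=None):
    """Grid-search ML gap estimate under the simplified spanning model."""
    mean_obs = sum(observations) / len(observations)
    if max_gap is None:
        max_gap = int(mu + 3 * sigma)
    best_gap, best_loglik = 0, -math.inf
    for gap in range(max_gap + 1):
        # z = E[(X - a)^+] for X ~ N(mu, sigma^2) with a = gap + 2 * read_len:
        # the normalising constant of the spanning insert-size density.
        a = gap + 2 * read_len
        alpha = (a - mu) / sigma
        pdf = math.exp(-0.5 * alpha * alpha) / math.sqrt(2 * math.pi)
        tail = 0.5 * math.erfc(alpha / math.sqrt(2))
        z = sigma * pdf + (mu - a) * tail
        if z <= 0:
            continue  # numerically, no spanning pairs are possible
        # Per-observation log-likelihood, dropping terms constant in gap.
        loglik = -((mean_obs + gap - mu) ** 2) / (2 * sigma ** 2) - math.log(z)
        if loglik > best_loglik:
            best_gap, best_loglik = gap, loglik
    return best_gap


if __name__ == "__main__":
    obs = simulate_observations(2000, gap=400, mu=500.0, sigma=100.0,
                                read_len=50, contig_len=1000)
    print("naive estimate:", round(naive_gap(obs, 500.0)))
    print("ML estimate:", ml_gap(obs, 500.0, 100.0, 50))
```

With a true gap of 400 in this toy setting, the naive estimate falls well below the true value, while the grid-search estimate lands close to it. The sketch ignores short contigs, which cap the number of spanning positions and change the likelihood.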

National Category
Bioinformatics and Systems Biology
Research subject
SRA - E-Science (SeRC)
Identifiers
urn:nbn:se:kth:diva-101249 (URN)
10.1093/bioinformatics/bts441 (DOI)
000308019200001 ()
2-s2.0-84865546399 (Scopus ID)
Funder
Swedish Research Council, 2010-4634
Science for Life Laboratory - a national resource center for high-throughput molecular bioscience
Swedish e‐Science Research Center
Note

QC 20120912

Available from: 2012-08-25 Created: 2012-08-25 Last updated: 2017-12-07 Bibliographically approved
2. BESST - Efficient scaffolding of large fragmented assemblies
2014 (English) In: BMC Bioinformatics, ISSN 1471-2105, E-ISSN 1471-2105, Vol. 15, no 1, 281- p.
Article in journal (Refereed) Published
Abstract [en]

Background: The use of short reads from High Throughput Sequencing (HTS) techniques is now commonplace in de novo assembly. Yet obtaining contiguous assemblies from short reads is challenging, which makes scaffolding an important step in the assembly pipeline. Different algorithms have been proposed, but many of them use the number of read pairs supporting a link between two contigs as an indicator of reliability. This reasoning is intuitive, but fails to account for variation in link count due to contig features. We have also noted that published scaffolders are only evaluated on small datasets using output from only one assembler. Two issues arise from this. Firstly, some of the available tools are not well suited for complex genomes. Secondly, these evaluations provide little support for inferring a software's general performance.

Results: We propose a new algorithm, implemented in a tool called BESST, which can scaffold genomes of all sizes and complexities and was used to scaffold the genome of P. abies (20 Gbp). We performed a comprehensive comparison of BESST against the most popular stand-alone scaffolders on a large variety of datasets. Our results confirm that some of the popular scaffolders are not practical to run on complex datasets. Furthermore, no single stand-alone scaffolder outperforms the others on all datasets. However, BESST compares favorably with the other tested scaffolders on GAGE datasets and, moreover, outperforms the other methods when the library insert size distribution is wide.

Conclusion: We conclude from our results that information sources other than the quantity of links, as is commonly used, can provide useful information about genome structure when scaffolding.
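
The remark that link count varies with contig features can be made concrete with a counting argument. The sketch below is an illustration only and is not BESST's scoring: it assumes uniformly distributed fragment start positions, a discretised normal insert-size distribution, and that a link requires each read to lie fully on its contig. The function names and the pairs_per_position parameter (expected number of fragments starting at each genome position) are hypothetical.

```python
import math


def spanning_positions(insert, gap, len1, len2, read_len):
    """Count integer fragment start positions that link contig 1 to contig 2.

    Coordinates: contig 1 is [0, len1), the gap is [len1, len1 + gap) and
    contig 2 is [len1 + gap, len1 + gap + len2). The left read must lie fully
    on contig 1 and the right read fully on contig 2.
    """
    lowest = max(0, len1 + gap - insert + read_len)
    highest = min(len1 - read_len, len1 + gap + len2 - insert)
    return max(0, highest - lowest + 1)


def expected_links(gap, len1, len2, read_len, mu, sigma, pairs_per_position):
    """Expected link count with discretised normal insert sizes."""
    inserts = range(max(2 * read_len, int(mu - 4 * sigma)),
                    int(mu + 4 * sigma) + 1)
    weights = [math.exp(-0.5 * ((x - mu) / sigma) ** 2) for x in inserts]
    mean_positions = sum(
        w * spanning_positions(x, gap, len1, len2, read_len)
        for x, w in zip(inserts, weights)
    ) / sum(weights)
    return pairs_per_position * mean_positions


if __name__ == "__main__":
    # Same gap and library, but the second contig is short in one case.
    long_pair = expected_links(400, 1000, 1000, 50, 500.0, 100.0, 0.1)
    short_pair = expected_links(400, 1000, 120, 50, 500.0, 100.0, 0.1)
    print(long_pair, short_pair)
```

Under these assumptions, two contig pairs separated by the same gap yield different expected link counts when one contig is short, so a fixed link-count threshold treats them unequally.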

Keywords
Genome analysis, Genome assembly, Mate pair next-generation sequencing, Scaffolding
National Category
Biochemistry and Molecular Biology
Identifiers
urn:nbn:se:kth:diva-152583 (URN)
10.1186/1471-2105-15-281 (DOI)
000341198900001 ()
2-s2.0-84906826446 (Scopus ID)
Funder
Swedish Research Council, 2010-4634
Knut and Alice Wallenberg Foundation
Science for Life Laboratory - a national resource center for high-throughput molecular bioscience
Note

QC 20140929

Available from: 2014-09-29 Created: 2014-09-29 Last updated: 2017-12-05 Bibliographically approved
3. Genome scaffolding with PE-contaminated mate-pair libraries
2015 (English) Manuscript (preprint) (Other academic)
National Category
Bioinformatics (Computational Biology)
Identifiers
urn:nbn:se:kth:diva-173579 (URN)10.1101/025650 (DOI)
Note

QS 2015

Available from: 2015-09-14 Created: 2015-09-14 Last updated: 2016-02-02 Bibliographically approved
4. Correcting bias from stochastic insert size in read pair data—applications to structural variation detection and genome assembly
2015 (English) Manuscript (preprint) (Other academic)
National Category
Bioinformatics (Computational Biology)
Identifiers
urn:nbn:se:kth:diva-173585 (URN)10.1101/023929 (DOI)
Note

QS 2015

Available from: 2015-09-14 Created: 2015-09-14 Last updated: 2016-02-02 Bibliographically approved

Open Access in DiVA

Thesis (1088 kB), 344 downloads
File information
File name: FULLTEXT01.pdf
File size: 1088 kB
Checksum (SHA-512):
061b936b7a5a1bcffd659094bbfad50ed7398811ad035204d3f7bde6a63ff5013463a2d27318072caf3b72e9b77ff2b80f09ee4511376ba9fe2a94988c6bcd07
Type: fulltext
Mimetype: application/pdf

Search in DiVA

By author/editor
Sahlin, Kristoffer
By organisation
Computational Biology, CB
Bioinformatics (Computational Biology)
