doepipeline: a systematic approach to optimizing multi-level and multi-step data processing workflows
Umeå University, Faculty of Science and Technology, Department of Chemistry. ORCID iD: 0000-0002-4476-9255
Umeå University, Faculty of Science and Technology, Department of Chemistry; Corporate Research, Sartorius AG, Umeå, Sweden. ORCID iD: 0000-0001-7881-0968
2019 (English). In: BMC Bioinformatics, ISSN 1471-2105, E-ISSN 1471-2105, Vol. 20, no 1, article id 498. Article in journal (Refereed), Published
Abstract [en]

Background: Selecting the proper parameter settings for bioinformatic software tools is challenging. Not only does each parameter have an individual effect on the outcome, but there are also potential interaction effects between parameters. Both kinds of effect may be difficult to predict. To make the situation even more complex, multiple tools may be run in a sequential pipeline where the final output depends on the parameter configuration of each tool in the pipeline. Because of this complexity and the difficulty of predicting outcomes, in practice parameters are often left at default settings or set based on personal or peer experience obtained in a trial-and-error fashion. To allow for the reliable and efficient selection of parameters for bioinformatic pipelines, a systematic approach is needed.

Results: We present doepipeline, a novel approach to optimizing bioinformatic software parameters, based on core concepts of the Design of Experiments methodology and recent advances in subset designs. Optimal parameter settings are first approximated in a screening phase using a subset design that efficiently spans the entire search space, then optimized in the subsequent phase using response surface designs and ordinary least squares (OLS) modeling. Doepipeline was used to optimize parameters in four use cases: 1) de novo assembly, 2) scaffolding of a fragmented genome assembly, 3) k-mer taxonomic classification of Oxford Nanopore Technologies MinION reads, and 4) genetic variant calling. In all four cases, doepipeline found parameter settings that produced a better outcome, with respect to the characteristic measured, than the default values. Our approach is implemented and available in the Python package doepipeline.

Conclusions: Our proposed methodology provides a systematic and robust framework for optimizing software parameter settings, in contrast to labor- and time-intensive manual parameter tweaking. Implementation in doepipeline makes our methodology accessible and user-friendly, and allows for automatic optimization of tools in a wide range of cases. The source code of doepipeline is available at https://github.com/clicumu/doepipeline and it can be installed through conda-forge.
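
The optimization phase described in the Results (a response surface design followed by OLS modeling) can be illustrated with a minimal sketch. This is not doepipeline's API or its actual implementation: the function names, the face-centered design, and the toy response below are all hypothetical, chosen only to show the general technique of fitting a quadratic model by least squares and locating its stationary point.

```python
import itertools

import numpy as np


def face_centered_design(center, step):
    """Return a face-centered central composite design around `center`.

    Hypothetical helper: corner points, axial points on the faces,
    and one center point, scaled by `step` per factor.
    """
    center = np.asarray(center, dtype=float)
    step = np.asarray(step, dtype=float)
    k = center.size
    corners = np.array(list(itertools.product([-1.0, 1.0], repeat=k)))
    axial = np.vstack([np.eye(k), -np.eye(k)])
    coded = np.vstack([corners, axial, np.zeros((1, k))])
    return center + coded * step


def quadratic_terms(X):
    """Expand factors into intercept, linear, interaction and squared terms."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    cols = [np.ones(n)]
    cols += [X[:, i] for i in range(k)]
    cols += [X[:, i] * X[:, j] for i, j in itertools.combinations(range(k), 2)]
    cols += [X[:, i] ** 2 for i in range(k)]
    return np.column_stack(cols)


def fit_response_surface(X, y):
    """Fit the quadratic model by ordinary least squares."""
    coef, *_ = np.linalg.lstsq(quadratic_terms(X), y, rcond=None)
    return coef


def stationary_point(coef, k):
    """Solve gradient = 0 for y = b0 + b.x + x'Bx, i.e. -2Bx = b.

    The result is a maximum only if B is negative definite.
    """
    linear = coef[1:1 + k]
    pairs = list(itertools.combinations(range(k), 2))
    B = np.zeros((k, k))
    for (i, j), c in zip(pairs, coef[1 + k:1 + k + len(pairs)]):
        B[i, j] = B[j, i] = c / 2.0
    B[np.diag_indices(k)] = coef[1 + k + len(pairs):]
    return np.linalg.solve(-2.0 * B, linear)


def toy_response(X):
    """Made-up response with its maximum at (3, 1); stands in for a pipeline score."""
    a = X[:, 0] - 3.0
    b = X[:, 1] - 1.0
    return 10.0 - a ** 2 - 2.0 * b ** 2 + 0.5 * a * b


# Assume a screening phase suggested (2, 0) as an approximate optimum.
design = face_centered_design(center=[2.0, 0.0], step=[1.0, 1.0])
coef = fit_response_surface(design, toy_response(design))
optimum = stationary_point(coef, k=2)
print(optimum)  # close to [3.0, 1.0] for this toy response
```

In a real pipeline each design row would be one run of the tools with those parameter values, and the response would be the measured quality of the output rather than a closed-form function.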

Place, publisher, year, edition, pages
BioMed Central, 2019. Vol. 20, no 1, article id 498
Keywords [en]
Design of Experiments, Optimization, Sequencing, Nanopore, MinION, Assembly, Classification, Scaffolding, Variant calling
National Category
Bioinformatics and Systems Biology; Bioinformatics (Computational Biology)
Identifiers
URN: urn:nbn:se:umu:diva-164986
DOI: 10.1186/s12859-019-3091-z
ISI: 000490501600003
PubMedID: 31615395
OAI: oai:DiVA.org:umu-164986
DiVA, id: diva2:1369182
Funder
Knut and Alice Wallenberg Foundation, 2011.0042
Swedish Research Council, 2016-04376
eSSENCE - An eScience Collaboration
Swedish Armed Forces

Available from: 2019-11-11. Created: 2019-11-11. Last updated: 2019-11-11. Bibliographically approved.

Open Access in DiVA

fulltext (814 kB)
File information
File name: FULLTEXT01.pdf
File size: 814 kB
Checksum (SHA-512): 8b95d52ce6ebeeceb90dff9d1edda7072500c9ed26eca854cb17e1978be6f765c676a37bd075b7b23467c353165051bf8bae7e9a0adb08f854e04c9525bc7973
Type: fulltext
Mimetype: application/pdf

By author/editor
Svensson, Daniel; Sjögren, Rickard; Trygg, Johan