Phylogenetic Tree Estimation With and Without Alignment: New Distance Methods and Benchmarking
Uppsala University, Disciplinary Domain of Science and Technology, Biology, Department of Ecology and Genetics, Evolutionary Biology. (Whelan Lab) ORCID iD: 0000-0003-3056-3173
Uppsala University, Disciplinary Domain of Science and Technology, Biology, Department of Ecology and Genetics, Evolutionary Biology.
2017 (English). In: Systematic Biology, ISSN 1063-5157, E-ISSN 1076-836X, Vol. 66, no 2, p. 218-231. Article in journal (Refereed), Published
Abstract [en]

Phylogenetic tree inference is a critical component of many systematic and evolutionary studies. The majority of these studies are based on the two-step process of multiple sequence alignment followed by tree inference, despite persistent evidence that the alignment step can lead to biased results. Here we present a two-part study that first introduces PaHMM-Tree, a novel neighbor joining-based method that estimates pairwise distances without assuming a single alignment. We then use simulations to benchmark its performance against a wide range of other phylogenetic tree inference methods, including the first comparison of alignment-free distance-based methods against more conventional tree estimation methods. Our new method for calculating pairwise distances based on statistical alignment provides distance estimates that are as accurate as those obtained using standard methods based on the true alignment. Pairwise distance estimates based on the two-step process tend to be substantially less accurate. This improved performance carries through to tree inference, where PaHMM-Tree provides more accurate tree estimates than all of the pairwise distance methods assessed. For close to moderately divergent sequence data we find that the two-step methods using statistical inference, where information from all sequences is included in the estimation procedure, tend to perform better than PaHMM-Tree, particularly full statistical alignment, which simultaneously estimates both the tree and the alignment. For deep divergences we find the alignment step becomes so prone to error that our distance-based PaHMM-Tree outperforms all other methods of tree inference.
Finally, we find that the accuracy of alignment-free methods tends to decline faster than standard two-step methods in the presence of alignment uncertainty, and identify no conditions where alignment-free methods are equal to or more accurate than standard phylogenetic methods even in the presence of substantial alignment error.

Place, publisher, year, edition, pages
2017. Vol. 66, no 2, p. 218-231
Keywords [en]
Alignment-free, distance-based phylogenetics, pair Hidden Markov Models, phylogenetic inference, statistical alignment
National Category
Evolutionary Biology
Identifiers
URN: urn:nbn:se:uu:diva-316533
DOI: 10.1093/sysbio/syw074
ISI: 000397703800009
PubMedID: 27633353
OAI: oai:DiVA.org:uu-316533
DiVA, id: diva2:1078108
Available from: 2017-03-02 Created: 2017-03-02 Last updated: 2018-09-19 Bibliographically approved
In thesis
1. Evolutionary Approaches to Sequence Alignment
2018 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Molecular evolutionary biology allows us to look into the past by analyzing sequences of amino acids or nucleotides. These analyses can be very complex, often involving advanced statistical models of sequence evolution to construct phylogenetic trees, study the patterns of natural selection and perform a number of other evolutionary studies. In many cases, these studies first require a multiple sequence alignment (MSA), a technique that aims to group characters sharing a common ancestor (homologous characters) into columns. This information regarding shared homology is needed by statistical models to describe the process of substitutions in order to perform evolutionary inference. Sequence alignment, however, is difficult, and MSAs often contain whole regions of wrongly aligned characters, which impact downstream analyses.

In this thesis I use two broad groups of approaches to avoid errors in the alignment. The first group comprises analysis methods that work without a sequence alignment by explicitly modelling the processes of substitutions, and insertions and deletions (indels), between pairs of sequences using pair hidden Markov models. I describe an accurate tree inference method that uses a neighbor joining clustering approach to construct a tree from a matrix of model-based evolutionary distances.
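
As a rough illustration of the clustering step mentioned above, the following is a minimal sketch of textbook neighbor joining applied to a pairwise distance matrix. It is not the PaHMM-Tree implementation: the function name `neighbor_joining`, the internal node labels and the toy distances are all assumptions made for this example, and in the thesis the distances would come from pair hidden Markov models rather than being typed in by hand.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Textbook neighbor joining (illustrative sketch only).

    names: list of taxon labels.
    dist:  dict mapping frozenset({a, b}) -> pairwise distance.
    Returns a list of (node, neighbor, branch_length) edges of an unrooted tree.
    """
    d = dict(dist)          # working copy; distances to new internal nodes are added here
    nodes = list(names)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Row sums of the current distance matrix.
        r = {a: sum(d[frozenset((a, b))] for b in nodes if b != a) for a in nodes}
        # Join the pair that minimises Q(i, j) = (n - 2) * d(i, j) - r(i) - r(j).
        i, j = min(
            combinations(nodes, 2),
            key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]],
        )
        dij = d[frozenset((i, j))]
        # Branch lengths from the new internal node to i and to j.
        li = dij / 2 + (r[i] - r[j]) / (2 * (n - 2))
        lj = dij - li
        u = f"node{next_id}"  # hypothetical label for the new internal node
        next_id += 1
        edges.append((u, i, li))
        edges.append((u, j, lj))
        # Distances from the new node to every remaining node.
        for k in nodes:
            if k != i and k != j:
                d[frozenset((u, k))] = (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - dij
                ) / 2
        nodes = [k for k in nodes if k != i and k != j] + [u]
    # Connect the last two nodes directly.
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))
    return edges


# Toy distances for five taxa (invented values, not estimates from real data).
taxa = ["a", "b", "c", "d", "e"]
toy = {
    frozenset(("a", "b")): 5, frozenset(("a", "c")): 9,
    frozenset(("a", "d")): 9, frozenset(("a", "e")): 8,
    frozenset(("b", "c")): 10, frozenset(("b", "d")): 10,
    frozenset(("b", "e")): 9, frozenset(("c", "d")): 8,
    frozenset(("c", "e")): 7, frozenset(("d", "e")): 3,
}
edges = neighbor_joining(taxa, toy)
for parent, child, length in edges:
    print(parent, child, length)
```

Because neighbor joining only needs the distance matrix, any method of estimating pairwise distances can be plugged in without changing the clustering step.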

Next, I develop a pairwise method of modelling how natural selection acts on substitutions and indels. I further examine the relationship between the constraints acting on these two evolutionary forces and show that natural selection affects them in a similar way.

The second group of approaches deals with errors in existing alignments. I use a statistical model-based approach to evaluate the quality of multiple sequence alignments.

First, I provide a graph-based tool for removing wrongly aligned pairs of residues by splitting them apart. This approach tends to produce better results when compared to standard column-based filtering.

Second, I provide a way to compare MSAs using a probabilistic framework. I propose new ways of scoring sequence alignments and show that popular methods produce similar results.

The overall purpose of this work is to facilitate more accurate evolutionary analyses by addressing the problem of sequence alignment in a statistically rigorous manner.

Place, publisher, year, edition, pages
Uppsala: Acta Universitatis Upsaliensis, 2018. p. 57
Series
Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology, ISSN 1651-6214 ; 1723
Keywords
molecular evolution, multiple sequence alignment, pair hidden Markov models
National Category
Evolutionary Biology
Research subject
Biology with specialization in Evolutionary Genetics
Identifiers
URN: urn:nbn:se:uu:diva-360871
ISBN: 978-91-513-0445-8
Public defence
2018-11-09, Ekmansalen, EBC, Norrbyvägen 14, Uppsala, 09:00 (English)
Available from: 2018-10-17 Created: 2018-09-19 Last updated: 2018-10-17

Open Access in DiVA

fulltext (875 kB), 158 downloads
File information
File name: FULLTEXT01.pdf
File size: 875 kB
Checksum (SHA-512): 520c7187e65d2628750c2458bb7753c8987f3b92db88aec85ff9eeeac3b4d3a20ec954f6191455d4855522995c49f3ccf2cec763aa5ad4ede6a99729000cbea4
Type: fulltext
Mimetype: application/pdf

Search in DiVA

By author/editor
Bogusz, Marcin; Whelan, Simon