Probabilistic Modelling of Domain and Gene Evolution
KTH, School of Computer Science and Communication (CSC), Computational Science and Technology (CST). (Jens Lagergren) ORCID iD: 0000-0002-6664-1607
2016 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Phylogenetic inference relies heavily on statistical models that have been extended and refined over the past years into complex hierarchical models to capture the intricacies of evolutionary processes. The wealth of information in the form of fully sequenced genomes has led to the development of methods that reconstruct gene and species evolutionary histories in greater and more accurate detail. However, genes are composed of evolutionarily conserved sequence segments called domains, and domains can also be affected by duplications, losses, and bifurcations implied by gene or species evolution. This thesis proposes an extension of evolutionary models, such as duplication-loss, rate, and substitution models, that have previously been used to model gene evolution, to model domain evolution.

In this thesis, I propose DomainDLRS: a comprehensive, hierarchical Bayesian method, based on the DLRS model by Åkerborg et al. (2009), that models domain evolution as occurring inside the gene and species trees. The method incorporates a birth-death process to model domain duplications and losses, along with a domain sequence evolution model with a relaxed molecular clock assumption. The method employs a variant of the Markov Chain Monte Carlo technique called Grouped Independence Metropolis-Hastings to estimate the posterior distribution over domain and gene trees. Using this method, we analysed the Zinc-Finger and PRDM9 gene families, which provides interesting insights into domain evolution.
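
As a rough illustration of the Grouped Independence Metropolis-Hastings idea mentioned above, the following toy sketch replaces an intractable likelihood with an unbiased Monte Carlo estimate and keeps the stored estimate for the current state between iterations. It is not the DomainDLRS sampler: the scalar parameter `theta`, the latent variable `z`, the Gaussian model, the prior, and all function names are assumptions chosen only to keep the example small.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def estimate_likelihood(y, theta, n_samples, rng):
    """Unbiased estimate of p(y | theta) for the toy model
    z ~ N(theta, 1), y ~ N(z, 1), obtained by averaging over sampled z."""
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(theta, 1.0)
        total += normal_pdf(y, z, 1.0)
    return total / n_samples


def gimh(y, n_iter, n_samples, step, rng):
    """Toy GIMH chain over theta with a N(0, 100) prior (an assumption)."""
    theta = 0.0
    like = estimate_likelihood(y, theta, n_samples, rng)
    prior = normal_pdf(theta, 0.0, 100.0)
    chain = []
    for _ in range(n_iter):
        # Symmetric random-walk proposal, so the proposal densities cancel.
        proposal = theta + rng.gauss(0.0, step)
        prop_like = estimate_likelihood(y, proposal, n_samples, rng)
        prop_prior = normal_pdf(proposal, 0.0, 100.0)
        # The estimate for the current state is reused, not recomputed.
        ratio = (prop_like * prop_prior) / (like * prior)
        if rng.random() < ratio:
            theta, like, prior = proposal, prop_like, prop_prior
        chain.append(theta)
    return chain
```

In the thesis setting the estimated quantity would involve trees rather than a scalar latent variable; the sketch only shows the accept/reject step working with an estimated likelihood.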

Finally, a synteny-aware approach for gene homology inference, called GenFamClust, is proposed that uses similarity and gene neighbourhood conservation to improve homology inference. We evaluated the accuracy of our method on synthetic data and two biological datasets consisting of eukaryotic and fungal species. Our results show that combining synteny with similarity significantly improves homology inference.

Place, publisher, year, edition, pages
Stockholm, Sweden: KTH Royal Institute of Technology, 2016. 69 p.
Series
TRITA-CSC-A, ISSN 1653-5723 ; 19
Keyword [en]
Phylogenetics, Phylogenomics, Evolution, Domain Evolution, Gene tree, Domain tree, Bayesian Inference, Markov Chain Monte Carlo, Homology Inference, Gene families, C2H2 Zinc-Finger, Reelin Protein
National Category
Bioinformatics (Computational Biology)
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-191352
ISBN: 978-91-7729-091-9
OAI: oai:DiVA.org:kth-191352
DiVA: diva2:956729
Public defence
2016-09-26, Conference room Air, SciLifeLab, Tomtebodavägen 23A, Solna, Stockholm, Stockholm, 09:00 (English)
Funder
Swedish e‐Science Research Center
Science for Life Laboratory - a national resource center for high-throughput molecular bioscience
Note

QC 20160904

Available from: 2016-09-04 Created: 2016-08-29 Last updated: 2016-09-04 Bibliographically approved
List of papers
1. Species tree aware simultaneous reconstruction of gene and domain evolution
(English) Manuscript (preprint) (Other academic)
Abstract [en]

Most genes are composed of multiple domains with a common evolutionary history that typically perform a specific function in the resulting protein. As witnessed by many studies of key gene families, it is important to understand how domains have been duplicated, lost, transferred between genes, and rearranged. Similarly to the case of evolutionary events affecting entire genes, these domain events have large consequences for phylogenetic reconstruction and, in addition, they create considerable obstacles for gene sequence alignment algorithms, a prerequisite for phylogenetic reconstruction.

We introduce the Domain-DLRS model, a hierarchical, generative probabilistic model containing three levels corresponding to species, genes, and domains, respectively. From a dated species tree, a gene tree is generated according to the DL model, which is a birth-death model generalized to occur in a dated tree. Then, from the dated gene tree, a pre-specified number of dated domain trees are generated using the DL model and the molecular clock is relaxed, effectively converting edge times to edge lengths. Finally, for each domain tree and its lengths, domain sequences are generated for the leaves based on a selected model of sequence evolution.
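
Two ingredients of the generative description above can be sketched in a few lines: a linear birth-death process running along one dated edge, and a relaxed clock that turns an edge time into an edge length. This is a minimal sketch under stated assumptions, not the Domain-DLRS implementation: the function names are hypothetical, and drawing the per-edge rate from a gamma distribution with mean 1 is an illustrative choice.

```python
import random


def simulate_lineages(edge_time, birth_rate, death_rate, rng):
    """Number of lineages alive at the end of an edge of duration edge_time,
    starting from one lineage, under a linear birth-death process."""
    n = 1
    t = 0.0
    while n > 0:
        total_rate = n * (birth_rate + death_rate)
        if total_rate == 0.0:
            break  # no events can occur
        t += rng.expovariate(total_rate)  # waiting time to the next event
        if t >= edge_time:
            break  # the edge ends before the next event
        if rng.random() < birth_rate / (birth_rate + death_rate):
            n += 1  # duplication
        else:
            n -= 1  # loss
    return n


def relax_clock(edge_time, rng, shape=2.0):
    """Convert an edge time to an edge length by multiplying with a
    substitution rate drawn from Gamma(shape, 1/shape), which has mean 1
    (assumed rate distribution, for illustration only)."""
    rate = rng.gammavariate(shape, 1.0 / shape)
    return edge_time * rate
```

For example, `simulate_lineages(1.0, 0.3, 0.2, random.Random(0))` draws the number of copies surviving one edge; applying this edge by edge down a dated tree is the idea behind generating a gene tree inside a species tree, or a domain tree inside a gene tree.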

For this model, we present an MCMC-based inference framework called Domain-DLRS that takes as input a dated species tree together with a multiple sequence alignment for each domain family, and provides as output an estimated posterior distribution over reconciled gene and domain trees. By requiring aligned domains rather than genes, our framework evades the problem of aligning genes that have been exposed to domain duplications, in particular non-tandem domain duplications. We show that Domain-DLRS performs better than MrBayes on synthetic data and that it outperforms MrBayes on biological data. We analyse several zinc-finger genes and show that most domain duplications have been tandem duplications, of which some have involved two or more domains, but non-tandem duplications have also been common, in particular in gene families of complex evolutionary history such as PRDM9.

Keyword
Probabilistic Modeling, Domain Evolution, Bayesian Inference, Domain Tree Reconstruction
National Category
Bioinformatics (Computational Biology)
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-191349 (URN)
External cooperation:
Funder
Swedish e‐Science Research Center
Note

QC 20160902

Available from: 2016-08-29 Created: 2016-08-29 Last updated: 2016-09-02 Bibliographically approved
2. Sequence Analysis and Evolutionary Studies of Reelin Proteins
2015 (English) In: Bioinformatics and Biology Insights, ISSN 1177-9322, E-ISSN 1177-9322, Vol. 9, 187-193 p. Article in journal (Refereed) Published
Abstract [en]

The reelin gene is conserved across many vertebrate species, including humans. The protein product of this gene plays several important roles in early brain development and in the regulation of neural network plasticity in the mature brain. With an extended structure of 3461 amino acids, consisting of eight reelin repeats, the human reelin sequence stands out as an exceptional model for evolutionary studies. In this study, sequence analysis of human reelin, its homologues, and reelin sequences from 104 other species is described in detail. Interesting sequence conservation patterns of individual repeats have been highlighted. Sequence phylogeny of the reelin sequences indicates a pattern similar to the evolution of the species, thereby serving as a highly conserved family for evolutionary purposes. Multiple sequence alignment of different reelin domain repeats, derived from homologues, suggests specific functions for individual repeats and high sequence conservation across reelin repeats from different organisms, albeit with a few unusual domain architectures. A three-dimensional structural model of the full-length human reelin is now available that provides clues on residues at the dimer interface.

Place, publisher, year, edition, pages
Libertas Academica, 2015
Keyword
reelin protein, glycoprotein, domain repeats, phylogeny, domain architecture, neurogenesis, 3D modeling
National Category
Biochemistry and Molecular Biology
Identifiers
urn:nbn:se:kth:diva-181010 (URN)
10.4137/BBI.S26530 (DOI)
000367288300004 (ISI)
26715843 (PubMedID)
2-s2.0-84961266629 (ScopusID)
External cooperation:
Note

QC 20160126

Available from: 2016-01-26 Created: 2016-01-26 Last updated: 2016-08-31 Bibliographically approved
3. Quantitative synteny scoring improves homology inference and partitioning of gene families
2013 (English) In: BMC Bioinformatics, ISSN 1471-2105, Vol. 14, S12- p. Article in journal (Refereed) Published
Abstract [en]

Background: Clustering sequences into families has long been an important step in the characterization of genes and proteins. Many algorithms have been developed for this purpose, most of which are based on either direct similarity between gene pairs or some sort of network structure, where weights on the edges of constructed graphs are based on similarity. However, conserved synteny is an important signal that can help distinguish homology, and it has not been utilized to its fullest potential. Results: Here, we present GenFamClust, a pipeline that combines the network properties of sequence similarity and synteny to assess homology relationships and merge known homologs into groups of gene families. GenFamClust identifies homologs in a more informed and accurate manner than similarity-based approaches. We tested our method against the Neighborhood Correlation method on two diverse datasets consisting of fully sequenced genomes of eukaryotes and synthetic data. Conclusions: The results obtained from both datasets confirm that synteny helps determine homology and that GenFamClust improves on the Neighborhood Correlation method. The accuracy as well as the definition of synteny scores are the most valuable contributions of GenFamClust.
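
To make the idea of a quantitative synteny signal concrete, the sketch below scores a gene pair by the best similarity found among their chromosomal neighbours. This is a hypothetical illustration, not the score defined by GenFamClust: the window size `k`, the representation of a genome as an ordered list of gene names, the `similarity` dictionary, and both function names are assumptions.

```python
def neighbourhood(genome, index, k):
    """Genes within k positions of genome[index], excluding the gene itself."""
    lo = max(0, index - k)
    return genome[lo:index] + genome[index + 1:index + 1 + k]


def synteny_score(genome_a, i, genome_b, j, similarity, k=2):
    """Highest similarity between any neighbour of genome_a[i] and any
    neighbour of genome_b[j]; 0.0 if no neighbouring pair is similar."""
    best = 0.0
    for g in neighbourhood(genome_a, i, k):
        for h in neighbourhood(genome_b, j, k):
            # Similarities are stored once per unordered pair.
            pair = (g, h) if (g, h) in similarity else (h, g)
            best = max(best, similarity.get(pair, 0.0))
    return best
```

A pair of genes with weak direct similarity but a high neighbourhood score is the kind of case where gene order can add evidence for homology beyond sequence similarity alone.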

Place, publisher, year, edition, pages
BioMed Central, 2013
Keyword
Efficient Algorithm, Eukaryotic Genomes, Protein Families, Orthologs, Identification, Clusters, Alignment, Blast, Link
National Category
Bioinformatics (Computational Biology)
Identifiers
urn:nbn:se:kth:diva-136429 (URN)
10.1186/1471-2105-14-S15-S12 (DOI)
000328316700012 (ISI)
Conference
11th Annual Research in Computational Molecular Biology (RECOMB) Satellite Workshop on Comparative Genomics, Lyon, France, Oct 17-19, 2013
Funder
Swedish e‐Science Research Center
Science for Life Laboratory - a national resource center for high-throughput molecular bioscience
Note

QC 20131219

Available from: 2013-12-05 Created: 2013-12-05 Last updated: 2016-08-31 Bibliographically approved
4. GenFamClust: An accurate, synteny-aware and reliable homology inference algorithm
2016 (English) In: BMC Evolutionary Biology, ISSN 1471-2148, Vol. 16. Article in journal (Other academic) Published
Abstract [en]

Background: Homology inference is pivotal to evolutionary biology and is primarily based on significant sequence similarity, which, in general, is a good indicator of homology. Algorithms have also been designed to utilize conservation in gene order as an indication of homologous regions. We have developed GenFamClust, a method based on quantification of both gene order conservation and sequence similarity. Results: In this study, we validate GenFamClust by comparing it to well-known homology inference algorithms on a synthetic dataset. We applied several popular clustering algorithms to homologs inferred by GenFamClust and other algorithms on a metazoan dataset and studied the outcomes. Accuracy, similarity, dependence, and other characteristics were investigated for the gene families yielded by the clustering algorithms. GenFamClust was also applied to genes from a set of complete fungal genomes, and gene families were inferred using clustering. The resulting gene families were compared with a manually curated gold standard of pillars from the Yeast Gene Order Browser. We found that the gene-order component of GenFamClust is simple, yet biologically realistic, and captures local synteny information for homologs. Conclusions: The study shows that GenFamClust is a more accurate, informed, and comprehensive pipeline for inferring homologs and gene families than other commonly used homology and gene-family inference methods.

Place, publisher, year, edition, pages
BioMed Central, 2016
Keyword
Homology inference; Gene synteny; Gene similarity; Gene family; Clustering; Gene order conservation
National Category
Bioinformatics and Systems Biology
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-180542 (URN)
10.1186/s12862-016-0684-2 (DOI)
000377161400002 (ISI)
27260514 (PubMedID)
2-s2.0-84973324604 (ScopusID)
Funder
Swedish e‐Science Research Center
Note

QC 20160628

Available from: 2016-01-18 Created: 2016-01-18 Last updated: 2016-08-31 Bibliographically approved

Open Access in DiVA

thesis.pdf (1755 kB), 34 downloads
File information
File name: FULLTEXT01.pdf
File size: 1755 kB
Checksum (SHA-512): d7621d06f729673637530b572eb6173e56d9bafd0ff7fb812c82f7b2195d1840e64afca76e86a7d3fa73f34697a60247ac18f4dd81187b887575026be2c2f95e
Type: fulltext
Mimetype: application/pdf

Search in DiVA

By author/editor
Muhammad, Sayyed Auwn