From Sequence to Structure: Using predicted residue contacts to facilitate template-free protein structure prediction
Stockholm University, Faculty of Science, Department of Biochemistry and Biophysics.
2017 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Despite the fundamental role of experimental protein structure determination, computational methods are essential to bridge the ever-growing gap between available protein sequence and structure data. Common structure prediction methods rely on experimental data, which is not available for about half of the known protein families.

Recent advancements in amino acid contact prediction have revolutionized the field of protein structure prediction. Contacts can be used to guide template-free structure predictions that do not rely on experimentally solved structures of homologous proteins. Such methods are now able to produce accurate models for a wide range of protein families.

We developed PconsC2, an approach that improved on existing contact prediction methods by recognizing intra-molecular contact patterns and reducing noise. An inherent problem of contact prediction based on maximum entropy models is that large alignments with over 1000 effective sequences are needed to infer contacts accurately. However, such alignments are not available for more than 80% of the protein families that do not have a representative structure in PDB. With PconsC3, we extended the applicability of contact prediction to families with as few as 100 effective sequences by combining global inference methods with machine learning based on local pairwise measures.

By introducing PconsFold, a pipeline for contact-based structure prediction, we could show that improvements in contact prediction accuracy translate to more accurate models. Finally, we applied a similar technique to Pfam, a comprehensive database of known protein families. In addition to using a faster folding protocol, we employed model quality assessment methods, which are crucial for estimating the confidence in the accuracy of predicted models. We propose models to be accurate for 558 families that do not have a representative known structure. Out of those, over 75% have not been reported before.

Place, publisher, year, edition, pages
Stockholm: Department of Biochemistry and Biophysics, Stockholm University, 2017.
Keywords [en]
protein bioinformatics, protein structure prediction, contact prediction, machine learning
National Category
Bioinformatics (Computational Biology)
Research subject
Biochemistry towards Bioinformatics
Identifiers
URN: urn:nbn:se:su:diva-141946
ISBN: 978-91-7649-811-8 (print)
ISBN: 978-91-7649-812-5 (electronic)
OAI: oai:DiVA.org:su-141946
DiVA, id: diva2:1090939
Public defence
2017-06-09, Magnélisalen, Kemiska övningslaboratoriet, Svante Arrhenius väg 16 B, Stockholm, 13:00 (English)
Note

At the time of the doctoral defense, the following papers were unpublished and had a status as follows: Paper 2: Submitted. Paper 4: In press.

 

Available from: 2017-05-17 Created: 2017-04-25 Last updated: 2018-01-13 Bibliographically approved
List of papers
1. Improved Contact Predictions Using the Recognition of Protein Like Contact Patterns
2014 (English). In: PLoS Computational Biology, ISSN 1553-734X, E-ISSN 1553-7358, Vol. 10, no 11, article id e1003889. Article in journal (Refereed), Published
Abstract [en]

Given sufficiently large protein families, and using a global statistical inference approach, it is possible to obtain sufficient accuracy in protein residue contact predictions to predict the structure of many proteins. However, these approaches do not consider the fact that the contacts in a protein are neither randomly nor independently distributed, but actually follow precise rules governed by the structure of the protein and thus are interdependent. Here, we present PconsC2, a novel method that uses a deep learning approach to identify protein-like contact patterns to improve contact predictions. A substantial enhancement can be seen for all contacts, independently of the number of aligned sequences, residue separation or secondary structure type, but it is largest for β-sheet containing proteins. In addition to being superior to earlier methods based on statistical inferences, PconsC2 is superior to state-of-the-art methods using machine learning for families with more than 100 effective sequence homologs. The improved contact prediction enables improved structure prediction.

National Category
Biological Sciences
Research subject
Biochemistry towards Bioinformatics
Identifiers
urn:nbn:se:su:diva-111936 (URN)
10.1371/journal.pcbi.1003889 (DOI)
000345454400003 ()
Funder
Swedish Research Council
Note

AuthorCount:4;

Available from: 2015-01-08 Created: 2015-01-08 Last updated: 2017-05-08 Bibliographically approved
2. Predicting accurate contacts in thousands of Pfam domain families using PconsC3
(English). In: Bioinformatics, ISSN 1367-4803, E-ISSN 1367-4811. Article in journal (Refereed), In press
Abstract [en]

Motivation: A few years ago it was shown that by using a maximum entropy approach to describe couplings between columns in a multiple sequence alignment, it is possible to significantly increase the accuracy of residue contact predictions. For very large protein families with more than 1000 effective sequences, the accuracy is sufficient to produce accurate models of proteins as well as complexes. Today, no structure is known for about half of all Pfam domain families, but unfortunately most of these families have at most a few hundred members, i.e. they are too small for such contact prediction methods.

Results: To extend accurate contact predictions to the thousands of smaller protein families, we present PconsC3, a fast and improved method for protein contact predictions that can be used for families with as few as 100 effective sequence members. PconsC3 significantly outperforms direct coupling analysis (DCA) methods, independent of family size, secondary structure content, contact range, or the number of selected contacts.

Availability: PconsC3 is available as a web server and downloadable version at http://c3.pcons.net. The downloadable version is free for all to use and licensed under the GNU General Public License, version 2. Contact predictions for most Pfam families are also available at this site. We estimate that more than 4000 contact maps for Pfam families of unknown structure have more than 50% of the top-ranked contacts predicted correctly.

National Category
Bioinformatics (Computational Biology)
Research subject
Biochemistry towards Bioinformatics
Identifiers
urn:nbn:se:su:diva-141943 (URN)
Available from: 2017-04-21 Created: 2017-04-21 Last updated: 2018-01-13
3. PconsFold: improved contact predictions improve protein models
2014 (English). In: Bioinformatics, ISSN 1367-4803, E-ISSN 1367-4811, Vol. 30, no 17, p. 1482-1488. Article in journal (Refereed), Published
Abstract [en]

Motivation: Recently it has been shown that the quality of protein contact prediction from evolutionary information can be improved significantly if direct and indirect information is separated. Given sufficiently large protein families, the contact predictions contain sufficient information to predict the structure of many protein families. However, contact prediction methods have improved since the first studies. Here, we ask how much the final models are improved if improved contact predictions are used.

Results: In a small benchmark of 15 proteins, we show that the TM-scores of top-ranked models are improved by 33% on average using PconsFold compared with the original version of EVfold. In a larger benchmark, we find that the quality is improved by 15-30% when using PconsC in comparison with earlier contact prediction methods. Further, using Rosetta instead of CNS does not significantly improve global model accuracy, but the chemistry of models generated with Rosetta is improved.

National Category
Bioinformatics (Computational Biology)
Research subject
Biochemistry towards Bioinformatics
Identifiers
urn:nbn:se:su:diva-108840 (URN)
10.1093/bioinformatics/btu458 (DOI)
000342912400020 ()
25161237 (PubMedID)
Available from: 2014-11-04 Created: 2014-11-04 Last updated: 2018-01-11 Bibliographically approved
4. Large-scale structure prediction by improved contact predictions and model quality assessment
2017 (English). In: Bioinformatics, ISSN 1367-4803, E-ISSN 1367-4811, Vol. 33, no 14, p. 123-129. Article in journal (Refereed), Published
Abstract [en]

Motivation: Accurate contact predictions can be used for predicting the structure of proteins. Until recently these methods were limited to very large protein families, decreasing their utility. However, recent progress in combining direct coupling analysis with machine learning methods has made it possible to predict accurate contact maps for smaller families. To what extent these predictions can be used to produce accurate models of the families is not known.

Results: We present the PconsFold2 pipeline, which uses contact predictions from PconsC3, the CONFOLD folding algorithm and model quality estimations to predict the structure of a protein. We show that the model quality estimation significantly increases the number of models that can be reliably identified. Finally, we apply PconsFold2 to 6379 Pfam families of unknown structure and find that PconsFold2 can, with an estimated 90% specificity, predict the structure of up to 558 Pfam families of unknown structure. Out of these, 415 have not been reported before.

Availability: Datasets as well as models of all the 558 Pfam families are available at http://c3.pcons.net. All programs used here are freely available.

National Category
Bioinformatics (Computational Biology)
Research subject
Biochemistry towards Bioinformatics
Identifiers
urn:nbn:se:su:diva-141945 (URN)
10.1093/bioinformatics/btx239 (DOI)
000405289100005 ()
Available from: 2017-04-21 Created: 2017-04-21 Last updated: 2018-01-13 Bibliographically approved

Open Access in DiVA

From Sequence to Structure (1959 kB), 133 downloads
File information
File name: FULLTEXT01.pdf
File size: 1959 kB
Checksum (SHA-512): 65d3be8263b6e1515705ab99045aba732107a75eec5a5f2294527d8dde7e6a706530e362325a48d6b014e63354cafb4888e294bf74bb26188f8098c336b21ece
Type: fulltext
Mimetype: application/pdf

Author
Michel, Mirco
