Digitala Vetenskapliga Arkivet

Structured Learning for Structural Bioinformatics: Applications of Deep Learning to Protein Structure Prediction
Stockholm University, Faculty of Science, Department of Biochemistry and Biophysics. ORCID iD: 0000-0003-3534-2986
2019 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Proteins are the basic molecular machines of the cell, performing a broad range of tasks, from structural support to catalysis of chemical reactions. Their function is determined by their 3D structure, which in turn is dictated by the order of their components, the amino acids.

This thesis is dedicated to applications of machine learning to the problems of contact prediction, ab initio structure prediction, and model quality assessment. In particular, my research has focused on developing methods that are both effective and easy to use.

In the first paper, we improved the already state-of-the-art model quality assessment (MQA) program ProQ3 by replacing the underlying machine learning algorithm, a support vector machine (SVM), with deep learning; the resulting method is called ProQ3D. The correlation between predicted and true scores improved from 0.85 to 0.90, using the same training data and features.

The second paper joined several programs into a single pipeline for ab initio structure prediction: contact prediction, folding, and model selection. We attempted to predict the structures of all 6379 Pfam families with unknown structure, of which we believe 558 to be accurate. Of these, 415 had not been reported before.

The third paper uses advances in machine learning to build a contact predictor, PconsC4, that is fast and easy to deploy in large-scale studies, since it requires only a single Multiple Sequence Alignment (MSA) and no external dependencies. The predictions are state-of-the-art, yielding a 12% improvement in precision over PconsC3 while running 244 times faster.

With ProQ4, in the fourth paper, we introduce a novel way of training deep networks for MQA that minimises the bias of the training data and emphasises model ranking, and we demonstrate its viability with a minimal description of the protein. The ranking correlation improved from 0.82 for ProQ3D to 0.90.
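The record does not say how the ranking correlation is defined. As a minimal sketch, assuming it is a Spearman rank correlation computed separately for each target and then averaged (an assumption, not the thesis's stated procedure), it could be computed as follows. The function names and the toy scores are hypothetical.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = mean_rank
        pos = end + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined when all scores are equal")
    return cov / (vx * vy) ** 0.5


def mean_per_target_spearman(targets):
    """targets maps a target id to (predicted_scores, true_scores)."""
    values = [spearman(pred, true) for pred, true in targets.values()]
    return sum(values) / len(values)


# Toy data, invented for illustration only.
toy_targets = {
    "T1": ([0.1, 0.5, 0.9], [0.2, 0.4, 0.8]),  # models ranked perfectly
    "T2": ([0.3, 0.2, 0.1], [0.1, 0.2, 0.3]),  # models ranked in reverse
}
print(mean_per_target_spearman(toy_targets))
```

Averaging per target rewards a method for ordering the models of each protein correctly, regardless of how scores compare across different proteins.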

Lastly, in the fifth paper, we show the results of ProQ3D and ProQ4 in a completely blind test: CASP13.

Place, publisher, year, edition, pages
Stockholm: Department of Biochemistry and Biophysics, Stockholm University, 2019. p. 63
National Category
Bioinformatics and Computational Biology
Research subject
Biochemistry towards Bioinformatics
Identifiers
URN: urn:nbn:se:su:diva-172395
ISBN: 978-91-7797-797-1 (print)
ISBN: 978-91-7797-798-8 (electronic)
OAI: oai:DiVA.org:su-172395
DiVA, id: diva2:1346726
Public defence
2019-10-11, Magnélisalen, Kemiska övningslaboratoriet, Svante Arrhenius väg 16 B, Stockholm, 13:00 (English)
Note

At the time of the doctoral defense, the following paper was unpublished and had a status as follows: Paper 4: Manuscript.

Available from: 2019-09-18 Created: 2019-08-28 Last updated: 2025-02-07 Bibliographically approved
List of papers
1. ProQ3D: improved model quality assessments using deep learning
2017 (English) In: Bioinformatics, ISSN 1367-4803, E-ISSN 1367-4811, Vol. 33, no 10, p. 1578-1580. Article in journal (Refereed) Published
Abstract [en]

Protein quality assessment is a long-standing problem in bioinformatics. For more than a decade we have developed state-of-the-art predictors by carefully selecting and optimising inputs to a machine learning method. The correlation has increased from 0.60 in ProQ to 0.81 in ProQ2 and 0.85 in ProQ3, mainly by adding a large set of carefully tuned descriptions of a protein. Here, we show that a substantial improvement can be obtained using exactly the same inputs as in ProQ2 or ProQ3 but replacing the support vector machine by a deep neural network. This improves the Pearson correlation to 0.90 (0.85 using ProQ2 input features).
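The Pearson correlation quoted above measures how linearly the predicted quality scores track the true ones. A minimal sketch of the computation follows; the function name and the example scores are hypothetical and are not taken from ProQ3D or its evaluation data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances, left unnormalised: the 1/n factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Scores invented for illustration only.
predicted = [0.2, 0.4, 0.6, 0.8]
true = [0.1, 0.5, 0.5, 0.9]
print(round(pearson(predicted, true), 3))
```

A value of 1.0 would mean the predictions are a perfect increasing linear function of the true scores; the coefficient is insensitive to a constant offset or positive rescaling of the predictions.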

Keywords
Model Quality Assessment, Protein Bioinformatics, Machine Learning, Deep Learning, Neural Networks, Multi Layer Perceptron, Deep neural networks
National Category
Bioinformatics and Computational Biology
Research subject
Biochemistry towards Bioinformatics
Identifiers
urn:nbn:se:su:diva-137679 (URN), 10.1093/bioinformatics/btw819 (DOI), 000402130700023
Funder
Swedish Research Council, VR-NT 2012-5046; Swedish Research Council, VR-NT 2012-5270; Swedish e‐Science Research Center
Available from: 2017-01-09 Created: 2017-01-09 Last updated: 2025-02-07 Bibliographically approved
2. Large-scale structure prediction by improved contact predictions and model quality assessment
2017 (English) In: Bioinformatics, ISSN 1367-4803, E-ISSN 1367-4811, Vol. 33, no 14, p. 123-129. Article in journal (Refereed) Published
Abstract [en]

Motivation: Accurate contact predictions can be used for predicting the structure of proteins. Until recently these methods were limited to very big protein families, decreasing their utility. However, recent progress by combining direct coupling analysis with machine learning methods has made it possible to predict accurate contact maps for smaller families. To what extent these predictions can be used to produce accurate models of the families is not known.

Results: We present the PconsFold2 pipeline that uses contact predictions from PconsC3, the CONFOLD folding algorithm and model quality estimations to predict the structure of a protein. We show that the model quality estimation significantly increases the number of models that reliably can be identified. Finally, we apply PconsFold2 to 6379 Pfam families of unknown structure and find that PconsFold2 can, with an estimated 90% specificity, predict the structure of up to 558 Pfam families of unknown structure. Out of these 415 have not been reported before.

Availability: Datasets as well as models of all the 558 Pfam families are available at http://c3.pcons.net. All programs used here are freely available.

National Category
Bioinformatics (Computational Biology)
Research subject
Biochemistry towards Bioinformatics
Identifiers
urn:nbn:se:su:diva-141945 (URN), 10.1093/bioinformatics/btx239 (DOI), 000405289100005
Available from: 2017-04-21 Created: 2017-04-21 Last updated: 2022-03-23 Bibliographically approved
3. PconsC4: fast, accurate and hassle-free contact predictions
2019 (English) In: Bioinformatics, ISSN 1367-4803, E-ISSN 1367-4811, Vol. 35, no 15, p. 2677-2679. Article in journal (Refereed) Published
Abstract [en]

Motivation

Residue contact prediction was revolutionized recently by the introduction of direct coupling analysis (DCA). Further improvements, in particular for small families, have been obtained by the combination of DCA and deep learning methods. However, existing deep learning contact prediction methods often rely on a number of external programs and are therefore computationally expensive.

Results

Here, we introduce a novel contact predictor, PconsC4, which performs on par with state-of-the-art methods. PconsC4 is heavily optimized and does not use any external programs, and is therefore significantly faster and easier to use than other methods.
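Contact predictors are commonly compared by the precision of their highest-scoring predictions. A minimal sketch of such a measure follows, assuming the top L pairs are evaluated for a sequence of length L and that pairs closer than 6 residues apart are ignored. Both choices, the function name, and the toy data are assumptions for illustration and do not come from PconsC4 or its benchmark.

```python
def top_l_precision(scores, true_contacts, seq_len, min_separation=6):
    """Fraction of the seq_len highest-scoring pairs that are true contacts.

    scores: dict mapping (i, j) with i < j to a predicted contact score.
    true_contacts: set of (i, j) pairs, i < j, in contact in the structure.
    min_separation: minimum sequence distance j - i (assumed default).
    """
    # Drop pairs that are too close in sequence to be informative.
    candidates = [
        (pair, score)
        for pair, score in scores.items()
        if pair[1] - pair[0] >= min_separation
    ]
    # Highest predicted score first.
    candidates.sort(key=lambda item: item[1], reverse=True)
    top = [pair for pair, _ in candidates[:seq_len]]
    if not top:
        return 0.0
    hits = sum(1 for pair in top if pair in true_contacts)
    return hits / len(top)


# Toy example, invented for illustration only.
toy_scores = {(0, 6): 0.9, (1, 8): 0.8, (2, 9): 0.7, (0, 9): 0.1, (0, 1): 0.99}
toy_true = {(0, 6), (2, 9)}
print(top_l_precision(toy_scores, toy_true, seq_len=3))
```

In the toy example the pair (0, 1) is excluded by the separation filter despite its high score, and two of the three remaining top pairs are true contacts.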

Availability and implementation

PconsC4 is freely available under the GPL license from https://github.com/ElofssonLab/PconsC4. Installation is easy using the pip command and works on any system with Python 3.5 or later and a GCC compiler. It does not require a GPU nor special hardware.

Supplementary information

Supplementary data are available at Bioinformatics online.

National Category
Bioinformatics and Computational Biology
Research subject
Biochemistry towards Bioinformatics
Identifiers
urn:nbn:se:su:diva-172392 (URN), 10.1093/bioinformatics/bty1036 (DOI), 000484378200024
Available from: 2019-08-28 Created: 2019-08-28 Last updated: 2025-02-07 Bibliographically approved
4. A novel training procedure to train deep networks in the assessment of the quality of protein models
(English) Manuscript (preprint) (Other academic)
Abstract [en]

Motivation: Proteins fold into complex structures that are crucial for their biological functions. Experimental determination of protein structures is costly and therefore limited to a small fraction of all known proteins. Hence, different computational structure prediction methods are necessary for the modelling of the vast majority of all proteins. In most structure prediction pipelines, the last step is to select the best available model and to estimate its accuracy. This model quality estimation problem has been growing in importance during the last decade, and progress is believed to be important for large scale modelling of proteins. Current machine learning models trained to estimate the protein model quality suffer from biases in the training set: multiple models of only a few targets, generated by a few methods.

Results: We propose a new methodology to train deep networks that leverages the structure of the problem and takes advantage of some of these redundancies. We demonstrate its viability by reaching results comparable with another state-of-the-art method, ProQ3D, trained and evaluated on the same datasets, but employing only a small subset of the input features. The proposed training strategy is applicable to other input features and datasets, and thus can be applied to other programs.

Availability: The code is freely available for download at github.com/ElofssonLab/ProQ4 and runs with minimal requirements: it requires only one multiple sequence alignment and a collection of models, and depends only on Python 3, hdf5, a deep learning framework compatible with Keras, and dssp.
Contact: arne@bioinfo.se

National Category
Bioinformatics and Computational Biology
Research subject
Biochemistry towards Bioinformatics
Identifiers
urn:nbn:se:su:diva-172393 (URN)
Available from: 2019-08-28 Created: 2019-08-28 Last updated: 2025-02-07 Bibliographically approved
5. Estimation of model accuracy in CASP13
2019 (English) In: Proteins: Structure, Function, and Bioinformatics, ISSN 0887-3585, E-ISSN 1097-0134, Vol. 87, no 12, p. 1361-1377. Article in journal (Refereed) Published
Abstract [en]

Methods to reliably estimate the accuracy of 3D models of proteins are both a fundamental part of most protein folding pipelines and important for reliable identification of the best models when multiple pipelines are used. Here, we describe the progress made from CASP12 to CASP13 in the field of estimation of model accuracy (EMA) as seen from the progress of the most successful methods in CASP13. We show small but clear progress, that is, several methods perform better than the best methods from CASP12 when tested on CASP13 EMA targets. Some progress is driven by applying deep learning and residue‐residue contacts to model accuracy prediction. We show that the best EMA methods select better models than the best servers in CASP13, but that there exists a great potential to improve this further. Also, according to the evaluation criteria based on local similarities, such as lDDT and CAD, it is now clear that single model accuracy methods perform relatively better than consensus‐based methods.

National Category
Bioinformatics and Computational Biology
Research subject
Biochemistry towards Bioinformatics
Identifiers
urn:nbn:se:su:diva-172394 (URN), 10.1002/prot.25767 (DOI), 000476102200001
Available from: 2019-08-28 Created: 2019-08-28 Last updated: 2025-02-07 Bibliographically approved

Open Access in DiVA

Structured Learning for Structural Bioinformatics (35899 kB), 1068 downloads
File information
File name: FULLTEXT01.pdf
File size: 35899 kB
Checksum (SHA-512): 6a1ce3f49bdacbc95b87e967a90660c56bb4dfd8fce6e7377b68d21ec3541a8a1edb282de6a72aeac06f0ff66b0f90443f412cdeb529b6b3deaf9d2a40bb2de7
Type: fulltext
Mimetype: application/pdf

Author
Menéndez Hurtado, David
