The accuracy of statistical confidence estimates in shotgun proteomics
Stockholm University, Faculty of Science, Department of Biochemistry and Biophysics.
2014 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

High-throughput techniques are currently some of the most promising methods to study molecular biology, with the potential to improve medicine and enable new biological applications. In proteomics, the large-scale study of proteins, the leading method is mass spectrometry. At present, researchers can routinely identify and quantify thousands of proteins in a single experiment with the technique called shotgun proteomics.

A challenge of these experiments is the computational analysis and the interpretation of the mass spectra. A shotgun proteomics experiment easily generates tens of thousands of spectra, each thought to represent a peptide from a protein. Due to the immense biological and technical complexity, however, our computational tools often misinterpret these spectra and derive incorrect peptides. As a consequence, the biological interpretation of the experiment relies heavily on the statistical confidence that we estimate for the identifications.

In this thesis, I have included four articles from my research on the accuracy of statistical confidence estimates in shotgun proteomics, and on how to achieve and evaluate that accuracy. The first two papers present a new method that uses pre-characterized protein samples to evaluate this accuracy. The third paper deals with how to avoid statistical inaccuracies when using machine learning techniques to analyze the data. In the fourth paper, we present a new tool for analyzing shotgun proteomics results, and evaluate the accuracy of its statistical estimates using the method from the first two papers.

The work I have included here can facilitate the development of new and accurate computational tools in mass spectrometry-based proteomics. Such tools will help make the interpretation of the spectra and the downstream biological conclusions more reliable.

Place, publisher, year, edition, pages
Stockholm: Department of Biochemistry and Biophysics, Stockholm University, 2014. 40 p.
Keyword [en]
Proteomics, Peptides, Statistics, Mass spectrometry, Tandem mass spectrometry
National Category
Bioinformatics (Computational Biology)
Research subject
Biochemistry towards Bioinformatics
Identifiers
URN: urn:nbn:se:su:diva-100769
ISBN: 978-91-7447-787-0 (print)
OAI: oai:DiVA.org:su-100769
DiVA: diva2:696094
Public defence
2014-04-04, Magnélisalen, Kemiska övningslaboratoriet, Svante Arrhenius väg 16 B, Stockholm, 09:30 (English)
Available from: 2014-03-13. Created: 2014-02-12. Last updated: 2014-02-14. Bibliographically approved.
List of papers
1. On Using Samples of Known Protein Content to Assess the Statistical Calibration of Scores Assigned to Peptide-Spectrum Matches in Shotgun Proteomics
2011 (English). In: Journal of Proteome Research, ISSN 1535-3893, E-ISSN 1535-3907, Vol. 10, no. 5, p. 2671-2678. Article in journal (Refereed), published.
Abstract [en]

In shotgun proteomics, the quality of a hypothesized match between an observed spectrum and a peptide sequence is quantified by a score function. Because the score function lies at the heart of any peptide identification pipeline, this function greatly affects the final results of a proteomics assay. Consequently, valid statistical methods for assessing the quality of a given score function are extremely important. Previously, several research groups have used samples of known protein composition to assess the quality of a given score function. We demonstrate that this approach is problematic, because the outcome can depend on factors other than the score function itself. We then propose an alternative use of the same type of data to validate a score function. The central idea of our approach is that database matches that are not explained by any protein in the purified sample comprise a robust representation of incorrect matches. We apply our alternative assessment scheme to several commonly used score functions, and we show that our approach generates a reproducible measure of the calibration of a given peptide identification method. Furthermore, we show how our quality test can be useful in the development of novel score functions.
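
The central idea above can be illustrated with a small numerical check. If a score function yields well-calibrated p values, then the p values of matches known to be incorrect (here, matches not explained by any protein in the purified sample) should be roughly uniform on [0, 1]. The Python sketch below is only an illustration under that assumption and is not the protocol from the paper; the function names `calibration_curve` and `max_calibration_gap` are hypothetical, and the input is assumed to be a plain list of p values.

```python
def calibration_curve(null_p_values, thresholds):
    """For each threshold t, return (t, fraction of known-incorrect matches with p <= t).

    Under good calibration the observed fraction should be close to t.
    """
    n = len(null_p_values)
    if n == 0:
        raise ValueError("need at least one p value")
    return [(t, sum(p <= t for p in null_p_values) / n) for t in thresholds]


def max_calibration_gap(null_p_values):
    """Largest distance between the empirical CDF of the p values and the
    uniform CDF (a Kolmogorov-Smirnov-style statistic)."""
    ps = sorted(null_p_values)
    n = len(ps)
    if n == 0:
        raise ValueError("need at least one p value")
    gap = 0.0
    for i, p in enumerate(ps):
        # Compare the uniform CDF value p with the empirical CDF just
        # after and just before the i-th sorted p value.
        gap = max(gap, abs((i + 1) / n - p), abs(i / n - p))
    return gap


# Hypothetical example data: evenly spread p values (well calibrated) and
# their squares (too many small p values, i.e. anti-conservative).
uniform_like = [(i + 0.5) / 100 for i in range(100)]
anti_conservative = [p ** 2 for p in uniform_like]
```

A small gap and a curve close to the diagonal suggest good calibration; an observed fraction well above the threshold, as for `anti_conservative`, suggests that the reported p values are too optimistic.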

Keyword
shotgun proteomics, peptide identification, calibration, p value, database search software, standard protein mix
National Category
Bioinformatics (Computational Biology)
Research subject
Biochemistry towards Bioinformatics
Identifiers
urn:nbn:se:su:diva-68512 (URN)
10.1021/pr1012619 (DOI)
000290234800047 ()
Note

Author count: 3

Available from: 2012-01-07. Created: 2012-01-04. Last updated: 2017-12-08. Bibliographically approved.
2. Determining the calibration of confidence estimation procedures for unique peptides in shotgun proteomics
2013 (English). In: Journal of Proteomics, ISSN 1874-3919, E-ISSN 1876-7737, Vol. 80, p. 123-131. Article in journal (Refereed), published.
Abstract [en]

The analysis of a shotgun proteomics experiment results in a list of peptide-spectrum matches (PSMs) in which each fragmentation spectrum has been matched to a peptide in a database. Subsequently, most protein inference algorithms rank peptides according to the best-scoring PSM for each peptide. However, there is disagreement in the scientific literature on the best method to assess the statistical significance of the resulting peptide identifications. Here, we use a previously described calibration protocol to evaluate the accuracy of three different peptide-level statistical confidence estimation procedures: the classical Fisher's method, and two complementary procedures that estimate significance, respectively, before and after selecting the top-scoring PSM for each spectrum. Our experiments show that the latter method, which is employed by MaxQuant and Percolator, produces the most accurate, well-calibrated results.
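
Of the three procedures named above, Fisher's method has a compact closed form and can be sketched directly. For k independent p values, the statistic X = -2 Σ ln p_i follows a chi-square distribution with 2k degrees of freedom under the null hypothesis, and for even degrees of freedom the tail probability is exp(-X/2) Σ_{i<k} (X/2)^i / i!. The Python sketch below assumes independent PSM p values for one peptide; the function name `fisher_combined_p` and the example dictionary are hypothetical and are not taken from the paper.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent p values with Fisher's method.

    Uses the closed-form chi-square survival function for 2k degrees
    of freedom, so no external statistics library is needed.
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("need at least one p value")
    if any(not 0.0 < p <= 1.0 for p in p_values):
        raise ValueError("p values must lie in (0, 1]")
    half_stat = -sum(math.log(p) for p in p_values)  # X / 2
    term = 1.0   # (X/2)^i / i!, starting at i = 0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


# Hypothetical PSM p values grouped by peptide sequence.
psm_p_values = {"PEPTIDEK": [0.05, 0.05], "SAMPLER": [0.3]}
peptide_p = {pep: fisher_combined_p(ps) for pep, ps in psm_p_values.items()}
```

With a single PSM the combined value equals the input p value. The abstract reports that, in the authors' experiments, the procedure applied after selecting the top-scoring PSM was the best calibrated, so this sketch shows one of the compared alternatives, not the recommended one.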

Keyword
Shotgun proteomics, Peptides, Statistics
National Category
Bioinformatics (Computational Biology)
Research subject
Biochemistry towards Bioinformatics
Identifiers
urn:nbn:se:su:diva-89873 (URN)
10.1016/j.jprot.2012.12.007 (DOI)
000317544700010 ()
Note

Author count: 4

Available from: 2013-05-14. Created: 2013-05-14. Last updated: 2017-12-06. Bibliographically approved.
3. A cross-validation scheme for machine learning algorithms in shotgun proteomics
2012 (English). In: BMC Bioinformatics, ISSN 1471-2105, E-ISSN 1471-2105, Vol. 13, Suppl. 16, article S3. Article in journal (Refereed), published.
Abstract [en]

Peptides are routinely identified from mass spectrometry-based proteomics experiments by matching observed spectra to peptides derived from protein databases. The error rates of these identifications can be estimated by target-decoy analysis, which involves matching spectra to shuffled or reversed peptides. Besides estimating error rates, decoy searches can be used by semi-supervised machine learning algorithms to increase the number of confidently identified peptides. As for all machine learning algorithms, however, the results must be validated to avoid issues such as overfitting or biased learning, which would produce unreliable peptide identifications. Here, we discuss how the target-decoy method is employed in machine learning for shotgun proteomics, focusing on how the results can be validated by cross-validation, a frequently used validation scheme in machine learning. We also use simulated data to demonstrate the proposed cross-validation scheme's ability to detect overfitting.
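
The target-decoy error estimate described above can be sketched in a few lines. The Python example below assumes that target and decoy matches are ranked together by score (higher is better), that the decoy database has the same size as the target database, and that ties can be ignored. It estimates the false discovery rate at each target match as the number of decoys divided by the number of targets scoring at least as well, and then converts these estimates to q values. The function name `target_decoy_q_values` and the example data are hypothetical; this is a simplified illustration and not the cross-validation scheme proposed in the paper.

```python
def target_decoy_q_values(psms):
    """Estimate q values for target matches from (score, is_decoy) pairs.

    Returns a list of (score, q_value) for the target matches,
    ordered from the highest to the lowest score.
    """
    ranked = sorted(psms, key=lambda psm: psm[0], reverse=True)
    targets = 0
    decoys = 0
    fdrs = []
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
            # Estimated FDR when accepting all targets down to this score.
            fdrs.append((score, min(1.0, decoys / targets)))

    # The q value is the smallest estimated FDR at this score threshold
    # or at any more permissive (lower) threshold.
    q_values = []
    running_min = 1.0
    for score, fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        q_values.append((score, running_min))
    q_values.reverse()
    return q_values


# Hypothetical example: five target matches and two decoy matches.
example = [(10, False), (9, False), (8, True), (7, False),
           (6, False), (5, True), (4, False)]
result = target_decoy_q_values(example)
```

When a machine learning post-processor is trained on the same decoys that are later used for this estimate, the estimate can become too optimistic, which is the motivation for validating on held-out data as the abstract describes.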

National Category
Bioinformatics (Computational Biology)
Research subject
Biochemistry towards Bioinformatics
Identifiers
urn:nbn:se:su:diva-87133 (URN)
10.1186/1471-2105-13-S16-S3 (DOI)
000312714500003 ()
Note

Author count: 3

Available from: 2013-01-29. Created: 2013-01-28. Last updated: 2017-12-06. Bibliographically approved.
4. Fast and Accurate Database Searches with MS-GF+Percolator
2014 (English). In: Journal of Proteome Research, ISSN 1535-3893, E-ISSN 1535-3907, Vol. 13, no. 2, p. 890-897. Article in journal (Refereed), published.
Abstract [en]

One can interpret fragmentation spectra stemming from peptides in mass-spectrometry-based proteomics experiments using so-called database search engines. Frequently, one also runs post-processors such as Percolator to assess the confidence, infer unique peptides, and increase the number of identifications. A recent search engine, MS-GF+, has shown promising results, due to a new and efficient scoring algorithm. However, MS-GF+ provides few statistical estimates about the peptide-spectrum matches, hence limiting the biological interpretation. Here, we enabled Percolator processing for MS-GF+ output and observed an increased number of identified peptides for a wide variety of data sets. In addition, Percolator directly reports p values and false discovery rate estimates, such as q values and posterior error probabilities, for peptide-spectrum matches, peptides, and proteins, functions that are useful for the whole proteomics community.

Keyword
shotgun proteomics, bioinformatics, machine learning, confidence estimation
National Category
Bioinformatics (Computational Biology)
Research subject
Biochemistry towards Bioinformatics
Identifiers
urn:nbn:se:su:diva-99447 (URN)
10.1021/pr400937n (DOI)
000331164100049 ()
Funder
Swedish Research Council
Available from: 2014-01-13. Created: 2014-01-13. Last updated: 2017-12-06. Bibliographically approved.

Open Access in DiVA

thesis.pdf (493 kB), 607 downloads
File information
File name: FULLTEXT01.pdf
File size: 493 kB
Checksum (SHA-512):
172997747a6cae53d20f83f67a932f420498bed7e5f340a018f623817c9762c6ac7f9a259cf3113ee0f6fb21b411c1c1956736d9fdf72e95837fe3e3b64cc63b
Type: fulltext
Mimetype: application/pdf

Author
Granholm, Viktor
