Statistical and machine learning methods to analyze large-scale mass spectrometry data
The, Matthew
KTH, School of Biotechnology (BIO), Gene Technology. ORCID iD: 0000-0002-5401-5553
2016 (English). Licentiate thesis, comprehensive summary (Other academic)
Abstract [en]

As in many other fields, biology is faced with enormous amounts of data containing valuable information that is yet to be extracted. The field of proteomics, the study of proteins, has the luxury of having large repositories containing data from tandem mass-spectrometry experiments, readily accessible for everyone who is interested. At the same time, there is still a lot to discover about proteins as the main actors in cell processes and cell signaling.

In this thesis, we explore several ways to extract more information from the available data using methods from statistics and machine learning. In particular, we introduce MaRaCluster, a new method for clustering mass spectra on large-scale datasets. This method uses statistical methods to assess similarity between mass spectra, followed by the conservative complete-linkage clustering algorithm. The combination of these two resulted in up to 40% more peptide identifications on its consensus spectra compared to the state-of-the-art method.
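
To illustrate why complete-linkage clustering is called conservative, the following minimal Python sketch merges two clusters only while the largest pairwise distance between their members stays within a threshold. It is not the MaRaCluster implementation; the function name `complete_linkage`, the list-of-lists distance matrix and the threshold parameter are assumptions made for this example.

```python
from itertools import combinations


def complete_linkage(dist, threshold):
    """Agglomerative clustering on a symmetric distance matrix (list of lists).

    Hypothetical helper: two clusters are merged only while the LARGEST
    pairwise distance between their members is at or below `threshold`.
    """
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Complete linkage: distance between the two most dissimilar members.
        return max(dist[i][j] for i in a for j in b)

    while len(clusters) > 1:
        # Find the pair of clusters with the smallest complete-linkage distance.
        (x, y), d = min(
            (((a, b), linkage(clusters[a], clusters[b]))
             for a, b in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if d > threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # safe because x < y
    return [sorted(c) for c in clusters]


# Four items placed at positions 0, 1, 2 and 10 on a line.
positions = (0, 1, 2, 10)
dist = [[abs(a - b) for b in positions] for a in positions]
print(complete_linkage(dist, 1.5))  # [[0, 1], [2], [3]]
```

With a threshold of 1.5, item 2 is not added to the cluster of items 0 and 1, because its distance to item 0 is 2. A single-linkage scheme would chain all three together, which is the behaviour a conservative scheme avoids.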

Second, we attempt to clarify and promote protein-level false discovery rates (FDRs). Frequently, studies fail to report protein-level FDRs even though the proteins are actually the entities of interest. We provided a framework in which to discuss protein-level FDRs in a systematic manner to open up the discussion and take away potential hesitance. We also benchmarked some scalable protein inference methods and included the best one in the Percolator package. Furthermore, we added functionality to the Percolator package to accommodate the analysis of studies in which many runs are aggregated. This reduced the run time for a recent study regarding a draft human proteome from almost a full day to just 10 minutes on a commodity computer, resulting in a list of proteins together with their corresponding protein-level FDRs.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2016. vi, 44 p.
Series
TRITA-BIO-Report, ISSN 1654-2312 ; 2016:3
Keyword [en]
mass spectrometry - LC-MS/MS, statistical analysis, data processing and analysis, protein inference, large-scale studies, simulation
National Category
Bioinformatics and Systems Biology
Research subject
Biotechnology
Identifiers
URN: urn:nbn:se:kth:diva-185149
ISBN: 978-91-7595-933-7 (print)
OAI: oai:DiVA.org:kth-185149
DiVA: diva2:918722
Presentation
2016-05-03, Pascal, floor 6 of the Gamma building (Gamma-huset), Science for Life Laboratory, Tomtebodavägen 23, Solna, 13:00 (English)
Opponent
Supervisors
Note

QC 20160412

Available from: 2016-04-12. Created: 2016-04-11. Last updated: 2016-04-12. Bibliographically approved.
List of papers
1. MaRaCluster: A Fragment Rarity Metric for Clustering Fragment Spectra in Shotgun Proteomics
2016 (English). In: Journal of Proteome Research, ISSN 1535-3893, E-ISSN 1535-3907, Vol. 15, no. 3, p. 713-720. Article in journal (Refereed). Published.
Abstract [en]

Shotgun proteomics experiments generate large amounts of fragment spectra as primary data, normally with high redundancy between and within experiments. Here, we have devised a clustering technique to identify fragment spectra stemming from the same species of peptide. This is a powerful alternative to traditional search engines for analyzing spectra, particularly useful for larger-scale mass spectrometry studies. As an aid in this process, we propose a distance calculation relying on the rarity of experimental fragment peaks, following the intuition that peaks shared by only a few spectra offer more evidence than peaks shared by a large number of spectra. We used this distance calculation and a complete-linkage scheme to cluster data from a recent large-scale mass spectrometry-based study. The clusterings produced by our method have up to 40% more identified peptides for their consensus spectra compared to those produced by the previous state-of-the-art method. We see that our method would advance the construction of spectral libraries as well as serve as a tool for mining large sets of fragment spectra. The source code and Ubuntu binary packages are available at https://github.com/statisticalbiotechnology/maracluster (under an Apache 2.0 license).
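
The intuition that rarely shared peaks carry more evidence can be sketched in a few lines of Python. This is not the published MaRaCluster metric: the example swaps in a simple inverse-frequency weighting, and it assumes that each spectrum has already been reduced to a set of integer peak bins. The names `rarity_weights` and `rarity_distance` are hypothetical.

```python
import math
from collections import Counter


def rarity_weights(spectra):
    """Weight each peak bin by -log(fraction of spectra that contain it).

    Assumption for this sketch: a spectrum is a set of integer peak bins.
    A bin present in every spectrum gets weight 0 and carries no evidence.
    """
    n = len(spectra)
    counts = Counter(peak for spectrum in spectra for peak in spectrum)
    return {peak: -math.log(count / n) for peak, count in counts.items()}


def rarity_distance(a, b, weights):
    """One minus the rarity-weighted overlap of two spectra (0 = identical)."""
    union_weight = sum(weights[p] for p in a | b)
    if union_weight == 0:
        return 1.0  # only ubiquitous peaks: no evidence either way
    shared_weight = sum(weights[p] for p in a & b)
    return 1.0 - shared_weight / union_weight


spectra = [
    {100, 200, 300},
    {100, 200, 301},
    {100, 400, 500},
    {100, 600, 700},
]
weights = rarity_weights(spectra)
near = rarity_distance(spectra[0], spectra[1], weights)  # share bins 100 and 200
far = rarity_distance(spectra[0], spectra[2], weights)   # share only bin 100
```

In this toy data, bin 100 occurs in every spectrum and therefore contributes nothing, so the first and third spectra end up at the maximum distance even though they share a peak. The first and second spectra share the rarer bin 200 and come out closer. A matrix of such distances could then be passed to a complete-linkage scheme.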

Place, publisher, year, edition, pages
American Chemical Society (ACS), 2016
Keyword
Mass spectrometry, proteomics, hierarchical clustering, bioinformatics, database search, spectral archives, spectral libraries
National Category
Bioinformatics (Computational Biology)
Identifiers
urn:nbn:se:kth:diva-184544 (URN)
10.1021/acs.jproteome.5b00749 (DOI)
000371754100005 ()
26653874 (PubMedID)
2-s2.0-84960456163 (Scopus ID)
Funder
Science for Life Laboratory - a national resource center for high-throughput molecular bioscience
Note

QC 20160406

Available from: 2016-04-06. Created: 2016-04-01. Last updated: 2017-11-30. Bibliographically approved.
2. How to talk about protein-level false discovery rates in shotgun proteomics
(English). Manuscript (preprint) (Other academic)
Abstract [en]

A frequently sought output from a shotgun proteomics experiment is a list of proteins that we believe to have been present in the analyzed sample before proteolytic digestion. The standard technique to control for errors in such lists is to enforce a preset threshold for the false discovery rate. Many researchers consider protein-level false discovery rates a difficult and vague concept, as the measurement entities, spectra, are manifestations of peptides and not proteins. Here, we argue that this confusion is unnecessary and provide a framework on how to think about protein-level false discovery rates, starting from its basic principle: the null hypothesis. Specifically, we point out that two competing null hypotheses are used concurrently in today's protein inference methods, which has gone unnoticed by many. Using simulations of a shotgun proteomics experiment, we show how confusing one null hypothesis for the other can lead to serious discrepancies in the false discovery rate. Furthermore, we demonstrate how the same simulations can be used to verify false discovery rate estimates of protein inference methods. In particular, we show that, for a simple protein inference method, decoy models can be used to accurately estimate protein-level false discovery rates for both competing null hypotheses.
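
As a rough illustration of how decoys can be used to estimate a protein-level false discovery rate, the sketch below applies the common target-decoy counting estimate to a scored list of proteins. It is a simplified example under stated assumptions, not the estimator from the manuscript: it assumes equally sized target and decoy databases, ignores score ties, and does not distinguish between the two null hypotheses discussed above. The function name `decoy_q_values` is hypothetical.

```python
def decoy_q_values(scored):
    """Estimate q-values for target proteins from a target-decoy list.

    scored: list of (score, is_decoy) pairs; a higher score is more confident.
    Returns (score, q_value) pairs for the targets only, best score first.
    """
    ranked = sorted(scored, key=lambda entry: entry[0], reverse=True)

    targets = decoys = 0
    fdrs = []
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
            # Estimated FDR when accepting everything down to this score.
            fdrs.append((score, decoys / targets))

    # q-value: the lowest estimated FDR at which this protein is still accepted.
    q_values = []
    best = 1.0
    for score, fdr in reversed(fdrs):
        best = min(best, fdr)
        q_values.append((score, best))
    return q_values[::-1]


proteins = [
    (9.0, False), (8.0, False), (7.0, True), (6.0, False),
    (5.0, False), (4.0, True), (3.0, False),
]
for score, q in decoy_q_values(proteins):
    print(score, q)
```

Enforcing a preset threshold then amounts to reporting every target protein whose q-value is at or below that threshold.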

Keyword
mass spectrometry - LC-MS/MS, statistical analysis, data processing and analysis, protein inference, simulation
National Category
Bioinformatics and Systems Biology
Research subject
Biotechnology
Identifiers
urn:nbn:se:kth:diva-185116 (URN)
Note

QC 20160412

Available from: 2016-04-11. Created: 2016-04-11. Last updated: 2016-04-12. Bibliographically approved.
3. Fast and accurate protein false discovery rates on large-scale proteomics data sets with Percolator 3.0
(English). Manuscript (preprint) (Other academic)
Abstract [en]

Percolator is a widely used software tool that increases yield in shotgun proteomics experiments and assigns reliable statistical confidence measures, such as q values and posterior error probabilities, to peptides and peptide-spectrum matches (PSMs) from such experiments. Percolator's processing speed has been sufficient for typical data sets consisting of hundreds of thousands of PSMs. With our new scalable approach, we can now also analyze millions of PSMs in a matter of minutes on a commodity computer. Furthermore, with the increasing awareness of the need for reliable statistics on the protein level, we compared several easy-to-understand protein inference methods and implemented the best-performing method - grouping proteins by their corresponding sets of theoretical peptides and then considering only the best-scoring peptide for each protein - in the Percolator package. We used Percolator 3.0 to analyze the data from a recent study of the draft human proteome containing 25 million spectra (PMID: 24870542).

The source code and Ubuntu, Windows, MacOS and Fedora binary packages are available from http://percolator.ms/ under an Apache 2.0 license.
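
The protein inference method described in the abstract above (group proteins by their sets of theoretical peptides, then score each group by its best-scoring peptide) can be sketched as follows. This is a minimal illustration, not Percolator's code or interface: the function name `infer_protein_groups` and the dictionary-based inputs are assumptions, and peptides shared between different groups receive no special treatment here.

```python
from collections import defaultdict


def infer_protein_groups(theoretical_peptides, peptide_scores):
    """Group proteins with identical theoretical peptide sets and score each group.

    theoretical_peptides: protein id -> iterable of peptides from an in-silico digest
    peptide_scores: observed peptide -> score, where higher is better
    Returns (protein group, best peptide score) pairs, best score first.
    Groups without any observed peptide are left out.
    """
    groups = defaultdict(list)
    for protein, peptides in theoretical_peptides.items():
        # Proteins that cannot be told apart by their peptides share one group.
        groups[frozenset(peptides)].append(protein)

    scored_groups = []
    for peptides, proteins in groups.items():
        observed = [peptide_scores[p] for p in peptides if p in peptide_scores]
        if observed:
            scored_groups.append((tuple(sorted(proteins)), max(observed)))
    return sorted(scored_groups, key=lambda group: group[1], reverse=True)


theoretical = {
    "P1": ["AAK", "CCR"],
    "P2": ["CCR", "AAK"],  # same peptide set as P1, so indistinguishable
    "P3": ["DDK"],
    "P4": ["EEK"],         # no observed peptide
}
scores = {"AAK": 2.5, "CCR": 4.0, "DDK": 3.0}
print(infer_protein_groups(theoretical, scores))
# [(('P1', 'P2'), 4.0), (('P3',), 3.0)]
```

Scoring a group by a single best peptide keeps the computation linear in the number of peptides, which fits the emphasis on scalability in the abstract.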

Keyword
mass spectrometry - LC-MS/MS, statistical analysis, data processing and analysis, protein inference, large-scale studies
National Category
Bioinformatics and Systems Biology
Identifiers
urn:nbn:se:kth:diva-185148 (URN)
Note

QC 20160412

Available from: 2016-04-11. Created: 2016-04-11. Last updated: 2016-04-12. Bibliographically approved.

Open Access in DiVA

fulltext (1281 kB)
File information
File name: FULLTEXT01.pdf
File size: 1281 kB
Checksum (SHA-512): 24790643e6c2ade4a78fa07ac7745ccf860af5bfac08a30f2d9beca4f094b237448a5c417dd729cea6d35b62d882dc2e46791b150f05a4f13967c6ee8719e07f
Type: fulltext
Mimetype: application/pdf

Search in DiVA

By author/editor
The, Matthew
By organisation
Gene Technology
Bioinformatics and Systems Biology
