Computational methods to estimate error rates for peptide identifications in mass spectrometry-based proteomics
KTH, School of Engineering Sciences (SCI), Mathematics (Dept.), Numerical Analysis, NA.
2013 (English)
Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits
Student thesis
Alternative title
Beräkningsmetoder för att uppskatta felfrekvensen hos peptididentifikationer inom masspektrometri-baserad proteomik (Swedish)
Abstract [en]

In proteomics, tandem mass spectrometry is the core technology for identifying peptide components within complex mixtures on a large scale. The current bottleneck is reducing the error rates of peptide identifications and assigning accurate statistical estimates to them.

In this work, we introduce techniques for identifying chimeric spectra, in which two or more precursor ions with similar mass and retention time are co-fragmented and sequenced together by the MS/MS instrument. Using these techniques, we analyze chimeric spectra as a factor behind the high error rate of identifications. We show that the presence of chimeric spectra is highly correlated with the ranking scores and can reduce the number of positive identifications.
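
As a rough illustration of the definition above, the following sketch flags an MS/MS scan as a chimeric candidate when two or more precursor features fall inside its isolation window. It is not the thesis's procedure: the `Precursor` record, the function names, and the tolerances (`mz_halfwidth`, `rt_tol`) are hypothetical and chosen only for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Precursor:
    """A precursor ion feature from the MS1 survey scans (hypothetical record)."""
    mz: float  # mass-to-charge ratio
    rt: float  # retention time in seconds


def co_isolated(precursors, center_mz, scan_rt, mz_halfwidth=1.0, rt_tol=15.0):
    """Return the precursors inside the isolation window of one MS/MS scan.

    The default tolerances are illustrative assumptions, not values from the thesis.
    """
    return [
        p for p in precursors
        if abs(p.mz - center_mz) <= mz_halfwidth and abs(p.rt - scan_rt) <= rt_tol
    ]


def is_chimeric(precursors, center_mz, scan_rt, **tolerances):
    """Flag a scan as a chimeric candidate if two or more precursors are co-isolated."""
    return len(co_isolated(precursors, center_mz, scan_rt, **tolerances)) >= 2


features = [
    Precursor(mz=500.27, rt=1200.0),
    Precursor(mz=500.80, rt=1205.0),  # close to the first feature in m/z and RT
    Precursor(mz=612.33, rt=1201.0),  # far away in m/z
    Precursor(mz=500.30, rt=2400.0),  # close in m/z but elutes much later
]

print(is_chimeric(features, center_mz=500.5, scan_rt=1202.0))  # True
print(is_chimeric(features, center_mz=612.3, scan_rt=1202.0))  # False
```

In practice the window width and retention-time tolerance depend on the instrument settings, so they would be taken from the acquisition method rather than fixed as here.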

Additionally, we address the problem of assigning a posterior error probability (PEP) to the individual peptide-spectrum matches (PSMs) obtained from search engines. This problem is computationally more difficult than estimating the error rate associated with a large collection of PSMs, such as the false discovery rate (FDR). Existing methods rely on parametric or semiparametric models of the underlying score distribution. We provide a kernel logistic regression procedure that makes no explicit assumptions about the score distribution. Based on an appropriate positive definite Gaussian kernel, the resulting PEP estimate is shown to be robust, achieving a close correspondence between the PEP-derived q-values and the FDR-derived q-values. Furthermore, setting a threshold on PEP-derived q-values accepts at least 200 more significant PSMs than setting it on FDR-derived q-values. Finally, we show that this kernel logistic regression method, which is well established in the statistics literature, can produce accurate PEP estimates for different types of PSM score functions and data.
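
The abstract compares two kinds of q-values without giving formulas. The sketch below shows one common way to compute each, assuming the usual target-decoy convention: an FDR-derived q-value is the smallest decoy/target ratio over all score thresholds that still accept the PSM, and a PEP-derived q-value is the mean PEP of all PSMs with an equal or lower PEP. The function names and the toy data are hypothetical, and the thesis may define these quantities differently.

```python
def qvalues_from_decoys(scores, is_decoy):
    """FDR-derived q-values from target-decoy counts (higher score = better)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    fdrs = []
    targets = decoys = 0
    for i in order:
        if is_decoy[i]:
            decoys += 1
        else:
            targets += 1
        # Estimated FDR when accepting every PSM down to this score.
        fdrs.append(decoys / max(targets, 1))
    # q-value: minimum FDR over all thresholds that still accept this PSM.
    qvals = [0.0] * len(scores)
    best = float("inf")
    for pos in range(len(order) - 1, -1, -1):
        best = min(best, fdrs[pos])
        qvals[order[pos]] = best
    return qvals


def qvalues_from_peps(peps):
    """PEP-derived q-values: running mean of the PEPs sorted in ascending order."""
    order = sorted(range(len(peps)), key=lambda i: peps[i])
    qvals = [0.0] * len(peps)
    running_sum = 0.0
    for rank, i in enumerate(order, start=1):
        running_sum += peps[i]
        qvals[i] = running_sum / rank
    return qvals


# Toy data (invented for illustration only).
scores = [9.0, 8.0, 7.0, 6.0, 5.0, 4.0]
is_decoy = [False, False, True, False, False, True]
peps = [0.01, 0.02, 0.30, 0.05, 0.10, 0.90]

print(qvalues_from_decoys(scores, is_decoy))  # [0.0, 0.0, 0.25, 0.25, 0.25, 0.5]
print(qvalues_from_peps(peps))
```

Because the running mean of an ascending sequence never decreases, the PEP-derived q-values are monotone in the PEP without a separate minimization step.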

Abstract [sv]

Tandem mass spectrometry (MS/MS) is at the core of proteomics studies that seek to identify peptides within complex protein solutions on a large scale. At present, the bottleneck is to reduce the error rate of the peptide identifications and to assign accurate statistical estimates to them. In this work, we present methods for identifying chimeric spectra, in which two or more precursor ions with similar mass and retention time are co-fragmented and sequenced in an MS/MS instrument. The hypothesis is that these co-fragmented ions are one reason for the high error rate of peptide identifications. We show that chimeric spectra correlate with identification quality and can reduce the number of positive identifications.

In addition, we examine the problem of assigning a posterior error probability (PEP) to individual peptide-spectrum matches (PSMs) obtained from search engines. This problem is computationally harder than estimating the error rate for a large collection of PSMs, such as the false discovery rate (FDR). Existing methods rely on parametric or semiparametric models of the underlying distribution of identification scores. We provide a kernel logistic regression model without any explicit assumptions about the distribution. Based on a suitable positive definite Gaussian kernel, the resulting PEP estimate has been shown to be robust by achieving a close correspondence between PEP-derived q-values and FDR-derived q-values. Finally, we show that this nonparametric kernel logistic regression method is well established in the statistical literature and can produce accurate PEP estimates for different types of PSM scores.

Place, publisher, year, edition, pages
2013. 42 p.
Series
TRITA-MAT-E, 2013:32
National Category
Computational Mathematics
Identifiers
URN: urn:nbn:se:kth:diva-124027
OAI: oai:DiVA.org:kth-124027
DiVA: diva2:632708
External cooperation
Science for Life Laboratory (SciLifeLab), Stockholm
Subject / course
Numerical Analysis
Educational program
Master of Science - Computer Simulation for Science and Engineering
Uppsok
Physics, Chemistry, Mathematics
Supervisors
Examiners
Available from: 2013-06-25
Created: 2013-06-25
Last updated: 2013-06-25
Bibliographically approved

Open Access in DiVA

fulltext (3440 kB), 532 downloads
File information
File name: FULLTEXT01.pdf
File size: 3440 kB
Checksum (SHA-512): 4336ece421ab5b704f94c79d6ecbbf66230388a12a4b51b4c4df379ef4d0f66e0357eb675d1738a44e1da48f9d2f16a1cacb1ee4c98da65fc8625ee65c9c424c
Type: fulltext
Mimetype: application/pdf

Total: 532 downloads
The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.

Total: 337 hits