Cluster analysis on high dimensional RNA-seq data with applications to cancer research: An evaluation study
Umeå University, Faculty of Science and Technology, Department of Mathematics and Mathematical Statistics.
Umeå University, Faculty of Science and Technology, Department of Mathematics and Mathematical Statistics. Umeå University, Faculty of Social Sciences, Umeå School of Business and Economics (USBE), Statistics.
Umeå University, Faculty of Science and Technology, Department of Mathematics and Mathematical Statistics.
2019 (English). In: PLoS ONE, E-ISSN 1932-6203, Vol. 14, no 12, article id e0219102
Article in journal (Refereed), Published
Abstract [en]

Background: Clustering of gene expression data is widely used to identify novel subtypes of cancer. Plenty of clustering approaches have been proposed, but there is a lack of knowledge regarding their relative merits and how data characteristics influence the performance. We evaluate how cluster analysis choices affect the performance by studying four publicly available human cancer data sets: breast, brain, kidney and stomach cancer. In particular, we focus on how the sample size, distribution of subtypes and sample heterogeneity affect the performance.

Results: In general, increasing the sample size had a limited effect on the clustering performance; for the breast cancer data, for example, similar performance was obtained for n = 40 as for n = 330. The relative distribution of the subtypes had a noticeable effect on the ability to identify the disease subtypes, and data with disproportionate cluster sizes turned out to be difficult to cluster. Both the choice of clustering method and the choice of selection method affected the ability to identify the subtypes, but the relative performance varied between data sets, making it difficult to rank the approaches. For some data sets, the performance was substantially higher when the clustering was based on data from only one sex compared to data from a mixed population. This suggests that homogeneous data are easier to cluster than heterogeneous data and that clustering males and females individually may be beneficial and increase the chance of detecting novel subtypes. It was also observed that the performance often differed substantially between females and males.
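To illustrate how sample size and subtype distribution can be varied in an evaluation of this kind, the following is a minimal sketch of drawing a subsample with a chosen size and chosen subtype proportions. The function name, its parameters and the toy labels are hypothetical and are not taken from the article; the study's actual sampling procedure is not described in this record.

```python
import random
from collections import defaultdict


def stratified_subsample(labels, n, proportions, seed=0):
    """Pick sample indices so that each subtype gets round(n * share) samples.

    labels: known subtype label per sample (hypothetical input).
    proportions: mapping subtype -> desired share of the subsample.
    Because of rounding, the total can differ slightly from n.
    """
    rng = random.Random(seed)

    # Group sample indices by their subtype label.
    by_subtype = defaultdict(list)
    for idx, label in enumerate(labels):
        by_subtype[label].append(idx)

    chosen = []
    for subtype, share in proportions.items():
        k = round(n * share)
        available = by_subtype[subtype]
        if k > len(available):
            raise ValueError(f"not enough samples of subtype {subtype!r}")
        # Draw k distinct samples of this subtype without replacement.
        chosen.extend(rng.sample(available, k))
    return sorted(chosen)


# Toy data: 30 samples of subtype "A" and 10 of subtype "B".
toy_labels = ["A"] * 30 + ["B"] * 10
balanced = stratified_subsample(toy_labels, 20, {"A": 0.5, "B": 0.5})
skewed = stratified_subsample(toy_labels, 20, {"A": 0.9, "B": 0.1})
```

Repeating such draws for several values of n and several proportion settings, and clustering each subsample, is one way to separate the effect of sample size from the effect of unequal cluster sizes.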

Conclusions: The number of samples seems to have a limited effect on the performance, while the heterogeneity, at least with respect to sex, is important for the performance. Hence, by analyzing the genders separately, the possible loss caused by having fewer samples could be outweighed by the benefit of more homogeneous data.
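The abstract speaks of clustering "performance" as the ability to recover known subtypes. One common way to quantify this is the adjusted Rand index, which compares a clustering with reference labels and is unaffected by how the clusters are numbered. The sketch below assumes that measure for illustration only; this record does not state which performance measure the article uses, and the function name and toy labels are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, cluster_labels):
    """Agreement between two partitions, corrected for chance.

    Returns 1.0 for identical partitions (up to relabelling) and values
    near 0 for agreement no better than random assignment.
    """
    if len(true_labels) != len(cluster_labels):
        raise ValueError("label sequences must have the same length")

    n = len(true_labels)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two samples: partitions trivially agree

    # Contingency table cells and its row and column totals.
    cells = Counter(zip(true_labels, cluster_labels))
    rows = Counter(true_labels)
    cols = Counter(cluster_labels)

    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / total_pairs
    maximum = 0.5 * (sum_rows + sum_cols)
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both partitions are one cluster
    return (sum_cells - expected) / (maximum - expected)


# Toy example: known subtypes versus a clustering that splits one subtype.
subtypes = ["luminal", "luminal", "basal", "basal"]
clusters = [0, 0, 1, 2]
score = adjusted_rand_index(subtypes, clusters)
```

Computing such a score separately for a mixed-sex data set and for each sex alone would give the kind of comparison the conclusions describe.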

Place, publisher, year, edition, pages
San Francisco: Public Library of Science, 2019. Vol. 14, no 12, article id e0219102
Keywords [en]
Cancer, cluster analysis
National Category
Probability Theory and Statistics; Bioinformatics and Systems Biology
Identifiers
URN: urn:nbn:se:umu:diva-167274
DOI: 10.1371/journal.pone.0219102
PubMedID: 31805048
OAI: oai:DiVA.org:umu-167274
DiVA, id: diva2:1385558
Available from: 2020-01-14 Created: 2020-01-14 Last updated: 2020-01-15 (Bibliographically approved)
In thesis
1. cancer subtype identification using cluster analysis on high-dimensional omics data
2020 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Identification and prediction of cancer subtypes are important parts in the development towards personalized medicine. By tailoring treatments, it is possible to decrease unnecessary suffering and reduce costs. Since the introduction of next generation sequencing techniques, the amount of data available for medical research has increased rapidly. The high dimensional omics data produced by various techniques requires statistical methods to transform data into information and knowledge.

All papers in this thesis are related to distinguishing disease subtypes in patients with cancer using omics data. The high dimension and the complexity of sequencing data from tumor samples make it necessary to pre-process the data. We carry out comparisons of feature selection methods and clustering methods used for identification of cancer subtypes. In addition, we evaluate the effect that certain characteristics of the data have on the ability to identify cancer subtypes. The results show that no method outperforms the others in all cases and that the relative ranking of methods depends strongly on the data. We also show that the benefit of obtaining more homogeneous data by analyzing genders separately can outweigh the possible drawbacks caused by smaller sample sizes. One of the major challenges when dealing with omics data from tumor samples is that the patients are generally a very heterogeneous group. Factors that lead to heterogeneity include age, gender, ethnicity and stage of disease. The effect size of each of these factors might affect the ability to identify the subgroups of interest.

In omics data, the feature space is often large, and the number of features that are informative for the factors of interest also affects the complexity of the problem. We present a novel clustering approach that can identify different clusters in different subsets of the feature space, which is applied to methylation data to create new potential biomarkers. It is shown that by combining clinical data with methylation data for patients with clear cell renal carcinoma, it is possible to improve the currently used prediction model for disease progression.

Using unsupervised clustering techniques, we identify three molecular subtypes of prostate cancer bone metastases based on gene expression profiles. The robustness of the identified subtypes is confirmed by applying several clustering algorithms with very similar results.

Place, publisher, year, edition, pages
Umeå: Umeå universitet, 2020. p. 22
Series
Research report in mathematical statistics, ISSN 1653-0829 ; 70/20
Keywords
cluster analysis, cancer, classification
National Category
Probability Theory and Statistics
Identifiers
urn:nbn:se:umu:diva-167275 (URN)
978-91-7855-172-9 (ISBN)
978-91-7855-173-6 (ISBN)
Public defence
2020-02-07, N460, Naturvetarhuset, Umeå, 09:15 (English)
Available from: 2020-01-17 Created: 2020-01-14 Last updated: 2020-01-15 (Bibliographically approved)

Open Access in DiVA

fulltext (2717 kB), 20 downloads
File information
File name: FULLTEXT01.pdf
File size: 2717 kB
Checksum (SHA-512): 0e50a57b87ff3e6c319d5e6560c3d2579e752c7e3d1244b9fe425e0522fbe9478d24241fcda5d487b0ec4b240c73f06e0c55304cf3518f5afd9f407bb44f0ce3
Type: fulltext
Mimetype: application/pdf


Search in DiVA

By author/editor
Vidman, Linda; Källberg, David; Rydén, Patrik