Digitala Vetenskapliga Arkivet

Machine Learning for Predicting Cancer Endpoints from Bulk Omics Data: Generalizing Knowledge from Various Modalities Across Domains
Stahlschmidt, Sören Richard
University of Skövde, School of Bioscience. University of Skövde, Systems Biology Research Environment. (Translational Bioinformatics). ORCID iD: 0000-0003-4191-8435
2025 (English)Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Cancer remains one of the leading causes of death and is a major burden on patients and healthcare systems. One difficulty in finding effective treatments and matching patients to the right treatment strategy is the complexity of tumor biology. Machine learning holds the potential to learn patterns from data generated by high-throughput technologies, such as RNA-sequencing, that can elucidate the mechanisms underlying cancers and make clinically relevant predictions. In this thesis, we investigate the modeling of cancer with machine learning approaches from different molecular perspectives. First, we review the literature on the fusion of biomedical modalities with multimodal deep neural networks. In this review, we provide a descriptive overview, propose a novel taxonomy, and identify relevant research gaps. Moreover, for models to be applicable to clinical practice, they must be robust to shifts in the distribution from which patients are sampled. Such shifts can stem from differences in the underlying biology or technical variation introduced during the processing of the biological material. Therefore, in two studies, we investigate domain generalization of machine learning models trained with bulk RNA-sequencing data to predict cancer survival endpoints. First, we show that deep learning-based domain generalization methods developed on non-molecular data improve robustness to distributional shifts on molecular data. We test these methods by predicting overall and recurrence-free survival of breast cancer patients with subgroup shifts between source and target domains. Next, we show that relative representations of normalized count values, such as binning or ranking of expression values within a single sample, can increase domain generalization. We test these approaches in three experiments on breast, brain, and ovarian cancer.
In a final study, we show that cancer stage can be predicted from circulating microRNA data with machine learning models, providing a proof of concept for this application. Overall, the work in this thesis supports making machine learning models more applicable to clinical practice by providing empirical evidence of methods improving the modeling of cancer biology. Continuing to study domain generalization of models in clinical practice and to develop methods for robustness are highlighted as future work.
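The relative representations mentioned in the abstract (ranking or binning expression values within a single sample) can be illustrated with a minimal sketch. This is not the thesis implementation: the function names `rank_within_sample` and `bin_within_sample`, the average-rank handling of ties, the scaling to [0, 1], and the equal-frequency binning with four bins are all assumptions chosen for illustration.

```python
from typing import List, Sequence


def rank_within_sample(values: Sequence[float]) -> List[float]:
    """Replace each value by its average rank within one sample, scaled to [0, 1].

    Hypothetical helper: tie handling and scaling are illustrative assumptions.
    """
    n = len(values)
    if n == 0:
        return []
    if n == 1:
        return [0.0]
    # Indices of the values in ascending order.
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    start = 0
    while start < n:
        # Find the run of tied values and give each the average rank of the run.
        end = start
        while end + 1 < n and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg_rank = (start + end) / 2.0
        for k in range(start, end + 1):
            ranks[order[k]] = avg_rank
        start = end + 1
    return [r / (n - 1) for r in ranks]


def bin_within_sample(values: Sequence[float], n_bins: int = 4) -> List[int]:
    """Assign each value to one of n_bins bins based on its scaled within-sample rank."""
    scaled = rank_within_sample(values)
    return [min(int(s * n_bins), n_bins - 1) for s in scaled]


# Made-up expression values for one sample, and the same sample after a
# monotone rescaling that stands in for a technical shift between domains.
sample_a = [1.0, 5.0, 3.0, 10.0]
sample_b = [x * 100.0 + 7.0 for x in sample_a]
same_ranks = rank_within_sample(sample_a) == rank_within_sample(sample_b)
```

Because ranks depend only on the ordering of values within a sample, any strictly increasing transformation of a sample leaves its representation unchanged, which is one intuition for why such representations might be less sensitive to technical variation.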

Place, publisher, year, edition, pages
Skövde: University of Skövde, 2025, p. xi, 147
Series
Dissertation Series ; 63
National Category
Cancer and Oncology; Bioinformatics (Computational Biology); Other Computer and Information Science
Research subject
Bioinformatics
Identifiers
URN: urn:nbn:se:his:diva-25131; Libris ID: q9jbkb73nm7d2kp0; ISBN: 978-91-987907-9-5 (print); ISBN: 978-91-989080-0-8 (electronic); OAI: oai:DiVA.org:his-25131; DiVA, id: diva2:1957293
Public defence
2025-06-04, G110, University of Skövde Building G, Skövde, 13:00 (English)
Opponent
Supervisors
Note

One of four constituent papers (for the others, see the heading Delarbeten/List of papers):

3. Stahlschmidt, Sören Richard, Synnergren, Jane, and Giovannucci, Andrea (2025). “Relative Representations of RNA-seq Data Improve Domain Generalization of Machine Learning Models for Cancer Prognosis”. In: Under Submission.

Publications with low relevance:

5. Johansson, Markus, Stahlschmidt, Sören Richard, Heydarkhan-Hagvall, Sepideh, Jeppsson, Anders, Holmgren, Gustav, Sartipy, Peter, and Synnergren, Jane (2025). “Uncovering the transcriptomic landscape of cardiac hypertrophy using single-cell RNA sequencing and machine learning”. In: Under Submission.

6. Lyubetskaya, Anna et al. (2025). “In situ multi-modal characterization of pancreatic cancer reveals tumor cell identity as a defining factor of the surrounding microenvironment”. In: Under Submission.

7. Marzec-Schmidt, Katarzyna, Ghosheh, Nidal, Stahlschmidt, Sören Richard, Küppers-Munther, Barbara, Synnergren, Jane, and Ulfenborg, Benjamin (2023). “Artificial Intelligence Supports Automated Characterization of Differentiated Human Pluripotent Stem Cells”. In: Stem Cells 41.9, pp. 850–861. DOI: 10.1093/stmcls/sxad049.

Available from: 2025-05-12 Created: 2025-05-09 Last updated: 2025-05-21 Bibliographically approved
List of papers
1. Multimodal deep learning for biomedical data fusion: a review
2022 (English). In: Briefings in Bioinformatics, ISSN 1467-5463, E-ISSN 1477-4054, Vol. 23, no 2, article id bbab569. Article, review/survey (Refereed). Published
Abstract [en]

Biomedical data are becoming increasingly multimodal and thereby capture the underlying complex relationships among biological processes. Deep learning (DL)-based data fusion strategies are a popular approach for modeling these nonlinear relationships. Therefore, we review the current state-of-the-art of such methods and propose a detailed taxonomy that facilitates more informed choices of fusion strategies for biomedical applications, as well as research on novel methods. By doing so, we find that deep fusion strategies often outperform unimodal and shallow approaches. Additionally, the proposed subcategories of fusion strategies show different advantages and drawbacks. The review of current methods has shown that, especially for intermediate fusion strategies, joint representation learning is the preferred approach as it effectively models the complex interactions of different levels of biological organization. Finally, we note that gradual fusion, based on prior biological knowledge or on search strategies, is a promising future research path. Similarly, utilizing transfer learning might overcome sample size limitations of multimodal data sets. As these data sets become increasingly available, multimodal DL approaches present the opportunity to train holistic models that can learn the complex regulatory dynamics behind health and disease.

Place, publisher, year, edition, pages
Oxford University Press, 2022
Keywords
data integration, deep neural networks, fusion strategies, multi-omics, multimodal machine learning, representation learning
National Category
Bioinformatics (Computational Biology)
Research subject
Bioinformatics
Identifiers
urn:nbn:se:his:diva-20873 (URN); 10.1093/bib/bbab569 (DOI); 000804196500091 (); 35089332 (PubMedID); 2-s2.0-85127534700 (Scopus ID)
Funder
Knowledge Foundation, 20170302; Knowledge Foundation, 20200014
Note

CC BY-NC 4.0

Corresponding author: Sören Richard Stahlschmidt. Systems Biology Research Center, University of Skövde, Skövde, Sweden. E-mail: soren.richard.stahlschmidt@his.se

Published: 28 January 2022

This work was supported by the University of Skövde, Sweden under grants from the Knowledge Foundation (20170302, 20200014).

Available from: 2022-01-31 Created: 2022-01-31 Last updated: 2025-05-09 Bibliographically approved
2. Domain Generalization of Deep Learning Models Under Subgroup Shift in Breast Cancer Prognosis
2024 (English). In: 2024 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), IEEE, 2024. Conference paper, Published paper (Refereed)
Abstract [en]

Predicting breast cancer prognosis from gene expression profiles of the primary tumor has become a promising application of deep learning. Yet, to be relevant to real-world applications in the clinic and for knowledge discovery, these models must be robust to common distribution shifts. In this study, we evaluate recently proposed methods for improving generalization under domain and subgroup shifts. We test the in-distribution and out-of-distribution generalization of multiple episode learning, stochastic weight averaging, group distributionally robust optimization, and a subsampling scheme on one training and four external breast cancer prognosis datasets. The evaluation found that the methods can, to varying degrees, improve generalization across domains, although generalization gaps remain, some of them large. Additionally, in-distribution and out-of-distribution generalization differ between clinical subtypes of breast cancer. Thus, we conclude that further research into methods specifically addressing challenges in breast cancer prognosis from gene expression data is warranted.
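As a rough illustration of the objective that group distributionally robust optimization targets, the sketch below contrasts the average loss with the worst-group loss. It is a simplification under stated assumptions: the published group DRO method additionally maintains adaptive group weights during training, which is not shown here, and the function names, group labels, and loss values are hypothetical rather than taken from the paper.

```python
from collections import defaultdict
from typing import Dict, Hashable, Sequence


def group_mean_losses(
    losses: Sequence[float], groups: Sequence[Hashable]
) -> Dict[Hashable, float]:
    """Average the per-sample losses separately for each group label."""
    if len(losses) != len(groups):
        raise ValueError("losses and groups must have the same length")
    sums: Dict[Hashable, float] = defaultdict(float)
    counts: Dict[Hashable, int] = defaultdict(int)
    for loss, group in zip(losses, groups):
        sums[group] += loss
        counts[group] += 1
    return {g: sums[g] / counts[g] for g in sums}


def worst_group_loss(losses: Sequence[float], groups: Sequence[Hashable]) -> float:
    """Return the largest group-wise mean loss (the quantity a robust objective minimizes)."""
    per_group = group_mean_losses(losses, groups)
    if not per_group:
        raise ValueError("at least one sample is required")
    return max(per_group.values())


# Made-up per-sample losses with hypothetical subgroup labels "A" and "B".
example_losses = [0.2, 0.4, 1.0, 3.0]
example_groups = ["A", "A", "B", "B"]
average = sum(example_losses) / len(example_losses)
worst = worst_group_loss(example_losses, example_groups)
```

In this toy example the worst-group loss exceeds the average loss, which shows how an average can hide poor performance on one subgroup.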

Place, publisher, year, edition, pages
IEEE, 2024
Series
IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), ISSN 2994-9351, E-ISSN 2994-9408
Keywords
breast cancer, domain generalization, gene expression, subgroup shift, survival analysis, Contrastive Learning, Diseases, Lung cancer, Stochastic systems, Breast cancer prognosis, Gene expression profiles, Generalisation, Genes expression, Learning models, Real-world
National Category
Cancer and Oncology; Bioinformatics and Computational Biology; Other Computer and Information Science
Research subject
Bioinformatics; Skövde Artificial Intelligence Lab (SAIL)
Identifiers
urn:nbn:se:his:diva-24659 (URN); 10.1109/CIBCB58642.2024.10702166 (DOI); 2-s2.0-85207504799 (Scopus ID); 979-8-3503-5663-2 (ISBN); 979-8-3503-5664-9 (ISBN)
Conference
21st IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology, CIBCB 2024, 27-29 August 2024, Natal, Brazil
Funder
Knowledge Foundation, 20170302; Knowledge Foundation, 20200014; Swedish Research Council, 2022-06725
Note

© 2024 IEEE

Correspondence Address: S.R. Stahlschmidt; University of Skövde, Systems Biology Research Center, Skövde, Sweden; email: soren.richard.stahlschmidt@his.se

This work was supported by the University of Skövde, Sweden under grants from the Knowledge Foundation (20170302, 20200014). The computations were enabled by resources provided by Chalmers e-Commons at Chalmers and the National Academic Infrastructure for Supercomputing in Sweden (NAISS), partially funded by the Swedish Research Council through grant agreement no. 2022-06725.

Available from: 2024-11-07 Created: 2024-11-07 Last updated: 2025-05-09 Bibliographically approved
3. Predicting Cancer Stage from Circulating microRNA: A Comparative Analysis of Machine Learning Algorithms
2023 (English). In: Bioinformatics and Biomedical Engineering: 10th International Work-Conference, IWBBIO 2023, Meloneras, Gran Canaria, Spain, July 12–14, 2023, Proceedings, Part I / [ed] Ignacio Rojas; Olga Valenzuela; Fernando Rojas Ruiz; Luis Javier Herrera; Francisco Ortuño, Cham: Springer, 2023, p. 103-115. Conference paper, Published paper (Refereed)
Abstract [en]

In recent years, serum-based tests for early detection and detection of tissue of origin have been under development. Circulating microRNA has been shown to be a potential source of diagnostic information that can be collected non-invasively. In this study, we investigate circulating microRNAs as predictors of cancer stage. Specifically, we predict whether a sample stems from a patient with early-stage (0-II) or late-stage (III-IV) cancer. We trained five machine learning algorithms on a data set of cancers from twelve different primary sites. The results showed that cancer stage can be predicted from circulating microRNA with a sensitivity of 71.73%, specificity of 79.97%, as well as positive and negative predictive values of 54.81% and 89.29%, respectively. Furthermore, we compared the best pan-cancer model with models specialized for individual cancers and found no statistically significant difference. Finally, in the best-performing pan-cancer model, 185 microRNAs were significant. Comparing the five most relevant circulating microRNAs in the best-performing model with the current literature showed some known associations to various cancers. In conclusion, the study showed the potential of circulating microRNA and machine learning algorithms to predict cancer stage and thus suggests that further research into its potential as a non-invasive clinical test is warranted.
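The four metrics reported in the abstract follow from the counts of a binary confusion matrix. The sketch below shows the standard definitions; the function name `binary_metrics` is hypothetical, the counts are made up and are not the study's data, and treating late-stage cancer as the positive class is an assumption, since the abstract does not state which class is positive.

```python
from typing import Dict


def _ratio(numerator: int, denominator: int) -> float:
    """Divide two counts, returning NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute standard diagnostic metrics from confusion-matrix counts.

    tp/fn: positive samples predicted positive/negative.
    tn/fp: negative samples predicted negative/positive.
    """
    return {
        "sensitivity": _ratio(tp, tp + fn),  # share of positives that are detected
        "specificity": _ratio(tn, tn + fp),  # share of negatives that are rejected
        "ppv": _ratio(tp, tp + fp),          # share of positive calls that are correct
        "npv": _ratio(tn, tn + fn),          # share of negative calls that are correct
    }


# Made-up counts for illustration only (assumed positive class: late stage).
metrics = binary_metrics(tp=30, fp=20, tn=80, fn=10)
```

Unlike sensitivity and specificity, the predictive values depend on how common the positive class is in the evaluated data, so a positive predictive value well below the sensitivity is expected when positive samples are the minority.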

Place, publisher, year, edition, pages
Cham: Springer, 2023
Series
Lecture Notes in Computer Science, ISSN 0302-9743, E-ISSN 1611-3349 ; 13919
Keywords
cancer stage, circulating microRNA, liquid biopsy, machine learning, Clinical research, Diseases, Forecasting, Learning algorithms, RNA, Cancer models, Comparative analyzes, Diagnostics informations, Late stage, Machine learning algorithms, Machine-learning, Potential sources
National Category
Bioinformatics (Computational Biology)
Research subject
Bioinformatics
Identifiers
urn:nbn:se:his:diva-23058 (URN); 10.1007/978-3-031-34953-9_8 (DOI); 001313788200008 (); 2-s2.0-85164958861 (Scopus ID); 978-3-031-34952-2 (ISBN); 978-3-031-34953-9 (ISBN)
Conference
10th International Work-Conference on Bioinformatics and Biomedical Engineering, IWBBIO 2023, Meloneras, 12-14 July 2023, Code 297199
Funder
Knowledge Foundation, 20170302; Knowledge Foundation, 20200014; Swedish Research Council, 2022–06725
Note

Part of the book sub series: Lecture Notes in Bioinformatics (LNBI), Electronic ISSN 2366-6331, Print ISSN 2366-6323

© 2023, The Author(s), under exclusive license to Springer Nature Switzerland AG.

This work was supported by the University of Skövde, Sweden under grants from the Knowledge Foundation (20170302, 20200014). The computations were enabled by resources provided by the National Academic Infrastructure for Supercomputing in Sweden (NAISS) at Chalmers University of Technology, partially funded by the Swedish Research Council through grant agreement no. 2022–06725.

Available from: 2023-07-31 Created: 2023-07-31 Last updated: 2025-05-09 Bibliographically approved

Open Access in DiVA

fulltext_only_paper_1_paper_4 (5721 kB), 120 downloads
File information
File name: FULLTEXT05.pdf; File size: 5721 kB; Checksum (SHA-512):
bd54ba72b0aecfe492dc48a05bf40be3058417d31034b2db4e7130b941a4714ee208671ff6905dec8d51ff3ba7f9e227de9890e6e3071aa92bb14dab2b186443
Type: fulltext; Mimetype: application/pdf
