Change search
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf
The impact of missing data imputation on HCC survival prediction: Exploring the combination of missing data imputation with data-level methods such as clustering and oversampling
KTH, School of Engineering Sciences (SCI).
KTH, School of Engineering Sciences (SCI).
2018 (English)Independent thesis Basic level (degree of Bachelor), 10 credits / 15 HE creditsStudent thesis
Abstract [en]

The area of data imputation, which is the process of replacing missing data with substituted values, has been covered quite extensively in recent years. The literature on the practical impact of data imputation however, remains scarce. This thesis explores the impact of some of the state of the art data imputation methods on HCC survival prediction and classification in combination with data-level methods such as oversampling. More specifically, it explores imputation methods for mixed-type datasets and their impact on a particular HCC dataset. Previous research has shown that, the newer, more sophisticated imputation methods outperform simpler ones when evaluated with normalized root mean square error (NRMSE). Contrary to intuition however, the results of this study show that when combined with other data-level methods such as clustering and oversampling, the differences in imputation performance does not always impact classification in any meaningful way. This might be explained by the noise that is introduced when generating synthetic data points in the oversampling process. The results also show that one of the more sophisticated imputation methods, namely MICE, is highly dependent on prior assumptions about the underlying distributions of the dataset. When those assumptions are incorrect, the imputation method performs poorly and has a considerable negative impact on classification.

Abstract [sv]

Forskningen kring data imputation, processen där man ersätter saknade data med substituerade värden, har varit omfattande de senaste åren. Litteraturen om den praktiska inverkan som data imputation metoder har på klassificering är dock otillräcklig. Det här kandidatexamensarbetet utforskar den inverkan som de nyare imputation metoderna har på HCC överlevnads klassificering i kombination med andra data-nivå metoder så som översampling. Mer specifikt, så utforskar denna studie imputations metoder för heterogena dataset och deras inverkan på ett specifikt HCC dataset. Tidigare forskning har visat att de nyare, mer sofistikerade imputations metoderna presterar bättre än de mer enkla metoderna när de utvärderas med normalized root mean square error (NRMSE). I motsats till intuition, så visar resultaten i denna studie att när imputation kombineras med andra data-nivå metoder så som översampling och klustring, så påverkas inte klassificeringen alltid på ett meningsfullt sätt. Detta kan förklaras med att brus introduceras i datasetet när syntetiska punkter genereras i översampling processen. Resultaten visar också att en av de mer sofistikerade imputation metoderna, nämligen MICE, är starkt beroende på tidigare antaganden som görs om de underliggande fördelningarna i datasetet. När dessa antaganden är inkorrekta så presterar imputations metoden dåligt och har en negativ inverkan på klassificering.

Place, publisher, year, edition, pages
2018. , p. 36
Series
TRITA-SCI-GRU ; 2018-127
Keywords [en]
missing data, imputation, HCC, survival prediction, oversampling
Keywords [sv]
saknade data, imputation, HCC, överlevnads klassificering, översampling
National Category
Engineering and Technology
Identifiers
URN: urn:nbn:se:kth:diva-231023OAI: oai:DiVA.org:kth-231023DiVA, id: diva2:1221489
Supervisors
Examiners
Available from: 2018-06-20 Created: 2018-06-20 Last updated: 2018-06-20Bibliographically approved

Open Access in DiVA

fulltext(1010 kB)3 downloads
File information
File name FULLTEXT01.pdfFile size 1010 kBChecksum SHA-512
7421a3446f32ba455a86491bd9a6d311d90c1bd5c037fd39a7fea7b6dc2b048ac21658e1aa3e60c4b05ce3e12db548a18a980d458e816aef89f57a9b9b34cfeb
Type fulltextMimetype application/pdf

By organisation
School of Engineering Sciences (SCI)
Engineering and Technology

Search outside of DiVA

GoogleGoogle Scholar
Total: 3 downloads
The number of downloads is the sum of all downloads of full texts. It may include eg previous versions that are now no longer available

urn-nbn

Altmetric score

urn-nbn
Total: 18 hits
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf