Change search
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf
Predicting the Unobserved: A statistical analysis of missing data techniques for binary classification
Uppsala University, Disciplinary Domain of Humanities and Social Sciences, Faculty of Social Sciences, Department of Statistics.
2019 (English)Independent thesis Basic level (degree of Bachelor), 10 credits / 15 HE creditsStudent thesis
Abstract [en]

The aim of the thesis is to investigate how the classification performance of random forest and logistic regression differ, given an imbalanced data set with MCAR missing data. The performance is measured in terms of accuracy and sensitivity. Two analyses are performed: one with a simulated data set and one application using data from the Swedish population registries. The simulation study is created to have the same class imbalance at 1:5. The missing values are handled using three different techniques: complete case analysis, predictive mean matching and mean imputation. The thesis concludes that logistic regression and random forest are on average equally accurate, with some instances of random forest outperforming logistic regression. Logistic regression consistently outperforms random forest with regards to sensitivity. This implies that logistic regression may be the best option for studies where the goal is to accurately predict outcomes in the minority class. None of the missing data techniques stood out in terms of performance.

Place, publisher, year, edition, pages
2019. , p. 24
Keywords [en]
Random forest, logistic regression, imputation, classification, MCAR, missing data, imbalanced data
National Category
Probability Theory and Statistics
Identifiers
URN: urn:nbn:se:uu:diva-388581OAI: oai:DiVA.org:uu-388581DiVA, id: diva2:1333954
Subject / course
Statistics
Educational program
Bachelor Programme in Political Science
Supervisors
Available from: 2019-07-02 Created: 2019-07-02 Last updated: 2019-07-02Bibliographically approved

Open Access in DiVA

fulltext(705 kB)15 downloads
File information
File name FULLTEXT01.pdfFile size 705 kBChecksum SHA-512
428d97a6e55f0416f4bc944d57b8ba43e3b78babbce7a102f8634c91238009720b835a49ef1995a77a9643b2ddb0ac2a4263c12825fc964d257f2a4534c396e5
Type fulltextMimetype application/pdf

By organisation
Department of Statistics
Probability Theory and Statistics

Search outside of DiVA

GoogleGoogle Scholar
Total: 15 downloads
The number of downloads is the sum of all downloads of full texts. It may include eg previous versions that are now no longer available

urn-nbn

Altmetric score

urn-nbn
Total: 59 hits
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf