Evaluation of logistic regression and random forest classification based on prediction accuracy and metadata analysis
Linnaeus University, Faculty of Technology, Department of Mathematics.
2014 (English). Independent thesis, Basic level (degree of Bachelor), 10 credits / 15 HE credits. Student thesis.
Abstract [en]

Model selection is an important part of classification. In this thesis we study two classification models, logistic regression and random forest. They are compared and evaluated based on prediction accuracy and metadata analysis. The models were trained on 25 diverse datasets. We calculated the prediction accuracy of both models using RapidMiner. We also collected metadata for the datasets: the number of observations, the number of predictor variables and the number of classes in the response variable.

The prediction accuracies of logistic regression and random forest are significantly correlated, with a correlation coefficient of 0.60 and confidence interval [0.29, 0.79]. The models appear to perform similarly across the datasets, with performance influenced more by the choice of dataset than by the choice of model. Random forest, with an average prediction accuracy of 81.66%, performed better on these datasets than logistic regression, with an average prediction accuracy of 73.07%. The difference is, however, not statistically significant, with a p-value of 0.088 for Student's t-test.
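
As an illustration of how a confidence interval for a correlation of this kind can be obtained, the sketch below applies the Fisher z-transformation to the reported coefficient (0.60) and number of datasets (25). The abstract does not state which method or confidence level the thesis used, so the 95% level and the Fisher approach are assumptions, and the function names `pearson_r` and `fisher_ci` are hypothetical. Under these assumptions the interval comes out at roughly [0.27, 0.80], close to but not identical to the reported [0.29, 0.79].

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def fisher_ci(r, n, z_crit=1.959964):
    """Approximate confidence interval for a correlation r from n pairs.

    Uses the Fisher z-transformation; z_crit defaults to the two-sided
    95% normal quantile (an assumption, not taken from the thesis).
    """
    z = math.atanh(r)                 # transform r to the z scale
    se = 1.0 / math.sqrt(n - 3)       # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# Reported values from the abstract: r = 0.60 across 25 datasets.
low, high = fisher_ci(0.60, 25)
print(f"[{low:.2f}, {high:.2f}]")
```

With per-dataset accuracies at hand, `pearson_r` would supply the coefficient passed to `fisher_ci`; the individual accuracies are not given in the abstract, so none are shown here.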

Multiple linear regression analysis reveals that none of the analysed metadata have a significant linear relationship with logistic regression performance: the regression of logistic regression performance on the metadata has a p-value of 0.66. Results for random forest are similar: the regression of random forest performance on the metadata has a p-value of 0.89, so none of the analysed metadata have a significant linear relationship with random forest performance either.

We conclude that the prediction accuracies of logistic regression and random forest are correlated. Random forest performed somewhat better on the studied datasets, but the difference is not statistically significant. The studied metadata do not appear to have a significant effect on the prediction accuracy of either model.

Place, publisher, year, edition, pages
2014. 42 p.
Keyword [en]
classification, logistic regression, random forest, metadata
National Category
Mathematics
Identifiers
URN: urn:nbn:se:lnu:diva-35126
OAI: oai:DiVA.org:lnu-35126
DiVA: diva2:724982
Subject / course
Matematisk statistik
Educational program
Applied Mathematics Programme, 180 credits
Presentation
2014-06-03, 09:15 (English)
Available from: 2014-06-16
Created: 2014-06-13
Last updated: 2014-06-16
Bibliographically approved

Open Access in DiVA

awalinderthesis (959 kB), 3052 downloads
File information
File name: FULLTEXT01.pdf
File size: 959 kB
Checksum (SHA-512): 67bd4795dfa954a59cedbffa8884f6b51cfeb26df9265e488cae4d93ee07c448fe91e547af0fa0ccdedae3bfc9022714111c5a8d4870b4eacdcc47a37b5d25ee
Type: fulltext
Mimetype: application/pdf
