Evaluation of logistic regression and random forest classification based on prediction accuracy and metadata analysis
Independent thesis Basic level (degree of Bachelor), 10 credits / 15 HE credits
Student thesis
Model selection is an important part of classification. In this thesis we study the two classification models logistic regression and random forest. They are compared and evaluated based on prediction accuracy and metadata analysis. The models were trained on 25 diverse datasets. We calculated the prediction accuracy of both models using RapidMiner. We also collected metadata for the datasets concerning number of observations, number of predictor variables and number of classes in the response variable.
The performances of logistic regression and random forest are significantly correlated, with a correlation coefficient of 0.60 and confidence interval [0.29, 0.79]. The models appear to perform similarly across the datasets, with performance influenced more by the choice of dataset than by the choice of model. Random forest, with an average prediction accuracy of 81.66%, performed better on these datasets than logistic regression, with an average prediction accuracy of 73.07%. However, the difference is not statistically significant, with a p-value of 0.088 for Student's t-test.
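As an illustration of how a confidence interval for a correlation coefficient can be obtained, the following is a minimal sketch using the Fisher z-transformation. The abstract does not state which method or confidence level the thesis used, so the method, the 95% level and the function names `pearson_r` and `fisher_ci` are assumptions made here for illustration. The result need not match the reported bounds exactly.

```python
import math
from statistics import NormalDist


def pearson_r(x, y):
    """Sample Pearson correlation between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fisher_ci(r, n, level=0.95):
    """Approximate confidence interval for a correlation r from n pairs.

    Assumes the Fisher z-transformation; requires n > 3 and |r| < 1.
    """
    z = math.atanh(r)                  # transform r to the z scale
    se = 1.0 / math.sqrt(n - 3)        # approximate standard error of z
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# r = 0.60 and 25 datasets are taken from the abstract; the 95% level is assumed.
lower, upper = fisher_ci(0.60, 25)
print(f"[{lower:.2f}, {upper:.2f}]")
```

With per-dataset accuracies at hand, `pearson_r` would be applied to the two lists of accuracies first, and its result passed to `fisher_ci` together with the number of datasets.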
Multiple linear regression analysis reveals that none of the analysed metadata have a significant linear relationship with logistic regression performance: the regression of logistic regression performance on metadata has a p-value of 0.66. We get similar results for random forest: the regression of random forest performance on metadata has a p-value of 0.89, so none of the analysed metadata have a significant linear relationship with random forest performance either.
We conclude that the prediction accuracies of logistic regression and random forest are correlated. Random forest performed slightly better on the studied datasets but the difference is not statistically significant. The studied metadata does not appear to have a significant effect on prediction accuracy of either model.
Place, publisher, year, edition, pages
2014. 42 p.
classification, logistic regression, random forest, metadata
Identifiers
URN: urn:nbn:se:lnu:diva-35126
OAI: oai:DiVA.org:lnu-35126
DiVA: diva2:724982
Subject / course
Applied Mathematics Programme, 180 credits
2014-06-03, 09:15 (English)