Supervised Learning Techniques: A comparison of the Random Forest and the Support Vector Machine
Independent thesis Basic level (degree of Bachelor), 10 credits / 15 HE credits
Student thesis
This thesis examines the performance of the support vector machine and the random forest models in the context of binary classification. The two techniques are compared, and the better-performing one is used to construct a final parsimonious model. The data set consists of 33 observations and 89 biomarkers as features, with no known dependent variable. The dependent variable is therefore generated through k-means clustering, with a predefined final solution of two clusters. The algorithms are trained using five-fold cross-validation repeated twenty times. The training process reveals that the best-performing versions of the models are a linear support vector machine and a random forest with six randomly selected features at each split. The final comparison of these optimally tuned algorithms on the test set shows that the random forest outperforms the linear-kernel support vector machine: the former classifies all observations in the test set correctly, whilst the latter classifies all but one correctly. Hence, a parsimonious random forest model using the top five features is constructed, and it performs as well on the test set as the original random forest model using all features.
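The two data-preparation steps named in the abstract, generating a binary label with two-cluster k-means and resampling with five-fold cross-validation repeated twenty times, can be illustrated with a minimal Python sketch. It is not the thesis's own code: the function names `two_means_labels` and `repeated_kfold`, the seeds, and the toy data are assumptions made for illustration, and fitting the support vector machine and the random forest is left out.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_means_labels(rows, max_iter=100, seed=0):
    """Label each row 0 or 1 with plain Lloyd's k-means, k fixed at 2.

    Hypothetical helper: the initialisation (two random rows) is an
    assumption, not a detail taken from the thesis.
    """
    rng = random.Random(seed)
    centroids = [list(r) for r in rng.sample(rows, 2)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each row goes to its nearest centroid.
        new_labels = [
            min((0, 1), key=lambda c: squared_distance(row, centroids[c]))
            for row in rows
        ]
        if new_labels == labels:  # assignments stable, so converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in (0, 1):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def repeated_kfold(n_samples, n_folds=5, n_repeats=20, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold CV repeated n_repeats times.

    Each repeat reshuffles the indices, so the fold membership differs
    between repeats.
    """
    rng = random.Random(seed)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        for f in range(n_folds):
            test_idx = indices[f::n_folds]  # every n_folds-th shuffled index
            held_out = set(test_idx)
            train_idx = [i for i in indices if i not in held_out]
            yield train_idx, test_idx


# Toy data (made up): two well-separated groups of two-feature rows.
toy_rows = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]]
toy_labels = two_means_labels(toy_rows)

# 5 folds x 20 repeats gives 100 train/validation splits.
splits = list(repeated_kfold(n_samples=20))
```

In a full pipeline, each candidate model would be fitted on `train_idx` and scored on `test_idx` for every split, and the scores averaged per tuning setting. Any held-out test set would be separated before this resampling so that it plays no part in tuning.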
Place, publisher, year, edition, pages
2016. 57 p.
Keywords
machine learning, biomarkers, cross-validation, receiver operating characteristic, k-means clustering, feature selection, binary classification
Probability Theory and Statistics
Identifiers
URN: urn:nbn:se:uu:diva-274768
OAI: oai:DiVA.org:uu-274768
DiVA: diva2:897594
Pharma Consulting Group
Subject / course