Change search
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf
Evaluation of Calibration Methods to Adjust for Infrequent Values in Data for Machine Learning
Dalarna University, School of Technology and Business Studies, Microdata Analysis.
2018 (English)Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE creditsStudent thesis
Abstract [en]

The performance of supervised machine learning algorithms is highly dependent on the distribution of the target variable. Infrequent values are more di_cult to predict, as there are fewer examples for the algorithm to learn patterns that contain those values. These infrequent values are a common problem with real data, being the object of interest in many _elds such as medical research, _nance and economics, just to mention a few. Problems regarding classi_cation have been comprehensively studied. For regression, on the other hand, few contributions are available. In this work, two ensemble methods from classi_cation are adapted to the regression case. Additionally, existing oversampling techniques, namely SmoteR, are tested. Therefore, the aim of this research is to examine the inuence of oversampling and ensemble techniques over the accuracy of regression models when predicting infrequent values. To assess the performance of the proposed techniques, two data sets are used: one concerning house prices, while the other regards patients with Parkinson's Disease. The _ndings corroborate the usefulness of the techniques for reducing the prediction error of infrequent observations. In the best case, the proposed Random Distribution Sample Ensemble reduced the overall RMSE by 8.09% and the RMSE for infrequent values by 6.44% when compared with the best performing benchmark for the housing data set.

Place, publisher, year, edition, pages
2018.
Keywords [en]
Data mining, resampling, ensemble.
National Category
Economics and Business
Identifiers
URN: urn:nbn:se:du-28134OAI: oai:DiVA.org:du-28134DiVA, id: diva2:1231432
Available from: 2018-07-06 Created: 2018-07-06

Open Access in DiVA

fulltext(1537 kB)31 downloads
File information
File name FULLTEXT01.pdfFile size 1537 kBChecksum SHA-512
c383b9bee92cb6b21b858561543a28fb4587434b16cd33e873e531fade40de6ba333399d47e1235b91ef1489f5106a1a16284c7e5af8e65f5d921be75511c0a8
Type fulltextMimetype application/pdf

By organisation
Microdata Analysis
Economics and Business

Search outside of DiVA

GoogleGoogle Scholar
Total: 31 downloads
The number of downloads is the sum of all downloads of full texts. It may include eg previous versions that are now no longer available

urn-nbn

Altmetric score

urn-nbn
Total: 55 hits
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf