Anomaly Detection in Categorical Data with Interpretable Machine Learning: A random forest approach to classify imbalanced data
Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning.
2019 (English). Independent thesis, Basic level (degree of Bachelor), 10 credits / 15 HE credits. Student thesis
Abstract [en]

Metadata refers to "data about data", which contains information needed to understand the process of data collection. In this thesis, we investigate whether metadata features can be used to detect broken data and how a tree-based interpretable machine learning algorithm can be used for effective classification. The goal of this thesis is twofold. First, we apply a classification scheme using metadata features to detect broken data. Second, we generate feature importance scores to understand the model's logic and reveal the key factors that lead to broken data.

The task, given by the Swedish automotive company Veoneer, is a typical problem of learning from an extremely imbalanced data set, with 97 percent of the data belonging to the healthy class and only 3 percent belonging to the broken class. Furthermore, the whole data set contains only categorical variables on nominal scales, which poses challenges for the learning algorithm. Handling class imbalance for continuous data is relatively well studied, but for categorical data the solution is not straightforward.

In this thesis, we propose a combination of tree-based supervised learning and hyperparameter tuning to identify broken data in a large data set. Our method consists of three phases: data cleaning, which eliminates ambiguous and redundant instances; supervised learning with a random forest; and a random search for hyperparameter optimization of the random forest model.
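The last two phases can be sketched as follows. This is a minimal illustration assuming scikit-learn, not the thesis's actual code: the synthetic data, the three nominal columns, the one-hot encoding step, and the hyperparameter grid are all hypothetical choices made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for nominal metadata (hypothetical, not the Veoneer data):
# three categorical columns with five levels each, roughly 3 percent broken.
rng = np.random.default_rng(0)
n = 2000
X = rng.integers(0, 5, size=(n, 3)).astype(str)
p_broken = np.where(X[:, 0] == "4", 0.12, 0.01)
y = (rng.random(n) < p_broken).astype(int)

# Stratify so the rare broken class appears in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# One-hot encoding is an assumed way to feed nominal variables to the forest.
model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    RandomForestClassifier(random_state=0),
)

# Random search: sample a few hyperparameter settings and score each by AUC.
search = RandomizedSearchCV(
    model,
    param_distributions={
        "randomforestclassifier__n_estimators": [50, 100, 200],
        "randomforestclassifier__max_depth": [3, 5, None],
        "randomforestclassifier__min_samples_leaf": [1, 5, 20],
    },
    n_iter=5,
    scoring="roc_auc",
    cv=3,
    random_state=0,
)
search.fit(X_train, y_train)

# Held-out AUC and impurity-based feature importances of the tuned forest.
auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
importances = search.best_estimator_[-1].feature_importances_
print(search.best_params_, round(auc, 3))
```

Scoring the search by `roc_auc` rather than accuracy matches the evaluation criterion named in the abstract; the importances give one value per one-hot column, which would still need to be aggregated per original feature for interpretation.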

Our results show empirically that a tree-based ensemble method combined with a random search for hyperparameter optimization improves random forest performance in terms of the area under the ROC curve. The model achieved an acceptable classification result and showed that metadata features are capable of detecting broken data and of providing an interpretable result by identifying the key features of the classification model.
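To illustrate why the area under the ROC curve is a natural metric at a 97/3 class split, the sketch below computes it from its rank interpretation: the probability that a randomly chosen broken instance is scored higher than a randomly chosen healthy one, with ties counted as one half. The labels and scores are made-up toy values, not results from the thesis.

```python
def roc_auc(labels, scores):
    """AUC as P(broken score > healthy score), ties counted as 1/2."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold=0.5):
    """Fraction of instances classified correctly at the given threshold."""
    return sum((s >= threshold) == (y == 1) for y, s in zip(labels, scores)) / len(labels)


# Toy data with the same imbalance as in the abstract: 3 broken, 97 healthy.
labels = [1] * 3 + [0] * 97

# A trivial model that calls everything healthy looks good on accuracy only.
trivial_scores = [0.0] * 100
trivial_acc = accuracy(labels, trivial_scores)
trivial_auc = roc_auc(labels, trivial_scores)

# A hypothetical informative model: one broken instance is ranked below one
# healthy instance, every other pair is ordered correctly.
informative_scores = [0.9, 0.8, 0.2] + [0.3] + [0.1] * 96
informative_auc = roc_auc(labels, informative_scores)

print(trivial_acc, trivial_auc, informative_auc)
```

The trivial model reaches 97 percent accuracy while its AUC stays at chance level, which is why accuracy alone says little on data this imbalanced.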

Place, publisher, year, edition, pages
2019, p. 49
Keywords [en]
machine learning, decision tree, imbalanced data, anomaly detection, random forest
Keywords [sv]
maskininlärning, beslut träd, obalanserat data, anomalitetsdetektering
National Category
Probability Theory and Statistics
Identifiers
URN: urn:nbn:se:liu:diva-158185
ISRN: LIU-IDA/STAT-G--19/001--SE
OAI: oai:DiVA.org:liu-158185
DiVA, id: diva2:1330907
Subject / course
Statistics
Presentation
2019-06-04, Alan Turing, Hus E, Campus Valla, Linköping, 10:15 (English)
Available from: 2019-06-26 Created: 2019-06-26 Last updated: 2019-06-26 Bibliographically approved
