Statistical Learning with Imbalanced Data
Linköping University, Faculty of Arts and Sciences. Linköping University, Department of Computer and Information Science, The Division of Statistics and Machine Learning.
2017 (English). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis.
Abstract [en]

In this thesis, several sampling methods for statistical learning with imbalanced data have been implemented and evaluated with a new metric, imbalanced accuracy. Several modifications and new algorithms have been proposed for intelligent sampling: Border links, Clean Border Undersampling, One-Sided Undersampling Modified, DBSCAN Undersampling, Class Adjusted Jittering, Hierarchical Cluster Based Oversampling, DBSCAN Oversampling, Fitted Distribution Oversampling, Random Linear Combinations Oversampling, and Center Repulsion Oversampling.

A set of requirements for a satisfactory performance metric for imbalanced learning has been formulated, and a new metric for evaluating classification performance has been developed accordingly. The new metric is based on a combination of the worst class accuracy and the geometric mean.
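The two ingredients named above can be sketched as follows. This is only an illustration: the abstract does not state how the thesis combines them into imbalanced accuracy, so no combination is attempted here, and the geometric mean is assumed to be taken over per-class accuracies. The function names are hypothetical.

```python
from math import prod


def per_class_accuracy(y_true, y_pred):
    """Return {class: fraction of that class's samples predicted correctly}."""
    totals, hits = {}, {}
    for t, p in zip(y_true, y_pred):
        totals[t] = totals.get(t, 0) + 1
        hits[t] = hits.get(t, 0) + (1 if t == p else 0)
    return {c: hits[c] / totals[c] for c in totals}


def worst_class_accuracy(y_true, y_pred):
    """Accuracy of the class that is classified worst."""
    return min(per_class_accuracy(y_true, y_pred).values())


def geometric_mean_accuracy(y_true, y_pred):
    """Geometric mean of per-class accuracies (assumed definition)."""
    accs = list(per_class_accuracy(y_true, y_pred).values())
    return prod(accs) ** (1.0 / len(accs))


# Toy data: 8 majority samples, 2 minority samples, one minority error.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [1, 0]

worst = worst_class_accuracy(y_true, y_pred)
gmean = geometric_mean_accuracy(y_true, y_pred)
```

On this toy data the plain accuracy is 0.9, while the worst class accuracy is 0.5, which shows why plain accuracy can hide poor minority-class performance.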

In the testing framework, the nonparametric Friedman test and the post hoc Nemenyi test have been used to assess the performance of classifiers, sampling algorithms, and combinations of classifiers and sampling algorithms on several data sets. A new approach to detecting algorithms with dominating and dominated performance has been proposed, together with a new way of visualizing the results in a network.
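A minimal sketch of the rank-based comparison behind these tests is given below, using the textbook Friedman chi-square statistic (without tie correction) and the Nemenyi critical difference. It is not the thesis's testing framework; the function names and the toy scores are hypothetical, and the critical value `q_alpha` is assumed to be supplied by the caller from a published table.

```python
from math import sqrt


def average_ranks(scores):
    """scores[i][j] is the score of algorithm j on data set i (higher is better).

    Returns the average rank of each algorithm; rank 1 is best and
    tied scores share the mean of the ranks they occupy.
    """
    n, k = len(scores), len(scores[0])
    sums = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: -row[j])
        pos = 0
        while pos < k:
            end = pos
            while end + 1 < k and row[order[end + 1]] == row[order[pos]]:
                end += 1
            mean_rank = (pos + end) / 2 + 1
            for idx in order[pos:end + 1]:
                sums[idx] += mean_rank
            pos = end + 1
    return [s / n for s in sums]


def friedman_statistic(scores):
    """Friedman chi-square statistic computed from average ranks."""
    n, k = len(scores), len(scores[0])
    ranks = average_ranks(scores)
    return 12 * n / (k * (k + 1)) * (sum(r * r for r in ranks) - k * (k + 1) ** 2 / 4)


def nemenyi_critical_difference(q_alpha, k, n):
    """Two algorithms differ significantly if their average ranks
    differ by at least this amount."""
    return q_alpha * sqrt(k * (k + 1) / (6 * n))


# Hypothetical scores: 4 data sets (rows), 3 algorithms (columns).
scores = [
    [0.90, 0.80, 0.70],
    [0.85, 0.80, 0.60],
    [0.95, 0.90, 0.70],
    [0.80, 0.75, 0.70],
]
ranks = average_ranks(scores)
stat = friedman_statistic(scores)
# q_alpha = 2.343 is the commonly tabulated value for k = 3 at alpha = 0.05.
cd = nemenyi_critical_difference(2.343, k=3, n=4)
```

The Friedman statistic is compared against a chi-square distribution with k - 1 degrees of freedom; only if it rejects the null hypothesis of equal performance is the pairwise Nemenyi comparison meaningful.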

From experiments on simulated data and several real data sets, we conclude that: i) different classifiers are not equally sensitive to sampling algorithms; ii) the performance of sampling algorithms differs within specific classifiers; iii) oversampling algorithms perform better than undersampling algorithms; iv) Random Oversampling and Random Undersampling outperform many well-known sampling algorithms; v) our proposed algorithms Hierarchical Cluster Based Oversampling, DBSCAN Oversampling with FDO, and Class Adjusted Jittering perform much better than other algorithms; vi) a few good combinations of a classifier and a sampling algorithm may boost classification performance, and a few bad combinations may spoil it, but the majority of combinations do not differ significantly in performance.
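For reference, the two random baselines mentioned in conclusion iv) can be sketched as below. This is a generic illustration, not the thesis's implementation: balancing every class to exactly the largest (or smallest) class size is an assumption, and the function names are hypothetical.

```python
import random
from collections import defaultdict


def _group_by_class(X, y):
    """Collect feature rows per class label."""
    groups = defaultdict(list)
    for features, label in zip(X, y):
        groups[label].append(features)
    return groups


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows until every class matches the largest one."""
    rng = random.Random(seed)
    groups = _group_by_class(X, y)
    target = max(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for features in rows + extra:
            X_out.append(features)
            y_out.append(label)
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to match the smallest one."""
    rng = random.Random(seed)
    groups = _group_by_class(X, y)
    target = min(len(rows) for rows in groups.values())
    X_out, y_out = [], []
    for label, rows in groups.items():
        for features in rng.sample(rows, target):
            X_out.append(features)
            y_out.append(label)
    return X_out, y_out


# Toy data: 8 majority rows and 2 minority rows.
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2
X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)
```

Oversampling keeps all original rows and adds duplicates of minority rows, whereas undersampling discards majority rows, which is one plausible reason the two families behave differently across classifiers.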

Place, publisher, year, edition, pages
2017. 90 p.
Keyword [en]
imbalanced learning, sampling algorithms, intelligent sampling
National Category
Probability Theory and Statistics
Identifiers
URN: urn:nbn:se:liu:diva-139168
ISRN: LIU-IDA/STAT-A--17/012--SE
OAI: oai:DiVA.org:liu-139168
DiVA: diva2:1119403
Subject / course
Statistics
Presentation
2017-06-02, Von Neumann, Hus B, IDA, Linköpings Universitet, LINKÖPING, 15:30 (English)
Supervisors
Examiners
Available from: 2017-07-04 Created: 2017-07-04 Last updated: 2017-07-04 Bibliographically approved

Open Access in DiVA

fulltext (6184 kB), 84 downloads
File information
File name: FULLTEXT01.pdf
File size: 6184 kB
Checksum (SHA-512): 11bca581219a21909a8f7f21df90765fa650fa6ec0efead9d3964558fe333cea97e7b24404be563e05269412dd7608a2fe783d45d47a5714257b54b0d6146c97
Type: fulltext
Mimetype: application/pdf

Search in DiVA

By author/editor
Shipitsyn, Aleksey