Digitala Vetenskapliga Arkivet

Evaluating Classification and Sampling Methods for Customer Churn Prediction under Varying Imbalance Levels
Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
2024 (English). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis.
Abstract [en]

This thesis evaluates the performance of Random Forest and XGBoost classifiers combined with the upsampling techniques SMOTE, ADASYN, and GNUS for customer churn prediction on imbalanced datasets. Varying imbalance levels are simulated using datasets with churn ratios of 15%, 10%, 5%, and 1%, and the models are assessed with metrics including F1-Score, ROC AUC, PR AUC, MCC, and Cohen's Kappa. By examining imbalance ratios ranging from moderate to extremely severe, the study addresses a critical research gap: it offers insights into enhancing predictive accuracy for highly imbalanced datasets while evaluating the behavior and robustness of the models across different levels of data imbalance, with a particular focus on applications within the telecommunications sector.
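As a rough illustration of this setup, the sketch below builds a synthetic imbalanced dataset, upsamples only the training split with SMOTE, and fits an XGBoost classifier. It assumes the scikit-learn, imbalanced-learn, and xgboost packages; the synthetic data, the 5% churn ratio, and all hyperparameters are illustrative placeholders rather than the thesis's datasets or settings.

```python
# Minimal sketch of a SMOTE + XGBoost pipeline, assuming scikit-learn,
# imbalanced-learn, and xgboost. Data and parameters are placeholders.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from xgboost import XGBClassifier

# Synthetic stand-in for a churn dataset with a 5% minority (churn) class.
X, y = make_classification(
    n_samples=20_000, n_features=20, weights=[0.95, 0.05], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=42
)

# Upsample only the training split so the test set keeps the original imbalance.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

# Fit the XGBoost classifier on the rebalanced training data.
model = XGBClassifier(n_estimators=300, eval_metric="logloss", random_state=42)
model.fit(X_res, y_res)
```

The same pattern applies to the other classifier and upsampling combinations by swapping in RandomForestClassifier, ADASYN, or a Gaussian-noise upsampler.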

The study found that combining XGBoost with the SMOTE upsampling technique consistently delivered superior predictive performance across all evaluation metrics and levels of class imbalance, including extreme cases such as a 1% churn rate. This combination demonstrated exceptional robustness and maintained high accuracy, making it the most effective strategy identified.

In contrast, Random Forest models showed adequate performance only on moderately imbalanced datasets and declined significantly under extreme imbalance, regardless of the upsampling method used. The research highlights that while ROC AUC and PR AUC remained relatively stable, higher levels of imbalance led to fluctuations in MCC, Kappa, and F1-Scores for most models, with the exception of the XGBoost with SMOTE combination, which consistently maintained high reliability and predictive accuracy.
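The metrics referred to above can be computed as in the following sketch, which continues from the hypothetical model and held-out test split of the previous example. It assumes scikit-learn, uses average precision as the PR AUC estimate, and applies a default 0.5 decision threshold as an illustrative choice rather than the thesis's procedure.

```python
# Sketch of the evaluation metrics named above, computed with scikit-learn.
# Continues from the hypothetical `model`, `X_test`, and `y_test` of the
# previous sketch; the 0.5 decision threshold is an illustrative default.
from sklearn.metrics import (
    f1_score,
    roc_auc_score,
    average_precision_score,
    matthews_corrcoef,
    cohen_kappa_score,
)

proba = model.predict_proba(X_test)[:, 1]   # predicted churn probability
pred = (proba >= 0.5).astype(int)           # hard labels at the 0.5 threshold

print("F1-Score      :", f1_score(y_test, pred))
print("ROC AUC       :", roc_auc_score(y_test, proba))
print("PR AUC        :", average_precision_score(y_test, proba))  # average precision
print("MCC           :", matthews_corrcoef(y_test, pred))
print("Cohen's Kappa :", cohen_kappa_score(y_test, pred))
```

Note that F1, MCC, and Kappa depend on the chosen threshold, whereas ROC AUC and PR AUC are computed from the predicted probabilities, which is one reason the ranking-based metrics can appear more stable under growing imbalance.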

Place, publisher, year, edition, pages
2024.
Keywords [en]
Customer Churn, Imbalance Challenges, XGBoost, Random Forest, Sampling Techniques
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:su:diva-242699
OAI: oai:DiVA.org:su-242699
DiVA, id: diva2:1955590
Available from: 2025-04-30. Created: 2025-04-30.

Open Access in DiVA

fulltext (2398 kB)
File information
File name: FULLTEXT01.pdf
File size: 2398 kB
Checksum SHA-512: 95abf1ba0ecfd97503379d390e4f177cac7b3a83f6674bcb310b579de0abdbd66ae02a1bc2f2c6c9075bb9e48028a2902ec104997baee87e8e23ef8338605f2e
Type: fulltext
Mimetype: application/pdf

Search in DiVA

By author/editor: Imani, Mehdi
By organisation: Department of Computer and Systems Sciences
Computer Sciences
