Evaluating Classification and Sampling Methods for Customer Churn Prediction under Varying Imbalance Levels
2024 (English). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits
Student thesis
Abstract [en]
This thesis evaluates Random Forest and XGBoost classifiers combined with upsampling techniques (SMOTE, ADASYN, GNUS) for customer churn prediction on imbalanced datasets. Varying imbalance levels are simulated using datasets with churn ratios of 15%, 10%, 5%, and 1%. The models are assessed using metrics such as F1-Score, ROC AUC, PR AUC, MCC, and Cohen's Kappa. By examining imbalance ratios ranging from moderate to extremely severe, the study addresses a critical research gap. It offers insights into improving predictive accuracy on highly imbalanced datasets and into the behavior and robustness of the models across imbalance levels, with a particular focus on applications in the telecommunications sector.
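As a rough illustration of how datasets with the churn ratios above could be derived from a single labelled dataset, the sketch below keeps every non-churner and randomly drops churners until the target ratio is reached. This is an assumption for illustration only: the abstract does not state how the thesis constructed its datasets, and the function name `subsample_to_churn_ratio` and the toy labels are hypothetical.

```python
import random


def subsample_to_churn_ratio(labels, target_ratio, seed=0):
    """Return sorted row indices that keep all non-churners (label 0) and
    just enough randomly chosen churners (label 1) so that churners make
    up roughly `target_ratio` of the selected rows."""
    rng = random.Random(seed)
    churn_idx = [i for i, y in enumerate(labels) if y == 1]
    stay_idx = [i for i, y in enumerate(labels) if y == 0]
    # Solve k / (k + n_stay) = target_ratio for the number k of churners to keep.
    k = round(target_ratio * len(stay_idx) / (1.0 - target_ratio))
    if k > len(churn_idx):
        raise ValueError("not enough churners to reach the target ratio")
    return sorted(rng.sample(churn_idx, k) + stay_idx)


# Hypothetical toy labels: 2,000 churners among 10,000 customers (20% churn).
labels = [1] * 2000 + [0] * 8000
for ratio in (0.15, 0.10, 0.05, 0.01):
    idx = subsample_to_churn_ratio(labels, ratio)
    achieved = sum(labels[i] for i in idx) / len(idx)
    print(f"target {ratio:.2f}: {len(idx)} rows, achieved ratio {achieved:.4f}")
```

In a setup like this, any upsampling step (SMOTE, ADASYN, GNUS) would normally be applied to the training split only, so that the test split keeps the simulated churn ratio.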
The study found that combining XGBoost with the SMOTE upsampling technique consistently delivers superior predictive performance across all evaluation metrics and levels of class imbalance, including extreme cases like a 1% churn rate. This combination demonstrated exceptional robustness and maintained high accuracy, making it the most effective strategy identified.
In contrast, Random Forest models showed adequate performance only in moderately imbalanced datasets and experienced significant declines under extreme imbalance, regardless of the upsampling method used. The research highlights that while metrics such as ROC AUC and PR AUC remained relatively stable, higher levels of imbalance led to fluctuations in MCC, Kappa, and F1-Scores for most models. The exception was the XGBoost with SMOTE combination, which consistently maintained high reliability and predictive accuracy.
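To make the threshold-based metrics named above concrete, the following sketch computes F1-Score, MCC, and Cohen's Kappa from confusion-matrix counts using their standard definitions. The classifier is hypothetical: it is assumed to have a fixed true-positive rate of 0.80 and false-positive rate of 0.05 at every churn ratio, and these numbers are not results from the thesis. Under that assumption the example only shows that these three metrics depend on class prevalence even when the per-class error rates do not change.

```python
import math


def confusion_counts(n, churn_ratio, tpr, fpr):
    """Expected confusion-matrix counts (tp, fp, fn, tn) for n customers."""
    pos = n * churn_ratio
    neg = n - pos
    tp = tpr * pos
    fp = fpr * neg
    return tp, fp, pos - tp, neg - fp


def f1_score(tp, fp, fn, tn):
    # Harmonic mean of precision and recall; tn is unused by definition.
    return 2 * tp / (2 * tp + fp + fn)


def mcc(tp, fp, fn, tn):
    # Matthews correlation coefficient; defined as 0 when a margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def cohen_kappa(tp, fp, fn, tn):
    # Observed agreement corrected for the agreement expected by chance.
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    return (observed - expected) / (1 - expected) if expected != 1 else 0.0


# Hypothetical classifier: TPR and FPR are held fixed across churn ratios.
for ratio in (0.15, 0.10, 0.05, 0.01):
    counts = confusion_counts(100_000, ratio, tpr=0.80, fpr=0.05)
    print(f"churn {ratio:.2f}: F1={f1_score(*counts):.3f} "
          f"MCC={mcc(*counts):.3f} Kappa={cohen_kappa(*counts):.3f}")
```

Because ROC AUC is built from true-positive and false-positive rates, it does not change with prevalence in this idealised setting, whereas F1-Score, MCC, and Kappa fall as churners become rarer.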
Place, publisher, year, edition, pages
2024.
Keywords [en]
Customer Churn, Imbalance Challenges, XGBoost, Random Forest, Sampling Techniques
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:su:diva-242699; OAI: oai:DiVA.org:su-242699; DiVA, id: diva2:1955590
2025-04-30 2025-04-30