A Comparison of Resampling Techniques to Handle the Class Imbalance Problem in Machine Learning: Conversion prediction of Spotify Users - A Case Study
KTH, School of Computer Science and Communication (CSC).
KTH, School of Computer Science and Communication (CSC).
2017 (English)
Independent thesis Basic level (degree of Bachelor), 10 credits / 15 HE credits
Student thesis
Alternative title
Samplingmetoder för att hantera obalanserade klasser i maskininlärning : En fältstudie om prediktion av Spotify-användardes vilja att uppgradera produkten (Swedish)
Abstract [en]

Spotify uses a freemium business model: it offers two main products, a limited free version and a premium version for paying customers. In this study we investigated how well machine learning models can predict conversion from free to premium, given user activity data. Predicting which users convert was a class-imbalanced problem, meaning that converters were outnumbered by non-converters. Three classification methods were investigated: logistic regression, decision trees, and gradient boosting trees. We also studied whether different resampling methods, which balance the training datasets, can improve the models’ classification performance. We showed that machine learning models can find patterns in user data that could be used to predict conversion. Additionally, for all investigated classification methods, resampling improved model performance. The best-performing methods in our study were logistic regression and gradient boosting trees trained on data oversampled to equal numbers of converters and non-converters.
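As an illustration of oversampling up to equal class counts, the sketch below randomly duplicates minority-class rows until every class is as frequent as the largest one. It is a minimal sketch under stated assumptions, not the thesis implementation: the function name `oversample_to_balance`, the toy feature rows, and the seed are all hypothetical, and only the Python standard library is used.

```python
import random
from collections import Counter


def oversample_to_balance(rows, labels, seed=0):
    """Randomly duplicate minority-class rows until all classes are equally frequent.

    Hypothetical helper for illustration; apply it to the training split only,
    so that duplicated rows never leak into the evaluation data.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # size of the largest class

    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # All original rows belonging to this class.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        # Draw duplicates with replacement until the class reaches the target size.
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Toy data (made up): 2 converters (label 1) among 10 users.
rows = [[i, i % 3] for i in range(10)]
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

bal_rows, bal_labels = oversample_to_balance(rows, labels)
print(Counter(bal_labels))
```

The balanced rows and labels could then be passed to any classifier, such as a logistic regression, while the test set keeps its original skewed class distribution.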

Abstract [sv]

In this study we investigated whether it is possible, given user data from Spotify users, to predict which users convert from the free version to the premium version. Because more users do not convert than convert, this was a problem with imbalanced classes. Imbalanced classes are a well-known problem in machine learning. Three machine learning methods were investigated: logistic regression, decision trees, and gradient boosting trees. We also investigated preprocessing methods that give the training data a more even distribution between the classes, to see whether such preprocessing could improve the models' ability to classify new users. We showed that machine learning methods can, given user data, find patterns in the data that could be used to predict which users convert. For all three machine learning methods, preprocessing the training data toward a more even class distribution gave better results. Of the models investigated, logistic regression and gradient boosting trees performed best when trained on preprocessed data in which random duplicates of users who had converted were added to the dataset until the classes were completely evenly distributed.

Place, publisher, year, edition, pages
2017.
National Category
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-208876
OAI: oai:DiVA.org:kth-208876
DiVA: diva2:1108465
Subject / course
Computer Science
Educational program
Master of Science in Engineering - Computer Science and Technology
Supervisors
Examiners
Available from: 2017-06-17 Created: 2017-06-12 Last updated: 2017-06-17
Bibliographically approved

Open Access in DiVA

fulltext (566 kB), 12 downloads
File information
File name: FULLTEXT02.pdf
File size: 566 kB
Checksum (SHA-512): 52cf20e3492295c8fb7520f28351df84ca2607fdf0e894f00b14125d93a352b45028c5b128906b17176ba80d9a527bf59df2767f83715855ec0f8f0e67c26cf9
Type: fulltext
Mimetype: application/pdf
