Predicting the area of industry: Using machine learning to classify SNI codes based on business descriptions, a degree project at SCB
Umeå University, Faculty of Social Sciences, Umeå School of Business and Economics (USBE), Statistics.
2019 (English)
Independent thesis Basic level (degree of Bachelor), 10 credits / 15 HE credits
Student thesis
Alternative title
Att prediktera näringsgrensindelning: Ett examensarbete om tillämpning av maskininlärning för att klassificera SNI-koder utifrån företagsbeskrivningar hos SCB (Swedish)
Abstract [en]

This study is part of an experimental project at Statistics Sweden (SCB) that uses natural language processing and machine learning to predict the industry codes of Swedish businesses from their business descriptions. The response variable consists of the 30 most frequent of the 88 main groups of the Swedish Standard Industrial Classification (SNI), each of which represents a unique area of industry. Business description text was transformed into numerical features using the bag-of-words model. SNI codes are set when companies are founded, and human error can lead to incorrect codes. Using data from the Swedish Companies Registration Office, the purpose is to determine whether gradient boosting can achieve classification accuracy high enough to automatically correct SNI codes where the predicted code differs from the registered one. Today these corrections are made manually. The best gradient boosting model correctly classified 52 percent of the observations, which is not considered high enough to implement automatic code correction in a production environment.
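
The preprocessing described in the abstract can be illustrated with a minimal sketch. It is not the thesis authors' code: the function names, the toy descriptions and the label values are all hypothetical, and the gradient boosting step is left out. The sketch only shows how a response might be restricted to the k most frequent classes, how a bag-of-words count vector might be built from a description, and how classification accuracy is computed.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and extract runs of letters (including å, ä, ö)."""
    return re.findall(r"[a-zåäö]+", text.lower())


def keep_most_frequent_classes(descriptions, codes, k):
    """Keep only observations whose code is among the k most frequent codes."""
    top = {code for code, _ in Counter(codes).most_common(k)}
    kept = [(d, c) for d, c in zip(descriptions, codes) if c in top]
    return [d for d, _ in kept], [c for _, c in kept]


def build_vocabulary(documents):
    """Map each distinct token to a column index, in alphabetical order."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: i for i, tok in enumerate(tokens)}


def bag_of_words(text, vocabulary):
    """Return a token-count vector; tokens outside the vocabulary are ignored."""
    vector = [0] * len(vocabulary)
    for tok, count in Counter(tokenize(text)).items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = count
    return vector


def accuracy(predicted, actual):
    """Share of observations where the predicted code equals the actual code."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Toy data, invented for illustration only (not taken from the thesis).
descriptions = [
    "software development and IT consulting",
    "retail sale of clothing",
    "IT consulting and software",
    "restaurant and catering",
    "retail sale of shoes",
]
codes = ["62", "47", "62", "56", "47"]

# Analogous to keeping the 30 most frequent of 88 main groups; here k = 2.
docs, labels = keep_most_frequent_classes(descriptions, codes, k=2)
vocabulary = build_vocabulary(docs)
features = [bag_of_words(doc, vocabulary) for doc in docs]
```

Count vectors like `features` are the kind of numerical input a gradient boosting classifier would be trained on, and `accuracy` corresponds to the share of correctly classified observations that the abstract reports.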

Place, publisher, year, edition, pages
2019.
Keywords [en]
machine learning, classification, gradient boosting, data analysis, NLP, SNI, SCB
National Category
Probability Theory and Statistics
Identifiers
URN: urn:nbn:se:umu:diva-160806
OAI: oai:DiVA.org:umu-160806
DiVA, id: diva2:1329995
External cooperation
SCB
Available from: 2019-06-25 Created: 2019-06-25 Last updated: 2019-06-25 Bibliographically approved

Open Access in DiVA

fulltext (665 kB), 47 downloads
File information
File name: FULLTEXT01.pdf
File size: 665 kB
Checksum (SHA-512): 0e12d3ae6e081406f9d1a6631a0669779d3a88c0f22971efa72495c24f6b8f9fdeb85229db49d2262e24e64f72bc9ec1d07d333dc6361f94e495d9b7513a4cbd
Type: fulltext
Mimetype: application/pdf
