Utilize OCR text to extract receipt data and classify receipts with common Machine Learning algorithms
Linköping University, Department of Computer and Information Science.
2018 (English)
Independent thesis, Basic level (degree of Bachelor), 10,5 credits / 16 HE credits
Student thesis
Alternative title
Använda OCR-text för att extrahera kvittodata och klassificera kvitton med vanliga maskininlärnings algoritmer (Swedish)
Abstract [en]

This study investigated whether it is feasible to use machine learning tools on OCR-extracted text data to classify receipts and extract specific data points. Two OCR tools were evaluated: the Azure Computer Vision API and the Google Drive REST API. The Google Drive REST API became the main OCR tool in the project because of its impressive performance. The main classification task was to predict which of five given categories a receipt belongs to; a more challenging task was to predict specific subcategories within those five categories. The data points we tried to extract were the date of purchase on the receipt and the total price of the transaction. The classification was mainly done with scikit-learn, while the extraction of data points was achieved with a simple custom-made N-gram model.
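The record does not describe the internals of the custom N-gram model, so the following is only a minimal sketch of one way such an extractor could work, not the authors' implementation. It assumes a bigram context: for each price-like token, the word immediately before it is scored by how often that word preceded the labelled total in training receipts. The price pattern, the function names, and the sample receipts are all hypothetical.

```python
import re
from collections import Counter

# Assumed price format such as 129.50 or 129,50 (an illustration, not from the thesis).
PRICE = re.compile(r"^\d+[.,]\d{2}$")


def tokens(text):
    """Lower-case the OCR text and split it on whitespace."""
    return text.lower().split()


def train(examples):
    """Count which word directly precedes the labelled total in each receipt."""
    context = Counter()
    for text, total in examples:
        toks = tokens(text)
        for prev, tok in zip(toks, toks[1:]):
            if tok == total:
                context[prev] += 1
    return context


def extract_total(text, context):
    """Return the price-like token whose preceding word was seen most often in training."""
    toks = tokens(text)
    candidates = [
        (context[prev], tok)
        for prev, tok in zip(toks, toks[1:])
        if PRICE.match(tok)
    ]
    if not candidates:
        return None
    # Ties are resolved in favour of the earliest candidate.
    return max(candidates, key=lambda c: c[0])[1]


# Hypothetical labelled receipts: (OCR text, total price as it appears in the text).
examples = [
    ("milk 12.90 bread 24.50 total 37.40", "37.40"),
    ("coffee 35.00 cake 42.00 total 77.00", "77.00"),
    ("taxi 250.00 tip 20.00 sum 270.00", "270.00"),
]
context = train(examples)
result = extract_total("pasta 115.00 water 20.00 total 135.00 card 135.00", context)
print(result)
```

The same idea could be applied to dates by swapping the price pattern for a date pattern, although real OCR output would need more tolerant tokenisation than a whitespace split.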

The results were promising, with a cross-validation score of about 94 % for classifying receipts by category using a LinearSVC classifier. Our custom model was successful in 72 % of cases for the price data point, while date extraction was less successful, with an accuracy of 50 %. We still consider this very promising given the simplistic nature of the custom model.
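As a rough illustration of how a cross-validation score for a LinearSVC text classifier can be obtained with scikit-learn, consider the sketch below. The record does not state which features or settings were used, so the TF-IDF features, the three category names, and the tiny sample texts are assumptions made for the example. Scores on such toy data say nothing about the reported 94 %.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical OCR output and category labels, invented for illustration only.
texts = [
    "taxi fare airport trip total 320.00",
    "taxi ride city centre total 145.00",
    "train ticket stockholm total 560.00",
    "bus ticket single trip total 32.00",
    "hotel night double room total 1200.00",
    "hotel breakfast included room total 950.00",
    "hostel bed two nights total 480.00",
    "hotel room one night total 1100.00",
    "restaurant lunch pasta total 115.00",
    "cafe coffee sandwich total 89.00",
    "restaurant dinner wine total 640.00",
    "cafe lunch salad total 98.00",
]
labels = ["travel"] * 4 + ["lodging"] * 4 + ["food"] * 4

# Word unigram and bigram TF-IDF features feeding a linear SVM (assumed feature choice).
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())

# With an integer cv and a classifier, scikit-learn uses stratified folds.
scores = cross_val_score(model, texts, labels, cv=3)
mean_score = scores.mean()
print(f"mean cross-validation accuracy: {mean_score:.2f}")
```

Wrapping the vectorizer and the classifier in one pipeline keeps the TF-IDF vocabulary from being fitted on the held-out fold, which would otherwise inflate the score.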

Place, publisher, year, edition, pages
2018, p. 13
Keywords [en]
Optical character recognition, Machine learning, Receipts
National Category
Information Systems
Identifiers
URN: urn:nbn:se:liu:diva-148350
ISRN: LIU-IDA/LITH-EX-G--18/043--SE
OAI: oai:DiVA.org:liu-148350
DiVA, id: diva2:1215460
External cooperation
IFS
Subject / course
Computer Programming
Presentation
2018-06-08, I204, Campus Valla, Linköping, 08:35 (English)
Supervisors
Examiners
Available from: 2018-06-18
Created: 2018-06-08
Last updated: 2018-06-18
Bibliographically approved

Open Access in DiVA

fulltext (374 kB), 401 downloads
File information
File name: FULLTEXT01.pdf
File size: 374 kB
Checksum (SHA-512): f5c6f968d04a35849bec7aececc6ce4ea847903955e6bfd7de28f2e111dfd0a6bb637fb0cff3b3e8d212bf0dc2cf2f0b5787fc8bfd10163de9292917bc15df2b
Type: fulltext
Mimetype: application/pdf

Search in DiVA

By author/editor
Odd, Joel; Theologou, Emil
By organisation
Department of Computer and Information Science
Information Systems

Total: 401 downloads
The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.

Total: 505 hits