Classifying receipts or invoices from images based on text extraction
Linnaeus University, Faculty of Technology, Department of Computer Science.
2016 (English). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis.
Abstract [en]

Most documents are now stored electronically, and there is high demand for organizing and categorizing them efficiently. Automated text classification has therefore gained significant attention from both science and industry, and has been applied to information retrieval, information filtering, news classification, and similar tasks. The goal of this project is to automatically classify photos in Visma Mobile Scanner as invoices or receipts, based on previously extracted text.

First, several OCR tools available on the market were evaluated to find the most accurate one for text extraction; this turned out to be ABBYY FineReader. The machine learning tool WEKA was used for text classification, with a focus on the Naïve Bayes classifier. Because the Naïve Bayes implementation provided by WEKA does not support some advances in text classification, such as N-grams and Laplace smoothing, an improved Naïve Bayes classifier, specialized for text classification and for invoice/receipt classification in particular, was implemented. The main parts of this research are improving the Naïve Bayes classifier, investigating how it can be adapted to the problem domain, and evaluating its classification accuracy against the generic Naïve Bayes.

Experimental results show that the specialized Naïve Bayes classifier has the highest accuracy. With the Fixed penalty feature applied, the best result was 95.6522% accuracy in cross-validation mode. With more accurate text extraction, the accuracy is even higher.
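To make the two techniques named in the abstract concrete, here is a minimal Python sketch of a multinomial Naïve Bayes text classifier that uses word N-gram features and Laplace (add-one) smoothing. It is an illustration only, not the thesis implementation (which was built around WEKA): the class and function names, the choice to skip features never seen in training, and the four toy training strings are all assumptions made for this example. The Fixed penalty feature mentioned in the abstract is not modeled here.

```python
import math
from collections import Counter, defaultdict


def ngrams(text, n_max=2):
    """Return word n-grams of length 1..n_max from lowercased text."""
    tokens = text.lower().split()
    feats = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            feats.append(" ".join(tokens[i:i + n]))
    return feats


class NaiveBayesTextClassifier:
    """Hypothetical multinomial Naive Bayes with n-grams and Laplace smoothing."""

    def __init__(self, n_max=2, alpha=1.0):
        self.n_max = n_max            # longest n-gram to extract
        self.alpha = alpha            # smoothing constant (1.0 = Laplace)
        self.class_docs = Counter()   # number of training documents per class
        self.feature_counts = defaultdict(Counter)  # class -> feature -> count
        self.vocabulary = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.class_docs[label] += 1
            for feat in ngrams(text, self.n_max):
                self.feature_counts[label][feat] += 1
                self.vocabulary.add(feat)
        return self

    def log_scores(self, text):
        """Return the unnormalized log-posterior of each class for the text."""
        total_docs = sum(self.class_docs.values())
        vocab_size = len(self.vocabulary)
        # Assumption: features never seen in training are ignored.
        feats = [f for f in ngrams(text, self.n_max) if f in self.vocabulary]
        scores = {}
        for label, n_docs in self.class_docs.items():
            counts = self.feature_counts[label]
            total = sum(counts.values())
            score = math.log(n_docs / total_docs)  # log prior
            for feat in feats:
                # Smoothed likelihood: never zero, even for unseen pairs.
                score += math.log(
                    (counts[feat] + self.alpha) / (total + self.alpha * vocab_size)
                )
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy data invented for illustration; real input would be OCR output.
train_texts = [
    "invoice number 1001 due date payment terms 30 days",
    "invoice customer number amount due bank giro",
    "receipt cash total change thank you",
    "receipt card payment total vat thank you for shopping",
]
train_labels = ["invoice", "invoice", "receipt", "receipt"]

clf = NaiveBayesTextClassifier(n_max=2, alpha=1.0).fit(train_texts, train_labels)
print(clf.predict("invoice due date amount"))
print(clf.predict("total cash thank you"))
```

Without smoothing, a single N-gram that never occurred with a class during training would give that class a probability of zero; adding `alpha` to every count avoids this, which matters for noisy OCR text where many tokens are rare.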

Place, publisher, year, edition, pages
2016. 49 p.
Keyword [en]
text classification, Naive Bayes, OCR tools, classifying receipts or invoices
National Category
Computer Science
Identifiers
URN: urn:nbn:se:lnu:diva-54647
OAI: oai:DiVA.org:lnu-54647
DiVA: diva2:949078
External cooperation
Visma Spcs
Educational program
Software Technology Programme, Master Programme, 120 credits
Supervisors
Examiners
Available from: 2016-08-15. Created: 2016-07-15. Last updated: 2016-08-15. Bibliographically approved.

Open Access in DiVA

File information
File name: FULLTEXT01.pdf
File size: 1699 kB
Checksum (SHA-512): c467b23d9e7b8479660613388dce591aea419058bbe7d5fd68be161ae3dce8e922db37aa969cd0ce88c01e157aa3bdc8b2b3f979ba53ca12a6a68e4233d0fad9
Type: fulltext
Mimetype: application/pdf

Author
Kaci, Iuliia
