Digitala Vetenskapliga Arkivet

Towards a fully automated extraction and interpretation of tabular data using machine learning
Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Division of Systems and Control.
2019 (English). Independent thesis, Advanced level (professional degree), 20 credits / 30 HE credits. Student thesis.
Abstract [en]

Motivation

A challenge for researchers at CBCS is to efficiently manage data formats that change frequently. This handling includes importing data into a common format, regardless of the output of the various instruments used. There are commercial solutions available for this process but, to our knowledge, all of them require prior generation of templates to which the data must conform. A significant amount of time is spent on manual pre-processing, converting from one format to another. There are currently no solutions that use pattern recognition to locate and automatically recognise data structures in a spreadsheet.

Problem Definition

The desired solution is a self-learning Software-as-a-Service (SaaS) for automated recognition and loading of data stored in arbitrary formats. The aim of this study is threefold: A) investigate whether unsupervised machine learning methods can be used to label different types of cells in spreadsheets; B) investigate whether a hypothesis-generating algorithm can be used to label different types of cells in spreadsheets; C) advise on choices of architecture and technologies for the SaaS solution.

Method

A pre-processing framework is built that can read any type of spreadsheet and pre-process it into a feature matrix. Different datasets are read and clustered. The usefulness of reducing the dimensionality is also investigated. A hypothesis-driven algorithm is built and adapted to two of the data formats that CBCS uses most frequently. Choices of architecture and technologies for the SaaS solution are discussed, including system design patterns, web development framework and database.
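As a rough illustration of what turning a spreadsheet into a feature matrix can look like, the sketch below encodes each cell of a small grid as a numeric vector. It is not the framework from the thesis: the function names `cell_features` and `feature_matrix` and the six features chosen here are assumptions made for this example only.

```python
from typing import List


def cell_features(value: str, row: int, col: int,
                  n_rows: int, n_cols: int) -> List[float]:
    """Encode one cell as a small numeric vector (hypothetical feature set)."""
    text = value.strip()
    is_empty = float(text == "")
    try:
        float(text)          # note: strings such as "nan" or "inf" also parse
        is_numeric = 1.0
    except ValueError:
        is_numeric = 0.0
    n_alpha = sum(ch.isalpha() for ch in text)
    alpha_ratio = n_alpha / len(text) if text else 0.0
    return [
        is_empty,
        is_numeric,
        alpha_ratio,
        float(len(text)),
        row / max(n_rows - 1, 1),   # relative vertical position in the sheet
        col / max(n_cols - 1, 1),   # relative horizontal position in the sheet
    ]


def feature_matrix(grid: List[List[str]]) -> List[List[float]]:
    """Return one feature row per cell, in row-major order."""
    n_rows = len(grid)
    n_cols = max((len(r) for r in grid), default=0)
    matrix = []
    for r, row in enumerate(grid):
        padded = row + [""] * (n_cols - len(row))  # pad ragged rows with empty cells
        for c, value in enumerate(padded):
            matrix.append(cell_features(value, r, c, n_rows, n_cols))
    return matrix


grid = [["Well", "Signal"], ["A1", "0.52"], ["A2"]]
matrix = feature_matrix(grid)
```

Each row of `matrix` describes one cell, so the result can be passed to a clustering method regardless of which instrument produced the spreadsheet.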

Result

The reading and pre-processing framework is in itself a valuable result, due to its general applicability. No satisfactory results are found with the mini-batch K-means clustering method. When only reading data from one format, the dimensionality can be reduced from 542 to around 40 dimensions. The hypothesis-driven algorithm can consistently interpret the format it is designed for. More work is needed to make it more general.
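For readers unfamiliar with mini-batch K-means, the following is a minimal pure-Python sketch of the technique: each iteration draws a small random batch, assigns every batch point to its nearest centre, and moves that centre towards the point with a per-centre learning rate of 1/count. The function names, the farthest-first initialisation and the toy data are assumptions for illustration; they do not reproduce the thesis's implementation, its 542-dimensional data, or its results.

```python
import random
from typing import List, Sequence

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point: Vector, centers: List[List[float]]) -> int:
    """Index of the centre closest to the given point."""
    return min(range(len(centers)), key=lambda k: sq_dist(point, centers[k]))


def mini_batch_kmeans(data: List[Vector], k: int, batch_size: int = 4,
                      n_iter: int = 50, seed: int = 0) -> List[List[float]]:
    rng = random.Random(seed)
    # Farthest-first initialisation (an assumption made for determinism here).
    centers = [list(data[0])]
    while len(centers) < k:
        far = max(data, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(far))
    counts = [0] * k
    for _ in range(n_iter):
        batch = rng.sample(data, min(batch_size, len(data)))
        # Assign the whole batch first, then update the centres.
        assigned = [nearest(p, centers) for p in batch]
        for p, j in zip(batch, assigned):
            counts[j] += 1
            eta = 1.0 / counts[j]  # per-centre learning rate
            centers[j] = [(1 - eta) * c + eta * x
                          for c, x in zip(centers[j], p)]
    return centers


data = [[0, 0], [0, 1], [1, 0], [1, 1],
        [10, 10], [10, 11], [11, 10], [11, 11]]
centers = mini_batch_kmeans(data, k=2)
labels = [nearest(p, centers) for p in data]
```

On this well-separated toy data the two groups receive different labels. Real spreadsheet cells are far less cleanly separated, which is consistent with the unsatisfactory clustering results reported above.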

Implication

The study contributes to the desired solution in the short term through the hypothesis-generating algorithm, and in a more generalisable way through the unsupervised learning approach. The study also contributes by initiating a conversation around the system design choices.

Place, publisher, year, edition, pages
2019.
Series
UPTEC F, ISSN 1401-5757 ; 19050
Keywords [en]
machine learning, unsupervised machine learning, spreadsheets
National Category
Information Systems
Identifiers
URN: urn:nbn:se:uu:diva-391490
OAI: oai:DiVA.org:uu-391490
DiVA, id: diva2:1347050
External cooperation
Nordron AB; Science for Life Laboratory (SciLifeLab); Chemical Biology Consortium Sweden (CBCS)
Subject / course
Information Systems
Educational program
Master Programme in Engineering Physics
Supervisors
Examiners
Available from: 2019-09-04. Created: 2019-08-29. Last updated: 2019-09-04. Bibliographically approved.

Open Access in DiVA

fulltext (5786 kB), 3274 downloads
File information
File name: FULLTEXT02.pdf
File size: 5786 kB
Checksum SHA-512:
235a6bc83bcd144f6fd0b83be346a464fd47064d19f5b2634fa6f1dc1c0f51dbf80af9abb10f0b4a01bfc592e0104c583593e2402a0280c1a67c72ab87c673b8
Type: fulltext
Mimetype: application/pdf

Total: 3276 downloads
The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.

Total: 351 hits