Digitala Vetenskapliga Arkivet

Automating Data Extraction from Documents Using Large Language Models: A Study Exploring How AI Can Be Used to Transform Unstructured Data into Structured Formats
Mid Sweden University, Faculty of Science, Technology and Media, Department of Computer and Electrical Engineering (2023-).
2024 (English). Independent thesis, Basic level (degree of Bachelor), 10 credits / 15 HE credits. Student thesis.
Abstract [sv]

Målet med detta examensarbete är att extrahera minst 70 % av externa referenser från ett av tre testdokument och att jämföra resultaten mellan två stora språkmodeller (LLM) med kvantitativa metoder. Detta mättes genom att utvärdera både antalet korrekt identifierade referenser och likheten hos dessa referenser jämfört med det förväntade resultatet. Processen börjar med att extrahera text från ett PDF-dokument, följt av att dela upp texten i meningar. “Embeddings” genereras sedan för varje mening. “Cosine similarity” utförs på dessa “embeddings” för att filtrera bort meningar som inte innehåller den begärda informationen. De återstående meningarna bearbetas med två OpenAI-modeller, gpt-3.5-turbo-0125 och gpt-4-turbo-2024-04-09, som nås via deras API. Varje modell får sedan en detaljerad instruktion att extrahera externa referenser från de filtrerade meningarna. De extraherade referenserna jämförs sedan mot det förväntade utdatat på två sätt: med antalet korrekt identifierade referenser och med detaljnivån i de extraherade referenserna. Modellen gpt-4-turbo-2024-04-09 extraherade framgångsrikt 42 av 43 referenser, där 41 var optimala och den återstående saknade viss information. Modellen gpt-3.5-turbo-0125 extraherade 41 av 43 referenser, där 31 matchade det förväntade utdatat perfekt. Dessa resultat visar potentialen hos stora språkmodeller för att extrahera data från ostrukturerade källor med hög träffsäkerhet.

Abstract [en]

The objective of this thesis is to accurately extract at least 70% of external references from one of three test documents and to compare the performance of two Large Language Models (LLMs) using quantitative methods. This is measured by evaluating both the number of identified references and how closely those references match the expected output. The process begins with extracting text from a PDF document, followed by dividing the text into sentences. Embeddings are then generated for each sentence. Cosine similarity is computed on these embeddings to filter out sentences that do not contain the requested data. The remaining sentences are processed using two OpenAI models, gpt-3.5-turbo-0125 and gpt-4-turbo-2024-04-09, accessed via their API. Each model is instructed to extract external references from the filtered sentences. The extracted references are then compared against the expected outputs in two ways: by the number of correctly identified references and by the level of detail in the extracted references. The gpt-4-turbo-2024-04-09 model successfully extracted 42 out of 43 references, with 41 being optimal and the remaining one missing some information. The gpt-3.5-turbo-0125 model extracted 41 out of 43 references, with 31 matching the expected output perfectly. These results demonstrate the potential of Large Language Models in accurately extracting data from unstructured sources.
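The filtering stage described in the abstract (sentence embeddings compared by cosine similarity against a query, so that only relevant sentences reach the LLM) can be sketched as follows. This is a minimal illustration, not the thesis's actual implementation: toy bag-of-words vectors stand in for real embedding-model output, and the function names, example sentences, and the 0.3 threshold are all assumptions made here for demonstration.

```python
import re
import numpy as np

def tokenize(sentence: str) -> list:
    # Lowercase and keep only alphanumeric tokens.
    return re.findall(r"[a-z0-9]+", sentence.lower())

def embed(sentence: str, vocab: dict) -> np.ndarray:
    # Toy bag-of-words vector; a real pipeline would instead request
    # an embedding from a model (e.g. an embeddings API endpoint).
    vec = np.zeros(len(vocab))
    for word in tokenize(sentence):
        if word in vocab:
            vec[vocab[word]] += 1.0
    return vec

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    denom = float(np.linalg.norm(a) * np.linalg.norm(b))
    return float(a @ b) / denom if denom else 0.0

def filter_sentences(sentences, query, threshold=0.3):
    # Keep only sentences whose embedding is similar enough to the
    # query, so irrelevant text never reaches the costly LLM step.
    words = {w for s in sentences + [query] for w in tokenize(s)}
    vocab = {w: i for i, w in enumerate(sorted(words))}
    q = embed(query, vocab)
    return [s for s in sentences
            if cosine_similarity(embed(s, vocab), q) >= threshold]

sentences = [
    "See reference [12] for the external specification.",
    "The weather was pleasant during the study.",
    "External reference: ISO 9001 quality standard.",
]
kept = filter_sentences(sentences, "external reference specification standard")
print(kept)  # the off-topic weather sentence is filtered out
```

In the thesis pipeline, the surviving sentences would then be passed, together with a detailed extraction instruction, to the OpenAI chat models named in the abstract.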

Place, publisher, year, edition, pages
2024, p. 59
Keywords [en]
Large Language Models, data extraction, unstructured sources, embeddings, cosine similarity
Keywords [sv]
Stora språkmodeller, dataextrahering, ostrukturerad data, embeddings, cosine similarity
National Category
Software Engineering
Identifiers
URN: urn:nbn:se:miun:diva-52575
Local ID: DT-V24-G3-050
OAI: oai:DiVA.org:miun-52575
DiVA id: diva2:1900517
Subject / course
Computer Engineering DT1
Educational program
Computer Science TDATG 180 higher education credits
Available from: 2024-09-24. Created: 2024-09-24. Last updated: 2024-09-24. Bibliographically approved.

Open Access in DiVA

fulltext (1300 kB), 1151 downloads
File information
File name: FULLTEXT01.pdf
File size: 1300 kB
Checksum (SHA-512): 3074fe2bf9bff52ee8d75d0da6d036d38fe4a05d931e9e5fc90bbcaf5c00b65d028a30676c201512678564485cbbb94c5d010a8baecc829c1626c2bd116ad483
Type: fulltext
Mimetype: application/pdf

Search in DiVA

By author/editor
Persson, Lucas
By organisation
Department of Computer and Electrical Engineering (2023-)
Software Engineering

Total: 1151 downloads
The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.

Total: 1279 hits