Digitala Vetenskapliga Arkivet

Generating synthetic training text from Swedish Electronic Health Records
Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
2024 (English). Independent thesis, Basic level (degree of Bachelor), 10 credits / 15 HE credits. Student thesis.
Abstract [en]

Electronic Health Records contain valuable data but, due to its inherently private nature, this data is difficult to share. Electronic Health Records usually contain patient notes in free text, making them an interesting application domain for Natural Language Processing. Synthetic data generation is a proposed application of Natural Language Processing in which language models trained on data from an Electronic Health Record could serve as a replacement for the original dataset while avoiding privacy issues, thus allowing more data to be shared for research. This thesis explores generating synthetic training text for downstream tasks and evaluates the utility of the generated data. Based on the research question, "Will an NER model trained on a synthetic dataset perform as well as an NER model trained on a pseudonymized corpus?", we formulate three hypotheses. The goal of the thesis is to evaluate whether a Named Entity Recognition (NER) model trained on synthetic data can perform as well as an NER model trained on pseudonymized data, thereby indicating the utility of the synthetic dataset and whether it can adequately replace the pseudonymized dataset. The thesis is conducted as an experiment in which the generated data is evaluated on its n-gram overlap with the source data and on its performance in an NER downstream task. Two NER models are trained, one on the synthetic dataset and one on the pseudonymized data. These models are then evaluated on a manually annotated dataset used as a gold standard. The generated synthetic dataset, the Stockholm EPR Gastro Synthetic Corpus, while showing some promise in being linguistically coherent with only minor spelling and grammatical errors, does not adequately capture the variance of the pseudonymized data. The synthetic dataset has a higher 8-gram overlap with its pseudonymized counterpart than reported in similar studies, indicating a higher risk of leaking data. The NER model trained on the synthetic dataset performs worse on almost every metric than the NER model trained on the pseudonymized dataset. The experiment suggests that the synthetic training data cannot adequately replace the pseudonymized data.

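The 8-gram overlap check described in the abstract can be illustrated with a minimal sketch. The thesis does not specify its tokenization or counting scheme, so the whitespace tokenization, the set-based overlap definition, and the stand-in example strings below are assumptions for illustration only:

    def ngrams(tokens, n):
        # All contiguous n-grams of a token list, as tuples.
        return zip(*(tokens[i:] for i in range(n)))

    def ngram_overlap(synthetic_docs, source_docs, n=8):
        # Fraction of distinct n-grams in the synthetic corpus that also
        # occur in the source corpus; a high value suggests the generator
        # reproduces long spans verbatim, i.e. a higher risk of leaking data.
        synth = {g for doc in synthetic_docs for g in ngrams(doc.split(), n)}
        source = {g for doc in source_docs for g in ngrams(doc.split(), n)}
        return len(synth & source) / len(synth) if synth else 0.0

    # Hypothetical usage with stand-in corpora (not data from the thesis):
    synthetic = ["pat mår bra efter ingreppet och skrivs ut i morgon utan anmärkning"]
    pseudonymized = ["pat mår bra efter ingreppet och skrivs ut i dag utan anmärkning"]
    print(f"8-gram overlap: {ngram_overlap(synthetic, pseudonymized, n=8):.2%}")

Set-based overlap is only one of several reasonable definitions; counting repeated n-grams, or normalizing by the source side instead of the synthetic side, would give different numbers, so overlap figures are comparable across studies only when the definition matches.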
Place, publisher, year, edition, pages
2024.
Keywords [en]
Synthetic data generation, Natural Language Processing, Transformers, BART, Electronic Health Records
National Category
Natural Language Processing
Identifiers
URN: urn:nbn:se:su:diva-242730
OAI: oai:DiVA.org:su-242730
DiVA id: diva2:1955662
Available from: 2025-04-30. Created: 2025-04-30.

Open Access in DiVA

fulltext (719 kB), 15 downloads
File information
File name: FULLTEXT01.pdf
File size: 719 kB
Checksum: SHA-512
dbdf8a10d25b5d005a2e08570749a5029e82cf84778c45a799309366643fb2106a1f76c53ade7d3c6505665b685f66571042f52c62f583827db8373a7543a8c3
Type: fulltext
Mimetype: application/pdf

Search in DiVA

By author/editor
Hullmann, Tyr
Hansson, Martin
By organisation
Department of Computer and Systems Sciences
Natural Language Processing
