Digitala Vetenskapliga Arkivet

Evaluating Utility and Privacy in Synthetic Data Derived from Structured Swedish Electronic Patient Records
Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
2024 (Swedish). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis.
Abstract [en]

Background

The use of Electronic Health Records (EHRs) in healthcare has facilitated the exchange and analysis of patient information, which enhances both patient care and medical research. However, the use of real patient data for training Artificial Intelligence (AI) systems introduces privacy risks. Synthetic data has emerged as a way to address these privacy challenges and the restrictions they place on data availability.

Aim

This thesis assesses the capabilities of the Synthetic Data Vault tool in generating synthetic data from a Swedish EHR dataset, focusing on its ability to balance privacy with data utility.

Method

This study generated synthetic datasets for use in models that detect Adverse Drug Events, using the Conditional Tabular Generative Adversarial Network (CTGAN), Tabular Variational Autoencoder (TVAE), and GaussianCopula synthesizers. Evaluation was conducted with the SynthEval Python package, and the most resilient datasets were further tested against adversarial attacks using the Anonymeter Python package. Downstream performance was assessed by comparing the synthetic data to the original datasets using Random Forest models.
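
The downstream comparison can be illustrated with a small sketch. It assumes a "train on synthetic, test on real" protocol, which the abstract does not spell out, and it uses toy data and a hypothetical `naive_synthesizer` (per-class Gaussian sampling) as a stand-in for the CTGAN, TVAE, and GaussianCopula synthesizers. None of the names, sizes, or parameters below come from the thesis.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy stand-in for a structured, imbalanced dataset (NOT the thesis data):
# 5 numeric features, roughly 20% positive labels.
n = 2000
y = (rng.random(n) < 0.2).astype(int)
X = rng.normal(size=(n, 5)) + y[:, None] * 1.5

# Hold out real records; both models are scored on this same real test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)


def naive_synthesizer(X_real, y_real, rng):
    """Hypothetical stand-in generator: samples each class from a Gaussian
    fitted to that class. A real study would use CTGAN, TVAE, or
    GaussianCopula here instead."""
    X_parts, y_parts = [], []
    for label in np.unique(y_real):
        rows = X_real[y_real == label]
        mean = rows.mean(axis=0)
        cov = np.cov(rows, rowvar=False)
        X_parts.append(rng.multivariate_normal(mean, cov, size=len(rows)))
        y_parts.append(np.full(len(rows), label))
    return np.vstack(X_parts), np.concatenate(y_parts)


X_syn, y_syn = naive_synthesizer(X_train, y_train, rng)


def fit_and_score(X_fit, y_fit):
    """Train a Random Forest on the given data and score it on real test data."""
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_fit, y_fit)
    return f1_score(y_test, model.predict(X_test))


f1_real = fit_and_score(X_train, y_train)
f1_syn = fit_and_score(X_syn, y_syn)
print(f"F1 trained on real data:      {f1_real:.3f}")
print(f"F1 trained on synthetic data: {f1_syn:.3f}")
```

A small gap between the two scores would suggest that the synthetic data preserves the signal the downstream model needs. It says nothing about privacy, which has to be measured separately.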

Results

The evaluation showed that TVAE struggled with imbalanced training data, while GaussianCopula excelled in privacy at the expense of utility. CTGAN struck a balance between utility and privacy. The Anonymeter tests further confirmed the synthetic datasets’ robustness against common privacy attacks.

Conclusion

The study concludes that synthetic data, specifically from CTGAN, can serve as a viable alternative to real data in contexts where privacy is a paramount concern, such as in training AI systems with EHRs. It also highlights the importance of tailored synthetic data solutions depending on the requirements of the use case and the characteristics of the original data.

Place, publisher, year, edition, pages
2024.
Keywords [en]
electronic health records, synthetic data, CTGAN, TVAE, GaussianCopula, Synthetic Data Vault
National Category
Information Systems, Social aspects
Identifiers
URN: urn:nbn:se:su:diva-242706
OAI: oai:DiVA.org:su-242706
DiVA, id: diva2:1955597
Available from: 2025-04-30 Created: 2025-04-30

Open Access in DiVA

fulltext (1386 kB), 19 downloads
File information
File name: FULLTEXT01.pdf
File size: 1386 kB
Checksum (SHA-512): b278a913ef904d1f0f2b5bfa80953b51807e57108f258ab82965aaefce9ed415e2280394b0ece9f603d654f369367c8c8581b14eda962597d8a454fbd3080903
Type: fulltext
Mimetype: application/pdf

Search in DiVA

By author/editor
Dinh, Thu
By organisation
Department of Computer and Systems Sciences
Information Systems, Social aspects
