Evaluating Utility and Privacy in Synthetic Data Derived from Structured Swedish Electronic Patient Records
2024 (Swedish) Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits
Student thesis
Abstract [en]
Background
The use of Electronic Health Records (EHRs) in healthcare has facilitated the exchange and analysis of patient information, which enhances both patient care and medical research. However, using real patient data to train Artificial Intelligence (AI) systems introduces privacy risks. Synthetic data has emerged as a way to address both these privacy challenges and the restricted availability of data.
Aim
This thesis assesses the Synthetic Data Vault tool's capability to generate synthetic data from a Swedish EHR dataset, focusing on its ability to balance privacy with data utility.
Method
This study focused on generating synthetic datasets for use in Adverse Drug Event detection models, using the Conditional Tabular Generative Adversarial Network (CTGAN), Tabular Variational Autoencoder (TVAE), and GaussianCopula synthesizers. Evaluation was conducted with the SynthEval Python package, and the most resilient datasets were further tested against adversarial attacks via the Anonymeter Python package. Downstream performance was assessed by comparing the synthetic data to the original datasets using Random Forest models.
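The downstream comparison described above can be illustrated with a minimal sketch: train one model on real data and one on synthetic data, then score both on the same held-out real records. The sketch is not the thesis code. To stay runnable with the Python standard library alone, it swaps in a nearest-centroid classifier for the Random Forest models and a hypothetical per-class Gaussian sampler (`toy_synthesize`) for the CTGAN, TVAE, and GaussianCopula synthesizers. The toy dataset and all function names are assumptions made for illustration.

```python
import random
import statistics


def make_real_data(n, rng):
    """Toy stand-in for a labelled table: two numeric features, imbalanced binary label."""
    rows = []
    for _ in range(n):
        label = 1 if rng.random() < 0.3 else 0  # roughly 30% positive class
        center = 2.0 if label else 0.0
        rows.append(([rng.gauss(center, 1.0), rng.gauss(center, 1.0)], label))
    return rows


def toy_synthesize(rows, n, rng):
    """Hypothetical stand-in synthesizer (NOT CTGAN/TVAE/GaussianCopula):
    samples each feature from a Gaussian fitted per class."""
    by_label = {}
    for x, y in rows:
        by_label.setdefault(y, []).append(x)
    # Per class: (mean, standard deviation) of every feature column.
    stats = {
        y: [(statistics.mean(col), statistics.stdev(col)) for col in zip(*xs)]
        for y, xs in by_label.items()
    }
    labels = [y for _, y in rows]
    synthetic = []
    for _ in range(n):
        y = rng.choice(labels)  # keeps class proportions in expectation
        synthetic.append(([rng.gauss(mu, sd) for mu, sd in stats[y]], y))
    return synthetic


def fit_centroids(rows):
    """Nearest-centroid 'training': one mean vector per class
    (stand-in for fitting a Random Forest)."""
    by_label = {}
    for x, y in rows:
        by_label.setdefault(y, []).append(x)
    return {y: [statistics.mean(col) for col in zip(*xs)] for y, xs in by_label.items()}


def predict(centroids, x):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    return min(centroids, key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])))


def accuracy(centroids, rows):
    return sum(predict(centroids, x) == y for x, y in rows) / len(rows)


rng = random.Random(0)
real = make_real_data(600, rng)
real_train, real_test = real[:400], real[400:]
synthetic_train = toy_synthesize(real_train, 400, rng)

# Both models are scored on the same held-out REAL records.
acc_real = accuracy(fit_centroids(real_train), real_test)
acc_synth = accuracy(fit_centroids(synthetic_train), real_test)
print(f"trained on real:      {acc_real:.3f}")
print(f"trained on synthetic: {acc_synth:.3f}")
```

A small gap between the two scores suggests the synthetic data preserves the signal the downstream model needs. With imbalanced labels, as in this toy data, a metric such as F1 or recall on the minority class would usually be more informative than plain accuracy.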
Results
The evaluation showed that TVAE struggled with imbalanced training data, while GaussianCopula excelled in privacy at the expense of utility. CTGAN struck a balance between utility and privacy. The Anonymeter tests further confirmed the synthetic datasets’ robustness against common privacy attacks.
Conclusion
The study concludes that synthetic data, specifically from CTGAN, can serve as a viable alternative to real data in contexts where privacy is a paramount concern, such as in training AI systems with EHRs. It also highlights the importance of tailored synthetic data solutions depending on the requirements of the use case and the characteristics of the original data.
Place, publisher, year, edition, pages
2024.
Keywords [en]
electronic health records, synthetic data, CTGAN, TVAE, GaussianCopula, Synthetic Data Vault
National Category
Information Systems, Social aspects
Identifiers
URN: urn:nbn:se:su:diva-242706
OAI: oai:DiVA.org:su-242706
DiVA, id: diva2:1955597
2025-04-30 2025-04-30