Privacy and explainability in Healthcare AI: Synthetic data generation from Swedish patient records
2024 (English) Independent thesis Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits
Student thesis
Abstract [en]
The digitalization of the healthcare sector has led to an exponential increase in electronic health record (EHR) data. However, leveraging EHR data for training artificial intelligence models while preserving patient privacy poses a significant challenge. De-identification and synthetic data generation are some of the strategies employed to mitigate the privacy risk. Despite advancements in synthetic data generation, a significant gap exists in understanding the explainability of AI models trained on synthetic data. Ensuring patient privacy, as well as the transparency and interpretability of AI models, is paramount in highly critical medical decision-making processes.
This thesis addresses the research question "How can synthetic tabular data be generated from Swedish patients’ electronic health records, preserving privacy and ensuring transparency and explainability of the AI model?"
Applying the design science research framework, the research generates synthetic Adverse Drug Event (ADE) datasets from Swedish patients’ EHR data using CTGAN. The synthetic data is evaluated for privacy preservation and utility using SynthEval. During privacy evaluation, one of the synthetic datasets exhibited an epsilon identifiability risk of 0.16. Random forest classifiers were trained on the synthetic and original datasets, and performance estimates were generated for comparative analysis. The classifiers trained on the synthetic data exhibited strong performance. SHAP explanations were generated by XAI models trained on synthetic and original data. A comparative analysis of these SHAP explanations demonstrated consistent similarity. The similarity of the SHAP explanations was quantified using Gower distance.
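As an illustration of the final step, the following is a minimal sketch of Gower distance between two SHAP explanation vectors. It assumes purely numeric attributions and range normalization over the pool of explanations being compared. The function names and the toy values are hypothetical and are not taken from the thesis, whose exact procedure is not described in this abstract.

```python
from typing import List, Sequence


def feature_ranges(rows: Sequence[Sequence[float]]) -> List[float]:
    """Per-feature range (max - min) over a pool of explanation vectors."""
    return [max(col) - min(col) for col in zip(*rows)]


def gower_distance(
    a: Sequence[float], b: Sequence[float], ranges: Sequence[float]
) -> float:
    """Gower distance between two numeric vectors.

    Each feature contributes |a_k - b_k| / range_k and the result is the
    mean over features, so it lies in [0, 1] when both vectors fall
    inside the supplied ranges.
    """
    if not (len(a) == len(b) == len(ranges)) or len(a) == 0:
        raise ValueError("a, b and ranges must be non-empty and equally long")
    total = 0.0
    for x, y, r in zip(a, b, ranges):
        # A constant feature (range 0) cannot separate the two vectors.
        total += abs(x - y) / r if r > 0 else 0.0
    return total / len(a)


# Hypothetical SHAP values for three features (illustrative numbers only).
pool = [
    [0.4, -0.2, 0.0],  # e.g. explanation from a model trained on original data
    [0.0, 0.2, 0.1],   # e.g. explanation from a model trained on synthetic data
    [0.2, 0.0, 0.2],   # a further explanation, used only to set the ranges
]
ranges = feature_ranges(pool)
distance = gower_distance(pool[0], pool[1], ranges)
print(f"Gower distance: {distance:.3f}")
```

A value near 0 indicates that the two explanations attribute the prediction to the features in nearly the same way, while a value near 1 indicates that they differ by close to the full observed range on every feature.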
The research highlights the efficacy of CTGAN in generating tabular synthetic data from Swedish patients’ EHR data while preserving patient privacy and ensuring transparency and explainability of AI models. Although this research focused on the ADE dataset for L270 due to time and resource constraints, future investigations could extend to other adverse drug events. Another avenue for further study involves generating synthetic data for positive and negative classes separately and utilizing other XAI methods for generating explanations.
Place, publisher, year, edition, pages
2024.
Keywords [en]
Synthetic data, EHR, CTGAN, ADE, XAI, SHAP
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:su:diva-242795 OAI: oai:DiVA.org:su-242795 DiVA, id: diva2:1955728
2025-04-30 2025-04-30