Improving Access to German Health Claims Data Through Synthetic Data Generation: Findings and Achievements of a Holistic Evaluation
Author(s)
Tobias Heidler, Staatsexamen Pharmazie1, Michael Schultze, Dr., Staatsexamen Tiermedizin2, George Kafatos, PhD, MSc3, Bagmeet Behera, PhD, MSc4, Caroline Lienau, MSc5, Alexander Franz Unger, Dr., Mag.5, Valentina Balko, Dr.5, Julius Brandenburg, PhD, Diplom Biologie5, Lea Grotenrath, MSc5, Zhenchen Wang, PhD6, Philipp Großer, MSc7, Adam Hilbert, MSc8, Nils Kossack, Dipl.-Math.1, Marc Pignot, PhD, MSc2.
1WIG2 GmbH – Scientific Institute for Health Economics and Health System Research, Leipzig, Germany, 2ZEG – Berlin Center for Epidemiology and Health Research GmbH, Berlin, Germany, 3Amgen Limited, Uxbridge, United Kingdom, 4Amgen Research (Munich) GmbH, Munich, Germany, 5AstraZeneca GmbH, Hamburg, Germany, 6Medicines and Healthcare products Regulatory Agency (MHRA), London, United Kingdom, 7Limebit GmbH, Berlin, Germany, 8ai4medicine UG, Berlin, Germany.
OBJECTIVES: Synthetic data can be leveraged to improve access to valuable real-world data without compromising patient privacy. After previously developing a holistic evaluation framework, this study explores and evaluates various synthetic data generation approaches based on a longitudinal relational German health claims database.
METHODS: Data were sourced from the WIG2 benchmark database, a longitudinal German health claims dataset. A cohort of patients with systemic lupus erythematosus (SLE) was selected to serve as a clinically complex case study. Synthetic datasets were generated using Generative Adversarial Networks (GANs), Adversarial Random Forests (ARFs), and two Bayesian Networks (BNs), trained on off-the-shelf hardware. Evaluation focused on privacy, scalability, fidelity, and utility.
RESULTS: A cohort of 6,743 SLE patients was used for training. All methods generated privacy-preserving data that were free of duplicate records and demonstrated strong resistance to privacy attacks. Some implementations processed the full dataset efficiently, whereas others required significant data simplification before training, which limited scalability. Medium fidelity was achieved, with the degree of success varying by method. Standardized mean differences in univariate distributions ranged from perfect alignment (0.0) to substantial discrepancies (7.0). Regarding utility, all synthesized datasets were capable of supporting basic analysis script development, albeit with specific limitations in their implementation. More complex real-world evidence (RWE) analyses, such as disease prevalence, treatment patterns, healthcare resource utilization, and temporal analyses, demonstrated varying levels of robustness.
CONCLUSIONS: This study demonstrates that synthetic data methods provide a promising approach to enhance access to German health claims data while maintaining privacy, making them attractive for privacy-sensitive applications. While fidelity remains a significant challenge, careful selection and tailoring of synthetic datasets can mitigate some limitations, enabling scripting and analysis development in specific scenarios. Careful consideration is required when using these synthetic datasets for generating real-world evidence.
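ILLUSTRATION: The standardized mean difference reported under RESULTS can be sketched as follows. This is a minimal example under stated assumptions, not the study's implementation: the abstract does not specify which SMD variant was used, so the common pooled-standard-deviation form is assumed here, and the function name and toy values are hypothetical.

```python
import math
from statistics import mean, variance


def standardized_mean_difference(real, synthetic):
    """Absolute SMD between a real and a synthetic numeric variable.

    Assumed definition (not confirmed by the abstract):
        |mean_real - mean_syn| / sqrt((var_real + var_syn) / 2)
    using sample variances.
    """
    pooled_sd = math.sqrt((variance(real) + variance(synthetic)) / 2)
    diff = abs(mean(real) - mean(synthetic))
    if pooled_sd == 0:
        # Both variables are constant: identical means give 0, otherwise undefined/infinite.
        return 0.0 if diff == 0 else math.inf
    return diff / pooled_sd


# Hypothetical toy values, e.g. number of outpatient visits per patient.
real_visits = [1, 2, 3, 4, 5]
synthetic_visits = [2, 3, 4, 5, 6]

smd_identical = standardized_mean_difference(real_visits, real_visits)
smd_shifted = standardized_mean_difference(real_visits, synthetic_visits)
print(round(smd_identical, 3), round(smd_shifted, 3))
```

Under this definition a value of 0.0 means the real and synthetic means coincide, while larger values indicate that the means differ by that many pooled standard deviations.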
Conference/Value in Health Info
2025-11, ISPOR Europe 2025, Glasgow, Scotland
Value in Health, Volume 28, Issue S2
Code
RWD100
Topic
Methodological & Statistical Research, Real World Data & Information Systems
Topic Subcategory
Data Protection, Integrity, & Quality Assurance, Distributed Data & Research Networks
Disease
No Additional Disease & Conditions/Specialized Treatment Areas, Systemic Disorders/Conditions (Anesthesia, Auto-Immune Disorders (n.e.c.), Hematological Disorders (non-oncologic), Pain)