Evaluating the Potential of Synthetic Patient Data Generation to Accelerate Real-World Evidence (RWE) Generation

Speaker(s)

Törnqvist M, Dry L, Pinon G, Movschin A
Quinten Health, Paris, France

OBJECTIVES: The rise of machine learning methods in medical research requires large-scale, high-quality patient data. However, concerns regarding privacy, cost, and availability limit access to such data. Synthetic data that mimics real-world data (RWD) is emerging as a promising solution and is increasingly considered by the pharmaceutical industry. This approach enables the generation of customized synthetic patient datasets of various sizes, without some of the limitations of RWD such as missing values and class imbalance. Recently, deep learning methods such as Generative Adversarial Networks (GANs) have demonstrated remarkable performance in generating reference RWD, particularly in the field of economics. This study evaluated GANs for synthesizing electronic health records (EHRs).

METHODS: MIMIC-III, a publicly accessible database of EHRs from intensive care, was used to train two GANs designed specifically for synthesizing tabular data: CTGAN and CTABGAN. CTGAN addresses class imbalance by incorporating conditional generation, while CTABGAN can model a mixture of continuous and categorical variables through innovative data encoding. The synthetic data were then evaluated for fidelity, privacy, and correlation with the original data using statistical measures and comparative visualizations of data distributions.
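The abstract does not name the statistical measures used, so the following is only a minimal sketch of how the three evaluation axes could be scored, under stated assumptions. It assumes purely numeric columns with non-zero variance and uses one common choice per axis: a two-sample Kolmogorov-Smirnov statistic for fidelity, the largest gap between Pearson correlation matrices for correlation, and distance to the closest real record for privacy. All function names are hypothetical, and the toy Gaussian arrays stand in for real and synthetic tables; they are not MIMIC-III data.

```python
import numpy as np


def ks_statistic(real_col, synth_col):
    """Fidelity: two-sample Kolmogorov-Smirnov statistic (0 means identical empirical CDFs)."""
    grid = np.sort(np.concatenate([real_col, synth_col]))
    cdf_real = np.searchsorted(np.sort(real_col), grid, side="right") / len(real_col)
    cdf_synth = np.searchsorted(np.sort(synth_col), grid, side="right") / len(synth_col)
    return float(np.max(np.abs(cdf_real - cdf_synth)))


def correlation_gap(real, synth):
    """Correlation: largest absolute difference between the two Pearson correlation matrices."""
    diff = np.corrcoef(real, rowvar=False) - np.corrcoef(synth, rowvar=False)
    return float(np.max(np.abs(diff)))


def distance_to_closest_record(real, synth):
    """Privacy: Euclidean distance from each synthetic row to its nearest real row.

    Columns are standardized with the real data's mean and standard deviation
    (assumed non-zero). Distances near zero suggest copied records.
    """
    mu, sd = real.mean(axis=0), real.std(axis=0)
    r = (real - mu) / sd
    s = (synth - mu) / sd
    pairwise = np.linalg.norm(s[:, None, :] - r[None, :, :], axis=2)
    return pairwise.min(axis=1)


# Toy stand-ins for a real table and a generator's output (assumption: 3 numeric columns).
rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.6, 0.0], [0.6, 1.0, 0.3], [0.0, 0.3, 1.0]])
real = rng.multivariate_normal(np.zeros(3), cov, size=200)
synth = rng.multivariate_normal(np.zeros(3), cov, size=200)

fidelity = [ks_statistic(real[:, j], synth[:, j]) for j in range(real.shape[1])]
corr_gap = correlation_gap(real, synth)
dcr = distance_to_closest_record(real, synth)
print("KS per column:", np.round(fidelity, 3))
print("max correlation gap:", round(corr_gap, 3))
print("median distance to closest record:", round(float(np.median(dcr)), 3))
```

Lower KS values and a smaller correlation gap indicate higher fidelity, whereas a distribution of closest-record distances concentrated at zero would indicate that the generator is reproducing real patients. Categorical EHR variables would need a different treatment (for example, comparing category frequencies), which this sketch leaves out.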

RESULTS: This study highlighted the potential of GAN-based deep learning approaches for generating synthetic patient data. The evaluation of two GANs on MIMIC-III demonstrated their ability to produce realistic synthetic health data while preserving privacy. However, GANs require large datasets and significant computational resources, and they can be difficult to train to convergence.

CONCLUSIONS: There is a need for consensus on the evaluation of synthetic data among researchers, regulators, and the pharmaceutical industry. The level and quantity of evidence required to consider synthetic data reliable and validated for practical use depend on the judgment criteria and the intended use. For instance, data augmentation to improve model performance and regulatory-grade synthetic control arms may have different validation requirements.

Code

MSR138

Topic

Methodological & Statistical Research

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics

Disease

Injury & Trauma, No Additional Disease & Conditions/Specialized Treatment Areas