Experimental Evaluation of a Machine-Learning Method for Generating Synthetic Patient Data for Applications in Health Economics and Outcomes Research

Speaker(s)

Chebuniaev I1, Aballea S2, Toumi M3
1InovIntell, Tbilisi, Georgia, 2InovIntell, Rotterdam, South Holland, Netherlands, 3Aix-Marseille University, Marseille, France

OBJECTIVES: Synthetic data are increasingly used across several industries, but their potential in health economics and outcomes research (HEOR) has yet to be explored. Our objective was to develop and validate a machine-learning method for building synthetic data generators from clinical trial data.

METHODS: Our generative model architecture combines variational autoencoders and neural ordinary differential equations. The current version accepts continuous and discrete baseline variables and continuous variables for follow-up visits. Data can be generated for follow-up visits conditional on baseline values. The model was trained on data from a clinical trial in diabetic macular oedema (DME) with 660 patients. The validity of the generated data was assessed in the overall sample and in subgroups by comparing the marginal distributions of variables between synthetic and original data with Kolmogorov-Smirnov (KS) tests, and by comparing Spearman correlations. To assess the model's generalizability, the root mean squared error (RMSE) was computed using 4-fold cross-validation. Statistical tests were interpreted at a 5% significance level.
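
The two validity checks named above can be illustrated with a minimal Python sketch. It is not the authors' code: the variable names (`age`, `baseline_acuity`), the toy data generator, and the helper names are assumptions made for illustration, and the KS p-value uses the plain large-sample approximation. In practice a library routine such as `scipy.stats.ks_2samp` or `scipy.stats.spearmanr` would normally be preferred.

```python
import math
import random
from bisect import bisect_right


def ks_two_sample(x, y):
    """Two-sample KS statistic D and its large-sample (asymptotic) p-value."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    # D is the largest gap between the two empirical CDFs.
    d = max(abs(bisect_right(xs, v) / n - bisect_right(ys, v) / m) for v in xs + ys)
    lam = d * math.sqrt(n * m / (n + m))
    if lam < 0.2:  # the alternating series converges poorly here; p is ~1
        return d, 1.0
    p = 2 * sum((-1) ** (k - 1) * math.exp(-2 * (k * lam) ** 2) for k in range(1, 101))
    return d, min(max(p, 0.0), 1.0)


def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def mean_abs_corr_diff(original, synthetic):
    """Average |rho_original - rho_synthetic| over all pairs of variables."""
    names = sorted(original)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    diffs = [abs(spearman(original[a], original[b]) - spearman(synthetic[a], synthetic[b]))
             for a, b in pairs]
    return sum(diffs) / len(diffs)


def toy_sample(rng, n):
    """Hypothetical stand-in for a dataset: two correlated baseline variables."""
    age = [rng.gauss(62, 10) for _ in range(n)]
    acuity = [55 - 0.3 * (a - 62) + rng.gauss(0, 8) for a in age]
    return {"age": age, "baseline_acuity": acuity}


rng = random.Random(0)
original = toy_sample(rng, 300)   # plays the role of the trial data
synthetic = toy_sample(rng, 300)  # plays the role of the generator output

for name in original:
    d, p = ks_two_sample(original[name], synthetic[name])
    print(f"{name}: D={d:.3f}, p={p:.3f}, not rejected at 5%: {p >= 0.05}")
print(f"mean absolute difference in Spearman rho: {mean_abs_corr_diff(original, synthetic):.3f}")
```

For a subgroup analysis, the same functions would simply be applied to the rows of each subgroup (for example, each treatment arm) separately.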

RESULTS: KS tests indicated similar distributions between original and synthetic data (no significant difference at the 5% level) for all variables in the full dataset, and for 96% of variables when the data were split by treatment arm. The average absolute difference in Spearman correlation coefficients between original and synthetic data was about 0.1. The average distance between the predicted and true sample mean visual acuity score, in a test set of patients not used for model training, measured by RMSE, was 1.0 (on a 0-100 scale), comparable to the standard error of the true mean at a given visit (>0.6).
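
One possible reading of this generalizability check is sketched below in Python. The abstract does not specify exactly how the RMSE was aggregated, so the sketch assumes it is taken across visits between the predicted and true sample means of a held-out fold. The helper names, visit labels, and acuity values are invented for illustration and are not trial data.

```python
import math
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle patient indices and deal them into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def standard_error(values):
    """Standard error of the sample mean (sample SD divided by sqrt(n))."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return math.sqrt(variance / n)


def rmse_of_means(true_by_visit, predicted_by_visit):
    """RMSE across visits between predicted and true sample means."""
    squared_errors = []
    for visit, true_values in true_by_visit.items():
        predicted_values = predicted_by_visit[visit]
        true_mean = sum(true_values) / len(true_values)
        predicted_mean = sum(predicted_values) / len(predicted_values)
        squared_errors.append((predicted_mean - true_mean) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# With 660 patients and 4 folds, each held-out test fold has 165 patients.
folds = k_fold_indices(660, 4)
print([len(fold) for fold in folds])

# Invented acuity scores for one held-out fold (true) and for data generated
# by a model trained on the other folds (predicted).
true_scores = {"visit_1": [60, 62, 64], "visit_2": [65, 67, 69]}
predicted_scores = {"visit_1": [61, 63, 65], "visit_2": [64, 66, 68]}

print("RMSE of means:", rmse_of_means(true_scores, predicted_scores))
for visit, values in true_scores.items():
    print(visit, "standard error of true mean:", round(standard_error(values), 3))
```

Setting the RMSE next to the per-visit standard error puts the prediction error on the same scale as the sampling uncertainty of the true mean, which is the comparison the result above relies on.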

CONCLUSIONS: The model was able to generate valid synthetic data in DME. Examples of possible applications in HEOR include data anonymization, synthetic control arms, indirect treatment comparisons, patient-level simulation, and linking synthetic datasets when linking original data is not authorized. Research to assess the model's ability to extrapolate over time is ongoing.

Code

MSR128

Topic

Methodological & Statistical Research, Study Approaches

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics, Clinical Trials

Disease

No Additional Disease & Conditions/Specialized Treatment Areas