Augmenting External Control Arms Using Synthetic Data Generation
Author(s)
O Meachair S¹, Malagarriga D¹, Mosquera L²
¹Aetion, Barcelona, Catalunya, Spain; ²Aetion, Ottawa, ON, Canada
OBJECTIVES: To determine whether Synthetic Data Generation (SDG) methods can improve parameter estimation in rare disease studies where limited or no control data are available from clinical trials, and where external control arms contain few patients who meet the inclusion/exclusion criteria and can be used to estimate treatment effectiveness.
METHODS: We compare two SDG methods for tabular data, Sequential Decision Trees and Bayesian Networks, with two standard baseline approaches: propensity score weighting of external control data, and bootstrap sampling of available data. The methods are compared on two datasets: a simulated dataset where all true parameters are known, and a subset of diabetes patients from the Marketscan dataset. Each dataset is split into two cohorts: a ‘clinical trial’ cohort, in which patients are randomly assigned to the treatment or control arm, and an ‘external’ arm, which is a biased sample from the overall population. Available control and external data are pooled as input to the SDG methods and augmented with synthetic data. All methods are compared in terms of the bias and variance of the estimated median progression-free survival (PFS) in the control data, relative to the known population PFS.
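The bootstrap baseline and the bias/variance comparison described above can be illustrated with a minimal sketch. It is not the authors' implementation: the exponential PFS model, the hazard rate, the cohort sizes, the selection rule that biases the external arm, and all function names are assumptions made for illustration, and censoring is ignored for simplicity.

```python
import math
import random
import statistics


def bootstrap_median_estimates(times, n_boot, rng):
    """Return the sample median of each of n_boot bootstrap resamples."""
    n = len(times)
    return [statistics.median(rng.choices(times, k=n)) for _ in range(n_boot)]


def bias_and_variance(estimates, true_value):
    """Bias (mean estimate minus truth) and variance of a set of estimates."""
    bias = statistics.fmean(estimates) - true_value
    variance = statistics.pvariance(estimates)
    return bias, variance


rng = random.Random(0)
rate = 0.1                          # hypothetical hazard rate (per month)
true_median = math.log(2) / rate    # known population median PFS

# Small, unbiased 'clinical trial' control arm (size is an assumption).
trial_control = [rng.expovariate(rate) for _ in range(20)]

# Biased 'external' arm: patients with longer PFS are more likely to be
# included (an illustrative selection rule, not the one used in the study).
population = [rng.expovariate(rate) for _ in range(2000)]
accepted = [t for t in population
            if rng.random() < min(1.0, t / (3 * true_median))]
external = accepted[:60]

# Compare the trial-only estimate with the naively pooled estimate.
results = {}
for label, data in [("trial only", trial_control),
                    ("pooled", trial_control + external)]:
    estimates = bootstrap_median_estimates(data, n_boot=500, rng=rng)
    results[label] = bias_and_variance(estimates, true_median)
    bias, var = results[label]
    print(f"{label}: bias={bias:.2f}, variance={var:.2f}")
```

In this sketch the bias and variance are computed over bootstrap resamples of a single simulated dataset; a fuller comparison would also repeat the simulation itself and would use a censoring-aware estimator such as Kaplan-Meier for the median.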
RESULTS: We show that under certain conditions SDG methods provide more accurate effect estimates than baseline methods. For the Marketscan data, Sequential Decision Trees provide less biased estimates than the other methods in the majority of data scenarios, while for the simulated data, Bayesian Networks perform better than or comparably to the baseline approaches.
CONCLUSIONS: SDG improved the estimation of population parameters in certain data scenarios. Further work is required to determine which SDG method is appropriate given the characteristics of each dataset, and how to assess the performance of SDG methods in real-world scenarios where the true population parameters are not known.
Conference/Value in Health Info
Value in Health, Volume 27, Issue 12, S2 (December 2024)
Code
MSR175
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas