Synthetic Data Generation Methods Using Artificial Intelligence (AI): A Simple New AI Tool That Generates Synthetic Data Using Generative Adversarial Network (CTGAN), Copula GANs, and Sequential Decision Tree Methods
Author(s)
Iftekhar Khan, PhD1, Reynaldo Martina, PhD2, Ralph Crott, MPH, MSc, PhD3, Rabiah Begum, MSc3, Saqib Ur Rehman, PhD3.
1University of Warwick, Coventry, United Kingdom, 2Utrecht University, Utrecht, Netherlands, 3Regulatory Scientific and Health Solutions, Solihull, United Kingdom.
OBJECTIVES: Interest in synthetically generated data (SGD) is increasing. The joint clinical assessment (JCA) may stipulate that complete patient-level data be available for indirect treatment comparisons (ITC) as part of the Joint Clinical and Health Technology Assessment (HTA) process, yet the sharing and distribution of sensitive patient data can be limited by data privacy regulations. We demonstrate a newly developed AI agent that produces SGD using at least three approaches. SGD from randomized clinical trials (RCTs) and electronic health records (EHR) is shown to lead to the same conclusions as those from the real data.
METHODS: The new SGD AI tool employs Conditional Tabular Generative Adversarial Networks (CTGAN), Copula GANs, and Sequential Decision Trees (SDT). For the GANs, a dual-network adversarial architecture comprising a generator and a discriminator was used. The performance (e.g., accuracy and quality) of each method was evaluated using Hospital Episode Statistics (HES) data (N=14,423) in a chronic kidney disease (CKD) population and RCT data (N=670) in a non-small cell lung cancer (NSCLC) population.
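To illustrate the sequential decision tree idea, the following is a minimal sketch only, not the tool's implementation. It assumes numeric columns, a fixed column order, and a small hand-written regression tree; the names `best_split`, `grow`, `leaf_values`, and `synthesize` are hypothetical. The first column is resampled from the real data, and each later column is drawn from the real values found in the tree leaf that matches the synthetic values generated so far.

```python
import random

def best_split(rows, target, features):
    """Return the (feature, threshold) that most reduces squared error, or None."""
    def sse(vals):
        if not vals:
            return 0.0
        mean = sum(vals) / len(vals)
        return sum((v - mean) ** 2 for v in vals)

    best, best_score = None, sse([r[target] for r in rows])
    for f in features:
        # Candidate thresholds: every observed value except the largest.
        for t in sorted({r[f] for r in rows})[:-1]:
            left = [r[target] for r in rows if r[f] <= t]
            right = [r[target] for r in rows if r[f] > t]
            score = sse(left) + sse(right)
            if score < best_score - 1e-12:
                best, best_score = (f, t), score
    return best

def grow(rows, target, features, min_leaf=5, depth=3):
    """Grow a small tree; each leaf stores the real target values that reach it."""
    can_split = depth > 0 and len(rows) >= 2 * min_leaf
    split = best_split(rows, target, features) if can_split else None
    if split is None:
        return {"leaf": [r[target] for r in rows]}
    f, t = split
    left = [r for r in rows if r[f] <= t]
    right = [r for r in rows if r[f] > t]
    # Simplification: if the best split leaves a child too small, stop here.
    if len(left) < min_leaf or len(right) < min_leaf:
        return {"leaf": [r[target] for r in rows]}
    return {"feature": f, "threshold": t,
            "left": grow(left, target, features, min_leaf, depth - 1),
            "right": grow(right, target, features, min_leaf, depth - 1)}

def leaf_values(tree, row):
    """Follow the splits for one row and return the real values in its leaf."""
    while "leaf" not in tree:
        go_left = row[tree["feature"]] <= tree["threshold"]
        tree = tree["left"] if go_left else tree["right"]
    return tree["leaf"]

def synthesize(real, order, n, seed=0):
    """Generate n synthetic rows, one column at a time, in the given order."""
    rng = random.Random(seed)
    first = order[0]
    # First column: resample its marginal distribution from the real data.
    synth = [{first: rng.choice(real)[first]} for _ in range(n)]
    for i in range(1, len(order)):
        # Later columns: condition on the columns already synthesized.
        tree = grow(real, order[i], order[:i])
        for row in synth:
            row[order[i]] = rng.choice(leaf_values(tree, row))
    return synth
```

Because leaf values are drawn directly from observed records, a sketch like this reproduces real values column by column; a production tool would typically add smoothing or other safeguards, which are not shown here.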
RESULTS: The CTGAN was trained on 14,428 patient-level healthcare resource use records collected between 2012 and 2015. The accuracy of the SGD was high, with shape metric scores between 0.905 and 0.942 (i.e., strong similarity while maintaining privacy). These scores were higher for the RCT data. In addition, for the RCT data, the hazard ratio (HR) for erlotinib vs BSC reported in the original trial was 0.94 (95% CI: 0.81, 1.10; p=0.462). Using SGD, the corresponding estimate was 0.93 (95% CI: 0.79, 1.12; p=0.481).
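The abstract does not define its shape metric. One common definition, used for example in the open-source SDMetrics library's column shape scores, is the complement of the Kolmogorov-Smirnov statistic for numeric columns and the complement of the total variation distance for categorical columns, both on a 0 to 1 scale where 1 means identical marginal distributions. The sketch below assumes that definition; whether the tool uses it is an assumption, and the function names are illustrative.

```python
from bisect import bisect_right
from collections import Counter

def ks_complement(real, synth):
    """1 minus the two-sample Kolmogorov-Smirnov statistic (numeric columns)."""
    r, s = sorted(real), sorted(synth)
    # The largest gap between the two empirical CDFs occurs at an observed value.
    d = max(
        abs(bisect_right(r, x) / len(r) - bisect_right(s, x) / len(s))
        for x in set(r) | set(s)
    )
    return 1.0 - d

def tv_complement(real, synth):
    """1 minus the total variation distance (categorical columns)."""
    pr, ps = Counter(real), Counter(synth)
    tvd = 0.5 * sum(
        abs(pr[c] / len(real) - ps[c] / len(synth)) for c in set(pr) | set(ps)
    )
    return 1.0 - tvd
```

Under this definition, a score of 0.905 would mean that the real and synthetic distributions of a column differ by at most 0.095 in cumulative probability (numeric) or in total probability mass (categorical).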
CONCLUSIONS: We present a simple new AI tool that imports the actual data and outputs corresponding SGD with a high probability of supporting the same decision as the real data. Our findings confirm that SGD can closely replicate real-world healthcare data, offering a practical and privacy-preserving alternative to using sensitive patient records.
Conference/Value in Health Info
2025-11, ISPOR Europe 2025, Glasgow, Scotland
Value in Health, Volume 28, Issue S2
Code
MSR193
Topic
Health Technology Assessment, Methodological & Statistical Research, Real World Data & Information Systems
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas