Synthetic Data Generation Methods Using Artificial Intelligence (AI): A Simple New AI Tool That Generates Synthetic Data Using Generative Adversarial Network (CTGAN), Copula GANs, and Sequential Decision Tree Methods

Author(s)

Iftekhar Khan, PhD¹, Reynaldo Martina, PhD², Ralph Crott, MPH, MSc, PhD³, Rabiah Begum, MSc³, Saqib Ur Rehman, PhD³.
¹University of Warwick, Coventry, United Kingdom, ²Utrecht University, Utrecht, Netherlands, ³Regulatory Scientific and Health Solutions, Solihull, United Kingdom.

Presentation Documents

ECLIPTICA Poster 2025 v1.0.pdf

OBJECTIVES: There is increasing interest in synthetically generated data (SGD). The Joint Clinical Assessment (JCA) may require complete patient-level data for indirect treatment comparisons (ITCs) as part of the health technology assessment (HTA) process. The sharing and distribution of sensitive patient data can be limited by data privacy regulations. We demonstrate a newly developed AI agent that produces SGD using at least three approaches. SGD derived from randomized clinical trials (RCTs) and electronic health records (EHR) is shown to lead to the same conclusions as the real data.
METHODS: A new SGD AI tool employs Conditional Tabular Generative Adversarial Networks (CTGAN), Copula GANs and Sequential Decision Trees (SDT). For the GANs, a dual-network adversarial architecture comprising a generator and a discriminator was used. The performance (e.g. accuracy and quality) of each method was evaluated using Hospital Episode Statistics (HES) data (N=14,423) in a chronic kidney disease (CKD) population and RCT data (N=670) in a non-small cell lung cancer (NSCLC) population.
RESULTS: The CTGAN was trained on 14,428 patient-level healthcare resource use records collected between 2012 and 2015. The accuracy of the SGD was high: shape metric scores ranged from 0.905 to 0.942 (i.e. strong similarity while maintaining privacy). Scores were higher for the RCT data. In addition, for the RCT data, the hazard ratio (HR) for erlotinib vs best supportive care (BSC) reported in the original trial was 0.94 (95% CI: 0.81, 1.10; p=0.462). Using SGD, the corresponding HR was 0.93 (95% CI: 0.79, 1.12; p=0.481).
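The abstract does not state how the shape metric was computed. As a minimal sketch, assuming it resembles the Kolmogorov-Smirnov complement commonly used for numeric columns (1 minus the largest gap between the real and synthetic empirical distribution functions), a per-column score could be calculated as follows. The function name `ks_complement` and the toy values are hypothetical and are not taken from the study.

```python
from bisect import bisect_right


def ks_complement(real, synthetic):
    """Return 1 - KS statistic for two numeric samples (1.0 means identical ECDFs)."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    n_real, n_synth = len(real_sorted), len(synth_sorted)
    if n_real == 0 or n_synth == 0:
        raise ValueError("both samples must be non-empty")

    max_gap = 0.0
    # The largest gap between two step-function ECDFs occurs at an observed value.
    for x in set(real_sorted) | set(synth_sorted):
        cdf_real = bisect_right(real_sorted, x) / n_real
        cdf_synth = bisect_right(synth_sorted, x) / n_synth
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return 1.0 - max_gap


# Hypothetical toy column (e.g. length of stay in days), not study data.
real_column = [1, 2, 2, 3, 5, 8, 13, 21]
synthetic_column = [1, 2, 3, 3, 5, 9, 12, 20]
score = ks_complement(real_column, synthetic_column)
print(f"shape score: {score:.3f}")
```

A score close to 1 indicates that the synthetic column closely follows the marginal distribution of the real column. Categorical columns would need a different measure, such as a total variation distance complement.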
CONCLUSIONS: We present a simple new AI tool that imports the actual data and outputs corresponding SGD with a high probability of leading to the same decision as the real data. Our findings confirm that SGD can closely replicate real-world healthcare data, offering a practical and privacy-preserving alternative to using sensitive patient records.
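The abstract does not define how "the same decision" is judged. One plausible reading, shown here as a hedged sketch only, is that the real and synthetic analyses agree on whether the 95% confidence interval for the HR excludes 1. The class `HazardRatio` and the helper functions are hypothetical names; the numeric inputs are the HRs and intervals reported in the RESULTS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HazardRatio:
    estimate: float
    ci_lower: float
    ci_upper: float

    def excludes_null(self) -> bool:
        """True when the confidence interval does not contain HR = 1."""
        return self.ci_upper < 1.0 or self.ci_lower > 1.0


def same_decision(real: HazardRatio, synthetic: HazardRatio) -> bool:
    """Assumed criterion: both analyses agree on whether the CI excludes 1."""
    return real.excludes_null() == synthetic.excludes_null()


def estimate_within_ci(real: HazardRatio, synthetic: HazardRatio) -> bool:
    """Check that the synthetic point estimate lies inside the real-data CI."""
    return real.ci_lower <= synthetic.estimate <= real.ci_upper


# Values as reported in the abstract for erlotinib vs BSC.
real_hr = HazardRatio(estimate=0.94, ci_lower=0.81, ci_upper=1.10)
synthetic_hr = HazardRatio(estimate=0.93, ci_lower=0.79, ci_upper=1.12)

print("same decision:", same_decision(real_hr, synthetic_hr))
print("synthetic estimate inside real CI:", estimate_within_ci(real_hr, synthetic_hr))
```

Under this assumed criterion both intervals include 1, so the real and synthetic analyses point to the same conclusion of no statistically significant difference.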

Conference/Value in Health Info

2025-11, ISPOR Europe 2025, Glasgow, Scotland

Value in Health, Volume 28, Issue S2

Code

MSR193

Topic

Health Technology Assessment, Methodological & Statistical Research, Real World Data & Information Systems

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics

Disease

No Additional Disease & Conditions/Specialized Treatment Areas

Presentation (CTI)