A PILOT ASSESSMENT OF LLM-GENERATED SYNTHETIC COHORTS: A FIRST STEP TOWARD ROBUST SYNTHETIC CONTROL ARMS
Author(s)
Manuel Cossio, MPhil, MS1, Deepa Jahagirdar, PhD2, Anupama Vasudevan, MPH, PhD3
1Cytel, Director, Artificial Intelligence Lead, Dubendorf, Switzerland, 2Cytel Inc, Bellevue, WA, USA, 3Cytel, Plano, TX, USA
OBJECTIVES: This study evaluates two methodologies based on large language models (LLMs) for generating synthetic clinical trial datasets intended for use in External Control Arms (ECAs). It compares direct generative output against automated code execution in terms of statistical fidelity and reproducibility.
METHODS: Using data from the Vitamin D and Omega-3 Trial (VITAL; NCT01169259), sponsored by Brigham and Women's Hospital, we conducted two experiments. The first method involved direct generation, in which an LLM produced a synthetic dataset in an Excel file (n=100) by interpreting the original data alongside a variable dictionary. The second method used a code-augmented approach, in which the LLM drafted a Python pipeline performing bootstrap resampling and anonymization. This script implemented an algorithmic noise filter that identified continuous numeric variables and applied additive Gaussian noise scaled to 5 percent of each variable's original standard deviation. Values were clipped to the original ranges to maintain physiological plausibility.
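The code-augmented pipeline described above (bootstrap resampling, 5%-of-SD Gaussian noise on continuous variables, clipping to observed ranges) could be sketched as follows. This is a minimal illustration, not the authors' actual script; the function name, the >10-unique-values heuristic for flagging a variable as continuous, and the random seed are assumptions introduced here for clarity.

```python
import numpy as np
import pandas as pd

def synthesize_cohort(df: pd.DataFrame, n: int = 100,
                      noise_frac: float = 0.05, seed: int = 0) -> pd.DataFrame:
    """Bootstrap-resample a trial dataset, then add calibrated Gaussian
    noise to continuous variables and clip to the original ranges."""
    rng = np.random.default_rng(seed)
    # Bootstrap: draw n rows with replacement from the original data.
    synth = df.sample(n=n, replace=True, random_state=seed).reset_index(drop=True)
    for col in synth.select_dtypes(include="number").columns:
        # Heuristic filter (assumption): treat variables with many distinct
        # values as continuous; leave coded categoricals (e.g. sex) untouched.
        if df[col].nunique() > 10:
            sd = df[col].std()
            # Additive noise scaled to 5% of the variable's original SD.
            synth[col] = synth[col] + rng.normal(0.0, noise_frac * sd, size=n)
            # Clip to the observed range for physiological plausibility.
            synth[col] = synth[col].clip(df[col].min(), df[col].max())
    return synth
```

Because the noise standard deviation is small relative to the data's own spread, population-level means and the marginal distributions of categorical variables are largely preserved, consistent with the results reported below.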
RESULTS: Both methods successfully produced synthetic cohorts (n=100). Direct generation completed in 23 seconds via a single prompt, while the code-augmented method required 40 seconds and 12 iterations of refinement. The code-augmented methodology demonstrated better distributional fidelity: the synthetic mean age was 67.5 years (original: 66.6) and mean BMI was 27.6 (original: 28.1). Mean coded values for sex (1.46 vs. 1.51) and race (1.44 vs. 1.47) were likewise preserved, indicating that the sampling and noise-injection logic maintained the trial's demographic balance.
CONCLUSIONS: While direct LLM generation offers rapid prototyping, code-based generation provides the transparency and granular statistical control essential for regulatory-grade external control arms. Calibrated Gaussian noise effectively balances data privacy with the preservation of population-level characteristics in trial-derived datasets. Future work should systematically evaluate re-identification risk under adversarial attack models and compare noise-based anonymization against alternative privacy-preserving techniques such as differential privacy.
Conference/Value in Health Info
2026-05, ISPOR 2026, Philadelphia, PA, USA
Value in Health, Volume 29, Issue S6
Code
PT12
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
SDC: Cardiovascular Disorders (including MI, Stroke, Circulatory)