A PILOT ASSESSMENT OF LLM-GENERATED SYNTHETIC COHORTS: A FIRST STEP TOWARD ROBUST SYNTHETIC CONTROL ARMS
Author(s)
Manuel Cossio, MPhil, MS1, Deepa Jahagirdar, PhD2, Anupama Vasudevan, MPH, PhD3
1Cytel, Director, Artificial Intelligence Lead, Dubendorf, Switzerland, 2Cytel Inc, Bellevue, WA, USA, 3Cytel, Plano, TX, USA
OBJECTIVES: This study evaluates two methodologies based on large language models (LLMs) for generating synthetic clinical trial datasets intended for use in External Control Arms (ECAs). It compares direct generative output against automated code execution in terms of statistical fidelity and reproducibility.
METHODS: Using data from the Vitamin D and Omega-3 Trial (VITAL; NCT01169259), sponsored by Brigham and Women's Hospital, we conducted two experiments. The first method involved direct generation, in which an LLM produced a synthetic dataset in an Excel file (n=100) by interpreting the original data alongside a variable dictionary. The second method used a code-augmented approach, in which the LLM drafted a Python pipeline performing bootstrap resampling and anonymization. This script implemented an algorithmic noise filter that identified continuous numeric variables and applied additive Gaussian noise scaled to 5 percent of each variable's original standard deviation. Values were clipped to the original ranges to maintain physiological plausibility.
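The code-augmented pipeline described above (bootstrap resampling, 5%-of-SD Gaussian noise on continuous variables, clipping to observed ranges) could be sketched as follows. This is a minimal illustration, not the authors' actual script; the function name, the >10-unique-values heuristic for flagging a variable as continuous, and the random seed are assumptions introduced here for clarity.

```python
import numpy as np
import pandas as pd

def synthesize_cohort(df: pd.DataFrame, n: int = 100,
                      noise_frac: float = 0.05, seed: int = 0) -> pd.DataFrame:
    """Bootstrap-resample a trial dataset, then add calibrated Gaussian
    noise to continuous variables and clip to the original ranges."""
    rng = np.random.default_rng(seed)
    # Bootstrap: draw n rows with replacement from the original data.
    synth = df.sample(n=n, replace=True, random_state=seed).reset_index(drop=True)
    for col in synth.select_dtypes(include="number").columns:
        # Heuristic filter (assumption): treat variables with many distinct
        # values as continuous; leave coded categoricals (e.g. sex) untouched.
        if df[col].nunique() > 10:
            sd = df[col].std()
            # Additive noise scaled to 5% of the variable's original SD.
            synth[col] = synth[col] + rng.normal(0.0, noise_frac * sd, size=n)
            # Clip to the observed range for physiological plausibility.
            synth[col] = synth[col].clip(df[col].min(), df[col].max())
    return synth
```

Because the noise standard deviation is small relative to the data's own spread, population-level means and the marginal distributions of categorical variables are largely preserved, consistent with the results reported below.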
RESULTS: Both methods successfully produced synthetic cohorts (n=100). Direct generation completed in 23 seconds via a single prompt, while the code-augmented method required 40 seconds and 12 iterations of refinement. The code-augmented methodology demonstrated better distributional fidelity: the synthetic mean age was 67.5 years (original: 66.6) and mean BMI was 27.6 (original: 28.1). Mean coded values for sex (1.46 vs. 1.51) and race (1.44 vs. 1.47) were likewise preserved, indicating that the sampling and noise-injection logic maintained the trial's demographic balance.
CONCLUSIONS: While direct LLM generation offers rapid prototyping, code-based generation provides the transparency and granular statistical control essential for regulatory-grade external control arms. Calibrated Gaussian noise effectively balances data privacy with the preservation of population-level characteristics in trial-derived datasets. Future work should systematically evaluate re-identification risk under adversarial attack models and compare noise-based anonymization against alternative privacy-preserving techniques such as differential privacy.
Conference/Value in Health Info
2026-05, ISPOR 2026, Philadelphia, PA, USA
Value in Health, Volume 29, Issue S6
Code
PT12
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
SDC: Cardiovascular Disorders (including MI, Stroke, Circulatory)