GENERATIVE AI SECURES SYNTHETIC DATASETS OF NSCLC TRIAL COHORTS FOR RWD ANALYSIS: A PRIVACY-PRESERVING CASE STUDY
Author(s)
Manuel Cossio, MPhil, MS (1); Ramiro E. Gilardino, MSc, MD (2)
(1) Cytel, Director, Artificial Intelligence Lead, Dubendorf, Switzerland; (2) Independent, Dubendorf, Switzerland
OBJECTIVES: Assess whether a two-agent Large Language Model (LLM) pipeline can reliably extract and synthesize detailed clinical trial population parameters into privacy-preserving synthetic cohorts suitable for health economics and outcomes research (HEOR). The study addresses a persistent HEOR challenge: limited access to granular trial data due to patient confidentiality constraints, which restricts evidence generalization for value assessment and real-world evidence (RWE) analyses.
METHODS: We developed a sequential two-agent LLM framework applied to a published Non-Small Cell Lung Cancer (NSCLC) trial (CHRYSALIS-2; N=105). Agent 1 extracted structured population characteristics from trial reports (demographics, ECOG performance status, mutation prevalence). Agent 2 generated executable Python-based scripts to recreate synthetic patient-level datasets consistent with reported distributions. Validation occurred in two stages: (1) baseline extraction of core cohort characteristics; (2) refined prompt engineering to capture granular atypical EGFR mutation profiles (e.g., G719X). Synthetic outputs were evaluated against published source data across six predefined quality dimensions, including Fidelity (concordance with source data), Correctness (absence of fabricated features), and Reproducibility.
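As an illustration of the kind of script Agent 2 might emit, the following is a minimal sketch, not the authors' code. It assumes the extracted population parameters arrive as per-variable category proportions and that variables are sampled independently (the abstract does not say how joint distributions were handled). The proportions in `REPORTED` are hypothetical placeholders, not values from CHRYSALIS-2; only N=105 comes from the abstract. The names `allocate_counts` and `synthesize_cohort` are likewise invented for this example.

```python
import random
from collections import Counter

# HYPOTHETICAL placeholder marginals for illustration only.
# These are NOT the proportions reported for CHRYSALIS-2.
REPORTED = {
    "sex": {"female": 0.60, "male": 0.40},
    "ecog": {0: 0.30, 1: 0.70},
    "egfr_mutation": {"G719X": 0.40, "other_atypical": 0.60},
}


def allocate_counts(proportions, n):
    """Largest-remainder rounding so the category counts sum exactly to n."""
    raw = {cat: p * n for cat, p in proportions.items()}
    counts = {cat: int(v) for cat, v in raw.items()}
    leftover = n - sum(counts.values())
    # Hand the remaining patients to the categories with the largest fractions.
    by_fraction = sorted(raw, key=lambda cat: raw[cat] - counts[cat], reverse=True)
    for cat in by_fraction[:leftover]:
        counts[cat] += 1
    return counts


def synthesize_cohort(marginals, n, seed=0):
    """Build n synthetic patient rows whose marginals match the reported ones.

    Assumption: variables are treated as independent; each column is filled
    with exact category counts and then shuffled with a seeded generator.
    """
    rng = random.Random(seed)
    columns = {}
    for var, proportions in marginals.items():
        counts = allocate_counts(proportions, n)
        values = [cat for cat, c in counts.items() for _ in range(c)]
        rng.shuffle(values)
        columns[var] = values
    return [{var: columns[var][i] for var in marginals} for i in range(n)]


# N=105 is the cohort size stated in the abstract.
cohort = synthesize_cohort(REPORTED, n=105, seed=42)
```

Fixing the seed makes repeated runs produce identical rows, which is one simple way to obtain the Reproducibility property named above. Allocating exact counts rather than drawing each patient at random keeps the synthetic marginals as close to the reported ones as a cohort of this size allows.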
RESULTS: Both stages successfully generated synthetic datasets with high reproducibility. Stage 1 achieved strong performance for demographic and clinical variables but demonstrated moderate fidelity for molecular features due to incomplete extraction of atypical mutations. Stage 2 improved representation of complex mutational architecture, yielding approximately 30% higher correctness in granular mutation prevalence compared with Stage 1. However, despite these gains, low-frequency hallucinated mutations that were not reported in the source trial persisted in the Stage 2 output and limited overall fidelity.
CONCLUSIONS: The present architecture offers a scalable, reproducible approach to generating privacy-compliant synthetic clinical trial populations, with potential applications in early HEOR modeling, RWE generalization, and evidence synthesis. However, the emergence of fabricated clinical features when modeling complex variables represents a risk for decision-grade use. Future research should incorporate automated validation and constraint-based verification layers to ensure synthetic data fidelity before integration into high-stakes HTA and pricing analyses.
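One possible shape for the constraint-based verification layer proposed above is sketched below. This is an assumption about how such a check could work, not a description of the authors' planned implementation. The function `verify_column`, the 0.02 tolerance, the example proportions, and the `UNREPORTED_VARIANT` label are all hypothetical. The check flags any category absent from the source (a Correctness violation) and any category whose synthetic share drifts from the reported share (a Fidelity violation).

```python
from collections import Counter


def verify_column(values, reported, tolerance=0.02):
    """Return a list of violation messages for one categorical variable.

    values:    synthetic values for the variable, one per patient
    reported:  category -> proportion as extracted from the source
    tolerance: hypothetical maximum absolute gap in proportion
    """
    n = len(values)
    if n == 0:
        return ["empty column"]
    observed = Counter(values)
    issues = []
    # Correctness: a category that the source never reported is treated as fabricated.
    for cat, count in observed.items():
        if cat not in reported:
            issues.append(f"unreported category {cat!r} ({count} rows)")
    # Fidelity: the observed share should sit within tolerance of the reported share.
    for cat, p in reported.items():
        share = observed.get(cat, 0) / n
        if abs(share - p) > tolerance:
            issues.append(f"{cat!r}: observed {share:.3f}, reported {p:.3f}")
    return issues


# Hypothetical example: one fabricated low-frequency category in 100 rows.
reported_example = {"G719X": 0.50, "other_atypical": 0.50}
synthetic_example = (
    ["G719X"] * 50 + ["other_atypical"] * 49 + ["UNREPORTED_VARIANT"] * 1
)
issues = verify_column(synthetic_example, reported_example)
```

In this example only the fabricated category is flagged, because the two reported categories stay within tolerance. A gate of this kind could reject a synthetic dataset before it reaches downstream HTA or pricing analyses, although the right tolerance would need to be set per variable.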
Conference/Value in Health Info
2026-05, ISPOR 2026, Philadelphia, PA, USA
Value in Health, Volume 29, Issue S6
Code
RWD18
Topic
Real World Data & Information Systems
Topic Subcategory
Data Protection, Integrity, & Quality Assurance
Disease
SDC: Oncology