High-Fidelity Synthetic Patient Pathway Generation and Validation for NSCLC Using a Tabular Variational Autoencoder (TVAE)
Author(s)
Abir Tadmouri, MPH, PhD1, Salma Barkaoui, PhD2, Mohammed Bennani, PhD3, Jerome Vetillard, PhD, MD2, Hadhami Mejbri, Master2.
1Pierre Fabre, Boulogne Billancourt, France, 2Qualees, Paris, France, 3Qualees, Paris, France.
OBJECTIVES: To enhance statistical power in evaluating treatment efficacy in oncology, we aimed to augment a real-world dataset of non-small cell lung cancer (NSCLC) patients by generating and validating high-quality synthetic data. We used data from 200 patients in a previous study cohort and applied generative AI methods to expand the dataset while preserving its clinical and statistical integrity and its information capacity.
METHODS: Treatment lines were organized sequentially, and patients were categorized by treatment outcome (completion, medical discontinuation, death, or interruption due to complications). Key domains (demographics, medical history, line medications, adverse events, disease response, and death details) were harmonized using a corpus. Data inconsistencies were resolved, and missing values were imputed using the SoftImpute algorithm. A TVAE was then used to generate a synthetic cohort of 500 patients. We evaluated the synthetic data with a three-tier validation framework: univariate (Kolmogorov-Smirnov test, Wasserstein distance, Kullback-Leibler divergence), multivariate (correlation and mutual information), and global Machine Learning Utility (MLU). For MLU, we trained Random Forest classifiers on the real data and on the synthetic data and evaluated both on the same real test set.
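The univariate tier named above can be illustrated with a minimal pure-Python sketch. The abstract does not say which implementation was used or how raw distances were converted into percentage similarity scores, so the function names (`ks_statistic`, `wasserstein_1d`, `kl_divergence`), the 10-bin histogram, and the sample values are all assumptions made for illustration. Libraries such as SciPy offer equivalent routines (for example `scipy.stats.ks_2samp` and `scipy.stats.wasserstein_distance`).

```python
from bisect import bisect_right
import math


def _ecdf(sorted_values, x):
    """Empirical CDF of a pre-sorted sample, evaluated at x."""
    return bisect_right(sorted_values, x) / len(sorted_values)


def ks_statistic(real, synth):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the ECDFs."""
    a, b = sorted(real), sorted(synth)
    points = sorted(set(a) | set(b))
    return max(abs(_ecdf(a, x) - _ecdf(b, x)) for x in points)


def wasserstein_1d(real, synth):
    """1-D Wasserstein-1 distance: area between the two ECDFs."""
    a, b = sorted(real), sorted(synth)
    points = sorted(set(a) | set(b))
    total = 0.0
    for left, right in zip(points, points[1:]):
        total += abs(_ecdf(a, left) - _ecdf(b, left)) * (right - left)
    return total


def kl_divergence(real, synth, bins=10, eps=1e-9):
    """KL(real || synth) on a shared equal-width histogram (bins is an assumption)."""
    lo = min(min(real), min(synth))
    hi = max(max(real), max(synth))
    width = (hi - lo) / bins or 1.0  # guard against a constant column

    def hist(values):
        counts = [0] * bins
        for v in values:
            counts[min(int((v - lo) / width), bins - 1)] += 1
        # Additive smoothing keeps empty bins from producing log(0).
        return [(c + eps) / (len(values) + eps * bins) for c in counts]

    p, q = hist(real), hist(synth)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Illustrative values only (not study data): one numeric column per cohort.
real_age = [54, 61, 63, 67, 70, 72, 75]
synth_age = [55, 60, 64, 66, 69, 73, 74, 78]
print(ks_statistic(real_age, synth_age))
print(wasserstein_1d(real_age, synth_age))
print(kl_divergence(real_age, synth_age))
```

For all three measures, lower values mean the synthetic column is closer to the real one. Any mapping to a percentage similarity would be an additional design choice not described in the abstract.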
RESULTS: The synthetic dataset performed well across the validation metrics, with distribution similarity the weakest: distribution similarity 60.4%, correlation preservation 93.5%, geometric similarity 84.4%, information similarity 72.1%, and MLU 90.7%. The MLU score means that the synthetic-trained model achieved 90.7% of the predictive accuracy of the real-trained model, indicating that the synthetic dataset preserves the predictive patterns essential for machine learning. The overall quality score of 80.2% further supports the reliability and fidelity of the generated data.
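The reported figures can be cross-checked with a short arithmetic sketch. The abstract does not state how the overall quality score was aggregated. The reported 80.2% is, however, consistent with an unweighted mean of the five component scores, and the code below treats that as an assumption. The helper `mlu_score` is a hypothetical name that expresses MLU as the ratio of synthetic-trained to real-trained accuracy, as the RESULTS text describes it. The accuracies passed to it are placeholders, not study values.

```python
from statistics import mean

# Component scores as reported in RESULTS (percent).
reported = {
    "distribution_similarity": 60.4,
    "correlation_preservation": 93.5,
    "geometric_similarity": 84.4,
    "information_similarity": 72.1,
    "mlu": 90.7,
}

# Assumption: the overall score is the unweighted mean of the five components.
overall = mean(reported.values())
print(round(overall, 1))


def mlu_score(acc_synth_trained, acc_real_trained):
    """MLU as a percentage: synthetic-trained accuracy relative to real-trained.

    Both accuracies are assumed to be measured on the same real test set.
    """
    if acc_real_trained <= 0:
        raise ValueError("real-trained accuracy must be positive")
    return 100.0 * acc_synth_trained / acc_real_trained


# Placeholder accuracies for illustration only (not reported in the abstract).
print(mlu_score(0.68, 0.75))
```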
CONCLUSIONS: The high utility and fidelity of the synthetic NSCLC cohort underscore its readiness for downstream clinical research applications, including predictive modeling and treatment optimization. Utility (MLU) is the most critical of these measures, as it directly assesses whether the synthetic data can support predictive modeling as real data does. This approach offers a scalable, privacy-preserving framework for extending clinical datasets and advancing AI-driven oncology research.
Conference/Value in Health Info
2025-11, ISPOR Europe 2025, Glasgow, Scotland
Value in Health, Volume 28, Issue S2
Code
MSR117
Topic
Methodological & Statistical Research, Real World Data & Information Systems, Study Approaches
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas, Oncology