Generation of Synthetic Patient Data to Overcome Machine Learning Limitations in Healthcare Research: A Systematic Review of Methods, Performance, and Use Cases
Author(s)
Morgane Swital, PhD1, Tiphaine Porte, MSc1, Nathanael Sedmak, MSc1, Clara Bouvard, MSc1, Arthur Gougeon, MSc2, Flavien Roux, MSc1, Frédéric Mistretta, MSc1, Audrey Lajoinie, PharmD, PhD1.
1RCTs, Lyon, France, 2Laboratory of Biometry and Evolutionary Biology, UMR 5558, CNRS, University of Lyon 1, University of Lyon, Villeurbanne, France.
OBJECTIVES: The generation of synthetic patient data is an emerging strategy to support the development of machine learning (ML) models in healthcare, particularly in settings with small sample sizes or imbalanced outcomes. Artificially augmenting datasets in this way can improve model robustness and performance. This systematic literature review provides an overview of current generation methods, identifies common challenges, and explores real-world applications of synthetic data in healthcare ML.
METHODS: A literature search was conducted in MEDLINE to identify studies published since 2020 on the generation of synthetic patient data for ML applications. Titles and abstracts (Ti/Abs) were screened, followed by full-text review to determine inclusion.
RESULTS: A total of 176 studies were initially identified through title and abstract screening. After full-text review, 6 studies were included. Synthetic data were generated using Generative Adversarial Networks (GANs, n=3), Synthetic Minority Over-sampling Technique (SMOTE, n=1), Conditional Tabular GAN (CTGAN, n=1), and Bayesian simulation (n=1). Data sources included clinical registries (n=2), electronic health records (n=3), and medical imaging datasets (n=1). Synthetic data were primarily used to enrich datasets with limited volume or class imbalance, enabling improved model training and evaluation. Reported outcomes demonstrated enhanced model performance, including gains in accuracy, F1-score, and AUC. For example, one study reported an increase in F1-score from 0.72 to 0.84, while another observed a 10% improvement in AUC. Robustness was assessed through cross-validation (n=4), comparison with real-world data (n=3), and sensitivity analyses (n=2). One study emphasized the use of synthetic data to preserve patient privacy while maintaining predictive validity.
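As an illustration of the oversampling idea behind SMOTE, the following is a minimal sketch in plain Python, not the implementation used in any of the reviewed studies. It creates each synthetic minority-class record by interpolating between a randomly chosen minority sample and one of its nearest minority neighbours. The function name `smote_oversample`, its parameters, and the toy data are hypothetical and chosen only for this example.

```python
import math
import random


def smote_oversample(minority, n_new, k=3, seed=0):
    """Return n_new synthetic points interpolated between minority neighbours.

    minority: list of equal-length numeric tuples (minority-class records).
    k: number of nearest minority neighbours considered for each base point.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)          # seeded for reproducibility
    k = min(k, len(minority) - 1)      # cannot use more neighbours than exist
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority points by Euclidean distance to the base point.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()             # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical toy minority class with two numeric features.
minority = [(1.0, 2.0), (1.5, 1.8), (1.2, 2.4), (0.9, 2.1)]
synthetic = smote_oversample(minority, n_new=6)
```

Because every synthetic point lies on a segment between two real minority records, the new records stay within the range of the observed minority data. This sketch handles numeric features only; categorical variables, which are common in patient data, would need a different treatment.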
CONCLUSIONS: Synthetic patient data generation is a promising strategy to improve the performance and robustness of machine learning models in healthcare. The reviewed studies show that synthetic data can effectively address data limitations while supporting privacy-preserving model development. Standardized evaluation frameworks and real-world implementation are needed to fully unlock its potential in clinical decision-making and health technology assessment.
Conference/Value in Health Info
2025-11, ISPOR Europe 2025, Glasgow, Scotland
Value in Health, Volume 28, Issue S2
Code
MSR115
Topic
Epidemiology & Public Health, Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas