The Impact of Hallucinations in Synthetic Health Data on Prognostic Machine Learning Models
Author(s)
Lisa Pilgram, MD, Samer El Kababji, PhD, Dan Liu, PhD, Khaled El Emam, PhD.
Children’s Hospital of Eastern Ontario Research Institute, Ottawa, ON, Canada.
OBJECTIVES: The application of synthetic data generation (SDG) as a privacy-preserving mechanism for sharing health data is increasing. Hallucinations are commonly observed in text-generating models but may also occur in tabular SDG. Hallucinations can erode trust in the utility of the generated data. This study investigates (1) the hallucination rate (HR) during tabular SDG, and (2) whether hallucinations degrade the performance of downstream prognostic models.
METHODS: We used 6,354 dataset variants derived from 12 large real-world health datasets by increasing their complexity. From these variants, training datasets were sampled for SDG using 7 different generative models. The hallucination rate (HR) was defined as the proportion of synthetic records that do not exist in the population. Downstream utility was assessed by training a gradient-boosted decision tree (GBDT) classifier on the synthetic data and testing it on a real holdout dataset.
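For illustration, the sketch below implements the two evaluation steps described above: a hallucination rate based on exact record matching against the population, and a train-on-synthetic/test-on-real (TSTR) utility score using a scikit-learn GBDT. The exact matching rule, the AUROC metric, the column name "outcome", and the choice of scikit-learn are assumptions made for this sketch; the abstract does not specify these details.

```python
# Minimal sketch of the HR and downstream-utility evaluations (assumptions noted above).
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score


def hallucination_rate(synthetic: pd.DataFrame, population: pd.DataFrame) -> float:
    """Proportion of synthetic records with no exact match in the population."""
    # Left-join the synthetic records onto the de-duplicated population on all
    # shared columns; rows flagged "left_only" have no exact counterpart.
    merged = synthetic.merge(population.drop_duplicates(), how="left", indicator=True)
    return float((merged["_merge"] == "left_only").mean())


def tstr_utility(synthetic: pd.DataFrame, real_holdout: pd.DataFrame,
                 target: str = "outcome") -> float:
    """Train a GBDT on synthetic data, evaluate AUROC on a real holdout (TSTR)."""
    # Assumes numeric features; categorical variables would need encoding first.
    model = GradientBoostingClassifier(random_state=0)
    model.fit(synthetic.drop(columns=[target]), synthetic[target])
    scores = model.predict_proba(real_holdout.drop(columns=[target]))[:, 1]
    return roc_auc_score(real_holdout[target], scores)
```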
RESULTS: The median HR was 89.6% (IQR 66.0-99.3%). The odds of hallucination increased significantly with higher complexity (fixed effect) across all datasets (random effects). At minimum complexity, Sequential Decision Trees had the smallest odds (1.49, 95%CI [0.39, 5.70]) and the Variational Autoencoder the highest (14.24, 95%CI [2.12, 95.54]); complexity was positively associated with HR across generators, with odds ratios ranging from 1.03 for Bayesian Networks (95%CI [1.01, 1.05]) to 1.16 for Normalizing Flows (95%CI [1.11, 1.22]). The effect of hallucinations on downstream utility was inconsistent across generators, with no effect for 6/7 generators and a negative effect for the Generative Adversarial Network (-0.02, 95%CI [-0.03, -0.02]).
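The fixed-effect/random-effects phrasing above implies a mixed-effects logistic model of hallucination odds. The sketch below shows one possible specification, with complexity as a fixed effect and a random intercept per source dataset, using statsmodels' Bayesian binomial mixed GLM; the authors' actual model, software, and variable coding are not stated in the abstract, and the data generated here are purely illustrative.

```python
# One possible mixed-effects logistic specification (illustrative only).
import numpy as np
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

# Dummy long-format data: one row per synthetic record, with a 0/1
# hallucination flag, a complexity score, and the source dataset identifier.
rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "dataset": rng.integers(0, 12, n).astype(str),
    "complexity": rng.uniform(0, 10, n),
})
dataset_effect = {d: rng.normal(0, 0.5) for d in df["dataset"].unique()}
logit = -1.0 + 0.1 * df["complexity"] + df["dataset"].map(dataset_effect)
df["hallucinated"] = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# Complexity as a fixed effect; dataset as a random intercept (variance component).
model = BinomialBayesMixedGLM.from_formula(
    "hallucinated ~ complexity",
    {"dataset": "0 + C(dataset)"},
    df,
)
result = model.fit_vb()
print(result.summary())

# Exponentiating the complexity coefficient gives a per-unit odds ratio,
# comparable in spirit to the generator-specific odds ratios reported above.
print("odds ratio for complexity:", np.exp(result.fe_mean[1]))
```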
CONCLUSIONS: Our findings suggest that hallucinated records can form a major portion of synthetic data, with HR rising as dataset complexity increases. The rate of increase in HR varied among generators, with Normalizing Flows at the upper end. Hallucinations produced by Sequential Trees, Adversarial Random Forests, Variational Autoencoders, Normalizing Flows, and Bayesian Networks did not impact the performance of the GBDT.
Conference/Value in Health Info
2025-05, ISPOR 2025, Montréal, Quebec, CA
Value in Health, Volume 28, Issue S1
Code
P16
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas