The Impact of Hallucinations in Synthetic Health Data on Prognostic Machine Learning Models
Author(s)
Lisa Pilgram, MD, Samer El Kababji, PhD, Dan Liu, PhD, Khaled El Emam, PhD.
Children’s Hospital of Eastern Ontario Research Institute, Ottawa, ON, Canada.
OBJECTIVES: The application of synthetic data generation (SDG) as a privacy-preserving mechanism for sharing health data is increasing. Hallucinations are commonly observed in text-generating models but may also occur in tabular SDG. Hallucinations can erode trust in the utility of the generated data. This study investigates (1) the hallucination rate (HR) during tabular SDG, and (2) whether hallucinations degrade the performance of downstream prognostic models.
METHODS: We used 6,354 dataset variants derived from 12 large real-world health datasets by increasing their complexity. From these variants, training datasets were sampled for SDG using 7 different generative models. The hallucination rate (HR) was defined as the proportion of synthetic records that do not exist in the population. Downstream utility was assessed by training a gradient-boosted decision tree (GBDT) classifier on the synthetic data and testing it on a real holdout dataset.
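For illustration, the sketch below implements the two evaluation steps described above: a hallucination rate based on exact record matching against the population, and a train-on-synthetic/test-on-real (TSTR) utility score using a scikit-learn GBDT. The exact matching rule, the AUROC metric, the column name "outcome", and the choice of scikit-learn are assumptions made for this sketch; the abstract does not specify these details.

```python
# Minimal sketch of the HR and downstream-utility evaluations (assumptions noted above).
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score


def hallucination_rate(synthetic: pd.DataFrame, population: pd.DataFrame) -> float:
    """Proportion of synthetic records with no exact match in the population."""
    # Left-join the synthetic records onto the de-duplicated population on all
    # shared columns; rows flagged "left_only" have no exact counterpart.
    merged = synthetic.merge(population.drop_duplicates(), how="left", indicator=True)
    return float((merged["_merge"] == "left_only").mean())


def tstr_utility(synthetic: pd.DataFrame, real_holdout: pd.DataFrame,
                 target: str = "outcome") -> float:
    """Train a GBDT on synthetic data, evaluate AUROC on a real holdout (TSTR)."""
    # Assumes numeric features; categorical variables would need encoding first.
    model = GradientBoostingClassifier(random_state=0)
    model.fit(synthetic.drop(columns=[target]), synthetic[target])
    scores = model.predict_proba(real_holdout.drop(columns=[target]))[:, 1]
    return roc_auc_score(real_holdout[target], scores)
```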
RESULTS: The median HR was 89.6% (IQR 66.0-99.3%). The odds of hallucination increased significantly with higher complexity (fixed effect) across all datasets (random effects). At minimum complexity, Sequential Decision Trees had the smallest odds (1.49, 95%CI [0.39, 5.70]) and the Variational Autoencoder the highest (14.24, 95%CI [2.12, 95.54]); complexity was positively associated with HR across generators, with odds ratios ranging from 1.03 for Bayesian Networks (95%CI [1.01, 1.05]) to 1.16 for Normalizing Flows (95%CI [1.11, 1.22]). The effect of hallucinations on downstream utility was inconsistent across generators, with no effect for 6/7 generators and a negative effect for the Generative Adversarial Network (-0.02, 95%CI [-0.03, -0.02]).
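The fixed-effect/random-effects phrasing above implies a mixed-effects logistic model of hallucination odds. The sketch below shows one possible specification, with complexity as a fixed effect and a random intercept per source dataset, using statsmodels' Bayesian binomial mixed GLM; the authors' actual model, software, and variable coding are not stated in the abstract, and the data generated here are purely illustrative.

```python
# One possible mixed-effects logistic specification (illustrative only).
import numpy as np
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

# Dummy long-format data: one row per synthetic record, with a 0/1
# hallucination flag, a complexity score, and the source dataset identifier.
rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "dataset": rng.integers(0, 12, n).astype(str),
    "complexity": rng.uniform(0, 10, n),
})
dataset_effect = {d: rng.normal(0, 0.5) for d in df["dataset"].unique()}
logit = -1.0 + 0.1 * df["complexity"] + df["dataset"].map(dataset_effect)
df["hallucinated"] = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# Complexity as a fixed effect; dataset as a random intercept (variance component).
model = BinomialBayesMixedGLM.from_formula(
    "hallucinated ~ complexity",
    {"dataset": "0 + C(dataset)"},
    df,
)
result = model.fit_vb()
print(result.summary())

# Exponentiating the complexity coefficient gives a per-unit odds ratio,
# comparable in spirit to the generator-specific odds ratios reported above.
print("odds ratio for complexity:", np.exp(result.fe_mean[1]))
```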
CONCLUSIONS: Our findings suggest that hallucinated records can form a major portion of synthetic data, with HR rising as dataset complexity increases. The rate of increase in HR varied among generators, with Normalizing Flows at the upper end. Hallucinations produced by Sequential Trees, Adversarial Random Forests, Variational Autoencoders, Normalizing Flows, and Bayesian Networks did not impact the performance of the GBDT.
Conference/Value in Health Info
2025-05, ISPOR 2025, Montréal, Quebec, CA
Value in Health, Volume 28, Issue S1
Code
P16
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas