Augmenting Small Training Datasets Improves Machine Learning Prognostic Performance
Author(s)
Dan Liu, Ph.D.1, Samer El Kababji, Ph.D.1, Nicholas Mitsakakis, Ph.D.1, Lisa Pilgram, M.D.1, Thomas Walters, M.Sc.2, Mark Clemons, M.D.3, Greg Pond, Ph.D.4, Alaa El-Hussuna, Ph.D.5, Khaled El Emam, Ph.D.6.
1CHEO Research Institute, Ottawa, ON, Canada, 2Hospital for Sick Children, Toronto, ON, Canada, 3Ottawa Hospital Research Institute, Ottawa, ON, Canada, 4McMaster University, Hamilton, ON, Canada, 5OpenSourceResearch, Aalborg, Denmark, 6University of Ottawa, Ottawa, ON, Canada.
OBJECTIVES: Small datasets are common, for example in pediatrics and for rare diseases. This makes training machine learning (ML) models more challenging, as models may fail to converge and are less likely to generalize well to unseen data. Data augmentation has received increasing interest as an effective solution to the small-data challenge, especially for imaging and time series data, but it has rarely been examined for structured tabular data. This study aims to evaluate data augmentation using generative models on tabular health data.
METHODS: We performed large-scale simulations to examine four generative models, including deep learning models, for augmenting datasets of varying sizes. We also developed a decision support tool, based on data characteristics, to help end-users determine whether augmentation would benefit prognostic model performance before generating additional data.
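The abstract does not specify the authors' generative models or evaluation protocol in detail. The sketch below is a minimal, hypothetical illustration of the augment-then-evaluate workflow using scikit-learn: a per-class Gaussian mixture stands in for a tabular generative model, and a random forest stands in for the prognostic ML model. The dataset, the helper names (augment, auc_of), and all parameter choices are assumptions for illustration only, not the study's implementation.

```python
# Minimal sketch: augment a small tabular dataset with a simple generative
# model, then compare prognostic AUC with and without augmentation.
# All components here are illustrative stand-ins, not the study's models.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

# Hypothetical small "real" dataset standing in for a clinical tabular dataset
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

def augment(X_tr, y_tr, n_per_class=200):
    """Fit a simple per-class generative model and sample synthetic rows."""
    X_syn, y_syn = [], []
    for cls in np.unique(y_tr):
        gm = GaussianMixture(n_components=2, covariance_type="diag",
                             random_state=0)
        gm.fit(X_tr[y_tr == cls])
        samples, _ = gm.sample(n_per_class)
        X_syn.append(samples)
        y_syn.append(np.full(n_per_class, cls))
    # Return the real training rows plus the synthetic rows
    return np.vstack([X_tr] + X_syn), np.concatenate([y_tr] + y_syn)

def auc_of(X_tr, y_tr):
    """Train a prognostic model and evaluate AUC on the held-out real test set."""
    clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    return roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

baseline_auc = auc_of(X_train, y_train)
X_aug, y_aug = augment(X_train, y_train)
augmented_auc = auc_of(X_aug, y_aug)
print(f"baseline AUC={baseline_auc:.4f}, augmented AUC={augmented_auc:.4f}")
```

Note that the synthetic rows are added only to the training set; evaluation is always against held-out real data, mirroring the comparison of baseline versus augmented AUC reported below.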
RESULTS: We found that augmentation can increase ML model performance for datasets that are smaller, more complex, more balanced, or have a lower baseline AUC. In case studies on two small oncology datasets, AUC reached 0.7668 and 0.7780 after augmentation, relative to baseline AUCs of 0.7161 and 0.7171, improvements of 7.08% and 8.5% in ML prognostic accuracy, respectively.
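The reported percentage gains are consistent with relative improvements over the baseline AUC, i.e.

$\frac{0.7668 - 0.7161}{0.7161} \approx 7.08\%, \qquad \frac{0.7780 - 0.7171}{0.7171} \approx 8.5\%.$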
CONCLUSIONS: This study demonstrates that data augmentation using generative models can substantially improve prognostic performance, but only for datasets that meet baseline size and complexity criteria. Our decision support tool can help analysts decide whether augmentation is likely to be useful.
Conference/Value in Health Info
2025-05, ISPOR 2025, Montréal, Quebec, CA
Value in Health, Volume 28, Issue S1
Code
MSR73
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
SDC: Pediatrics, SDC: Rare & Orphan Diseases