Augmenting Small Training Datasets Improves Machine Learning Prognostic Performance

Author(s)

Dan Liu, Ph.D. (1), Samer El Kababji, Ph.D. (1), Nicholas Mitsakakis, Ph.D. (1), Lisa Pilgram, M.D. (1), Thomas Walters, M.Sc. (2), Mark Clemons, M.D. (3), Greg Pond, Ph.D. (4), Alaa El-Hussuna, Ph.D. (5), Khaled El Emam, Ph.D. (6).
(1) CHEO Research Institute, Ottawa, ON, Canada; (2) Hospital for Sick Children, Toronto, ON, Canada; (3) Ottawa Hospital Research Institute, Ottawa, ON, Canada; (4) McMaster University, Hamilton, ON, Canada; (5) OpenSourceResearch, Aalborg, Denmark; (6) University of Ottawa, Ottawa, ON, Canada.

OBJECTIVES: Small datasets are common, for example in pediatrics and rare diseases. Small sample sizes make it harder to train machine learning (ML) models, which may fail to converge and are less likely to generalize to unseen data. Data augmentation has received increasing interest as a solution to the small data challenge, especially for imaging and time series data, but it has rarely been examined for structured tabular data. This study evaluates data augmentation using generative models on tabular health data.
METHODS: We performed large-scale simulations to examine four generative models, including deep learning models, for augmenting datasets of varying sizes. We also developed a decision support tool, based on data characteristics, that helps end-users determine whether augmentation would benefit prognostic model performance before they generate additional data.
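To make the idea of augmenting tabular data concrete, the following is a minimal sketch in which synthetic rows are sampled from a simple generative model and appended to the real training data. It is an illustration only: the abstract does not name the four generative models studied, so the per-class independent Gaussian sampler here is a deliberately simple stand-in, and the function name `augment_tabular` and its parameters are hypothetical.

```python
import random
import statistics


def augment_tabular(rows, labels, n_new, seed=0):
    """Append n_new synthetic rows drawn from per-class, per-feature Gaussians.

    ASSUMPTION: this naive sampler is a stand-in for illustration, not one of
    the four generative models evaluated in the study.
    """
    rng = random.Random(seed)

    # Group the real rows by class label.
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    # Fit an independent Gaussian (mean, std) to each feature within each class.
    params = {}
    for label, group in by_class.items():
        columns = list(zip(*group))
        params[label] = [
            (statistics.fmean(col), statistics.pstdev(col)) for col in columns
        ]

    class_labels = list(by_class)
    class_weights = [len(by_class[c]) for c in class_labels]

    new_rows, new_labels = [], []
    for _ in range(n_new):
        # Sample a class in proportion to its observed frequency,
        # then sample each feature from that class's fitted Gaussian.
        label = rng.choices(class_labels, weights=class_weights, k=1)[0]
        new_rows.append([rng.gauss(mu, sigma) for mu, sigma in params[label]])
        new_labels.append(label)

    # Real rows come first, followed by the synthetic rows.
    return rows + new_rows, labels + new_labels


# Toy data: four real rows with two numeric features and a binary outcome.
real_rows = [[1.0, 10.0], [1.2, 11.0], [3.0, 20.0], [3.3, 19.0]]
real_labels = [0, 0, 1, 1]
aug_rows, aug_labels = augment_tabular(real_rows, real_labels, n_new=6)
```

In an evaluation like the one described, a prognostic model would be trained once on the real rows and once on the augmented rows, and both would be scored on held-out real data only, so that any change in AUC reflects the effect of augmentation.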
RESULTS: We found that augmentation can increase ML model performance for datasets that are smaller, more complex, more balanced, or have a lower baseline AUC. In case studies on two small oncology datasets, AUC rose from baselines of 0.7161 and 0.7171 to 0.7668 and 0.7780 after augmentation, relative improvements in ML prognostic performance of 7.08% and 8.5%, respectively.
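The reported percentages are consistent with a relative (not absolute) change in AUC, that is, (augmented - baseline) / baseline. The short check below reproduces them from the four AUC values in the abstract; the function name `relative_improvement` is hypothetical.

```python
def relative_improvement(baseline_auc, augmented_auc):
    """Return the percent change in AUC relative to the baseline."""
    return (augmented_auc - baseline_auc) / baseline_auc * 100.0


# (baseline AUC, augmented AUC) for the two oncology case studies.
case_studies = [(0.7161, 0.7668), (0.7171, 0.7780)]
improvements = [relative_improvement(b, a) for b, a in case_studies]

for (b, a), pct in zip(case_studies, improvements):
    print(f"baseline {b:.4f} -> augmented {a:.4f}: {pct:.2f}% relative improvement")
```

Under this reading the values come out to about 7.08% and 8.49%, the latter matching the reported 8.5% when rounded to one decimal place. The corresponding absolute AUC gains are 0.0507 and 0.0609.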
CONCLUSIONS: This study demonstrates that data augmentation using generative models can substantially improve prognostic performance, but only for datasets that meet baseline data size and complexity criteria. Our decision model can help analysts decide whether augmentation is likely to be useful.

Conference/Value in Health Info

2025-05, ISPOR 2025, Montréal, Quebec, CA

Value in Health, Volume 28, Issue S1

Code

MSR73

Topic

Methodological & Statistical Research

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics

Disease

SDC: Pediatrics, SDC: Rare & Orphan Diseases
