Augmenting Small Training Datasets Improves Machine Learning Prognostic Performance
Author(s)
Dan Liu, Ph.D.1, Samer El Kababji, Ph.D.1, Nicholas Mitsakakis, Ph.D.1, Lisa Pilgram, M.D.1, Thomas Walters, M.Sc.2, Mark Clemons, M.D.3, Greg Pond, Ph.D.4, Alaa El-Hussuna, Ph.D.5, Khaled El Emam, Ph.D.6.
1CHEO Research Institute, Ottawa, ON, Canada, 2Hospital for Sick Children, Toronto, ON, Canada, 3Ottawa Hospital Research Institute, Ottawa, ON, Canada, 4McMaster University, Hamilton, ON, Canada, 5OpenSourceResearch, Aalborg, Denmark, 6University of Ottawa, Ottawa, ON, Canada.
OBJECTIVES: Small datasets are common, for example in pediatrics and for rare diseases. This makes training machine learning (ML) models more challenging, as models may fail to converge and are less likely to generalize well to unseen data. Data augmentation has received increasing interest as an effective solution to the small-data challenge, especially for imaging and time series data, but it has rarely been examined for structured tabular data. This study aims to evaluate data augmentation using generative models on tabular health data.
METHODS: We performed large-scale simulations to examine four generative models, including deep learning models, for augmenting datasets of varying sizes. We also developed a decision support tool, based on data characteristics, to help end-users determine whether augmentation would benefit prognostic model performance before generating additional data.
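The abstract does not specify the authors' generative models or evaluation protocol in detail. The sketch below is a minimal, hypothetical illustration of the augment-then-evaluate workflow using scikit-learn: a per-class Gaussian mixture stands in for a tabular generative model, and a random forest stands in for the prognostic ML model. The dataset, the helper names (augment, auc_of), and all parameter choices are assumptions for illustration only, not the study's implementation.

```python
# Minimal sketch: augment a small tabular dataset with a simple generative
# model, then compare prognostic AUC with and without augmentation.
# All components here are illustrative stand-ins, not the study's models.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

# Hypothetical small "real" dataset standing in for a clinical tabular dataset
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

def augment(X_tr, y_tr, n_per_class=200):
    """Fit a simple per-class generative model and sample synthetic rows."""
    X_syn, y_syn = [], []
    for cls in np.unique(y_tr):
        gm = GaussianMixture(n_components=2, covariance_type="diag",
                             random_state=0)
        gm.fit(X_tr[y_tr == cls])
        samples, _ = gm.sample(n_per_class)
        X_syn.append(samples)
        y_syn.append(np.full(n_per_class, cls))
    # Return the real training rows plus the synthetic rows
    return np.vstack([X_tr] + X_syn), np.concatenate([y_tr] + y_syn)

def auc_of(X_tr, y_tr):
    """Train a prognostic model and evaluate AUC on the held-out real test set."""
    clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    return roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

baseline_auc = auc_of(X_train, y_train)
X_aug, y_aug = augment(X_train, y_train)
augmented_auc = auc_of(X_aug, y_aug)
print(f"baseline AUC={baseline_auc:.4f}, augmented AUC={augmented_auc:.4f}")
```

Note that the synthetic rows are added only to the training set; evaluation is always against held-out real data, mirroring the comparison of baseline versus augmented AUC reported below.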
RESULTS: We found that augmentation can increase ML model performance for datasets that are smaller, more complex, more balanced, or have a lower baseline AUC. In case studies on two small oncology datasets, AUC reached 0.7668 and 0.7780 after augmentation, relative to baseline AUCs of 0.7161 and 0.7171, improvements of 7.08% and 8.5% in ML prognostic accuracy, respectively.
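The reported percentage gains are consistent with relative improvements over the baseline AUC, i.e.

$\frac{0.7668 - 0.7161}{0.7161} \approx 7.08\%, \qquad \frac{0.7780 - 0.7171}{0.7171} \approx 8.5\%.$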
CONCLUSIONS: This study demonstrates that data augmentation using generative models can substantially improve prognostic performance, but only for datasets that meet baseline size and complexity criteria. Our decision support tool can help analysts decide whether augmentation is likely to be useful.
Conference/Value in Health Info
2025-05, ISPOR 2025, Montréal, Quebec, CA
Value in Health, Volume 28, Issue S1
Code
MSR73
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
SDC: Pediatrics, SDC: Rare & Orphan Diseases