Estimating Sample Size for Training Ensemble Machine Learning Models
Author(s)
Nicholas Mitsakakis, PhD, Dan Liu, PhD, Khaled El Emam, PhD.
Children's Hospital of Eastern Ontario Research Institute, Ottawa, ON, Canada.
OBJECTIVES: Machine Learning (ML) models can efficiently analyze large datasets and capture complex variable relationships, essential for health economics and outcomes research (HEOR). Currently, there is a lack of guidance to determine sample size requirements for studies using these methods.
METHODS: We addressed this gap by investigating the relationship between sample size and the performance of ML models with binary outcomes in a large-scale simulation study, and by constructing an easy-to-use calculator for determining a sufficient sample size for training these models. We used real health data as “population” datasets and trained, tuned, and assessed several ensemble ML models (LGBM, random forest, XGBoost). We repeated the process using samples of varying size drawn from the populations and compared each model’s performance with the population performance, measured by the Area Under the ROC Curve (AUC). Subsequently, we trained a large model to estimate the certainty of obtaining an adequately performing model based on the size and other characteristics of the dataset. We used these results to construct a sample size calculator, and we compared its performance against three common heuristics and one statistical approach.
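The simulation loop described above can be sketched as follows. This is a minimal illustration, not the authors' code: it treats a large synthetic dataset as the "population", fits a tree-based ensemble (random forest here, standing in for the LGBM/XGBoost/random-forest family) on samples of increasing size, and compares each sample model's AUC on held-out population data with the AUC of a model trained on the full population. Dataset dimensions and model settings are arbitrary assumptions for illustration.

```python
# Hedged sketch of the learning-curve simulation (illustrative only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)

# Synthetic stand-in for a real "population" health dataset with a binary outcome.
X, y = make_classification(n_samples=8000, n_features=20, random_state=0)
X_pop, X_test, y_pop, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# "Population" performance: a model trained on all available training data.
pop_model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_pop, y_pop)
pop_auc = roc_auc_score(y_test, pop_model.predict_proba(X_test)[:, 1])

# Repeat training on samples of varying size and compare against population AUC.
for n in [250, 1000, len(X_pop)]:
    idx = rng.choice(len(X_pop), size=n, replace=False)
    m = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_pop[idx], y_pop[idx])
    auc = roc_auc_score(y_test, m.predict_proba(X_test)[:, 1])
    print(f"n={n:5d}  AUC={auc:.3f}  fraction of population AUC={auc / pop_auc:.3f}")
```

Repeating such runs across many populations and sample sizes yields the data on which a meta-model of "certainty of adequate performance" can be trained.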
RESULTS: Our calculator was significantly more accurate than the other methods. For example, the median relative error of the sample size prediction was 10% for achieving 85% of the population performance with 90% certainty for LGBM, compared with 142,000% when using the “300 observations per variable” heuristic and 7,000% when using the “15 observations per variable” rule, both commonly suggested in the literature. Furthermore, the sample size estimators developed for regression models but often applied to ML models overestimated the required sample size by more than 4,000%.
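For context, the two "observations per variable" heuristics compared above are simple multiplicative rules. The sketch below (not the authors' calculator; the predictor count is a hypothetical example) shows how they are typically applied:

```python
# Illustration of the rule-of-thumb heuristics compared in the abstract:
# each prescribes a fixed number of observations per predictor variable.
def heuristic_sample_size(n_variables: int, obs_per_variable: int) -> int:
    """Rule-of-thumb sample size: a fixed count of observations per predictor."""
    return n_variables * obs_per_variable

n_vars = 20  # hypothetical number of predictor variables
print(heuristic_sample_size(n_vars, 15))   # "15 observations per variable" rule -> 300
print(heuristic_sample_size(n_vars, 300))  # "300 observations per variable" rule -> 6000
```

Because these rules ignore the data's signal strength, class balance, and model family, they can miss the required sample size by orders of magnitude, which is the gap the proposed calculator addresses.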
CONCLUSIONS: The proposed sample size calculator provides a more accurate approach for determining the appropriate sample size for tree-based ensemble ML models. Our methodology can guide the prioritization of HEOR studies that use ML models, ensuring efficient resource allocation to inform policy decision-making.
Conference/Value in Health Info
2025-05, ISPOR 2025, Montréal, Quebec, CA
Value in Health, Volume 28, Issue S1
Code
MSR120
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas