COMPARISON OF SAMPLING STRATEGIES TO ADDRESS SEVERE DATA IMBALANCE AND COMPUTATIONAL BENCHMARKING FOR TIME-TO-EVENT PREDICTIVE MODEL DEVELOPMENT ACROSS LOCAL AND HIGH-PERFORMANCE COMPUTING ENVIRONMENTS: PREDICTION OF ALCOHOL USE DISORDER AND OPIOID...
Author(s)
Allen M. Smith, PharmD1, Bradley C. Martin, RPh, PharmD, PhD2, Horacio Gomez-Acevedo, PhD3, Corey J. Hayes, MPH, PharmD, PhD2, Melody Greer, PhD3, Chenghui Li, PhD3
1University of Arkansas for Medical Sciences (UAMS), Post Doctoral Fellow, Little Rock, AR, USA, 2University of Arkansas for Medical Sciences (UAMS), Little Rock, AR, USA, 3University of Arkansas for Medical Sciences, Little Rock, AR, USA
OBJECTIVES: This study evaluates the computational demands of training landmark supermodels across three computing environments and compares sampling strategies to address severe data imbalance.
METHODS: Arkansas statewide administrative health claims from November 2018-December 2023 were used to construct discrete-time datasets. The benchmarking case study focused on predicting opioid use disorder (OUD) and alcohol use disorder (AUD) among Arkansas medical marijuana cardholders. For each outcome, five classifiers were evaluated: Random Survival Forest, Support Vector Machine Survival (SVMS), Cox Proportional Hazards (CPH), Random Forest, and Logistic Regression. Models were trained using a 50:50 train-test split, and multiple undersampling/oversampling ratios were compared (AUD: 1:1, 1:3, 1:10, 1:25, full data | OUD: 1:1, 1:3, 1:10, 1:25, 1:50, 1:100, full data). The best-performing configurations were identified using mean cumulative sensitivity/dynamic specificity area under the receiver operating characteristic curve (AUC-ROC) and the inverse probability of censoring weighted Brier score (IPCWBS). Computational performance was benchmarked across three environments: local server serial computing, local server using Apache Spark for in-memory parallelization, and the Texas A&M Accelerating Computing for Emerging Sciences (ACES) high-performance research computing (HPRC) environment. For reproducibility, local server environments were replicated within ACES using a Singularity container (16 CPU cores | 160 GB of RAM).
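The case:control undersampling step described above can be sketched as follows. This is a minimal illustrative example, not the authors' code: the function name, interface, and synthetic data are assumptions, showing only how a discrete-time dataset with a rare event could be randomly undersampled to a target ratio (e.g., 1:10) before model training.

```python
import numpy as np

def undersample_controls(y, ratio, rng=None):
    """Randomly undersample the majority (no-event) class to a target
    case:control ratio; ratio=10 retains roughly 1 case per 10 controls.
    Returns the sorted row indices of the retained sample.
    Hypothetical sketch; not the study's actual implementation."""
    rng = np.random.default_rng(rng)
    y = np.asarray(y)
    case_idx = np.flatnonzero(y == 1)   # minority class: observed events
    ctrl_idx = np.flatnonzero(y == 0)   # majority class: no event
    n_keep = min(len(ctrl_idx), ratio * len(case_idx))
    kept_ctrl = rng.choice(ctrl_idx, size=n_keep, replace=False)
    return np.sort(np.concatenate([case_idx, kept_ctrl]))

# Example: severe imbalance -- 50 events among 10,000 person-period rows
y = np.zeros(10_000, dtype=int)
y[:50] = 1
idx = undersample_controls(y, ratio=10, rng=42)
# Retained sample: all 50 cases plus 500 randomly chosen controls (1:10)
```

In a discrete-time survival setting, the same idea would typically be applied within the training split only, with the test split left at its natural prevalence so that evaluation metrics such as AUC-ROC and the IPCW Brier score reflect the true event rate.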
RESULTS: For OUD prediction, the CPH model achieved the strongest performance [AUC-ROC (95% CI)=0.7842 (0.6864, 0.8272); IPCWBS (95% CI)=0.00145 (0.00114, 0.00179)] with a 1:100 undersampling ratio. For AUD, the SVMS model performed best [AUC-ROC (95% CI)=0.7638 (0.7335, 0.7871); IPCWBS (95% CI)=0.00561 (0.00505, 0.00626)] with a 1:25 undersampling ratio. Compared with the standard local server (OUD: 476.57 minutes; AUD: 638.90 minutes), Apache Spark achieved 6.80- to 7.15-fold speedups, while the HPRC environment achieved 45.83- to 65.11-fold speedups.
CONCLUSIONS: Parallelization substantially improves development speed, particularly when performing extensive hyperparameter tuning and training computationally intensive models. Across both outcomes, moderately imbalanced random undersampling (1:10-1:100) outperformed other sampling strategies, although differences in performance across strategies were modest.
Conference/Value in Health Info
2026-05, ISPOR 2026, Philadelphia, PA, USA
Value in Health, Volume 29, Issue S6
Code
MSR157
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics