COMPARISON OF SAMPLING STRATEGIES TO ADDRESS SEVERE DATA IMBALANCE AND COMPUTATIONAL BENCHMARKING FOR TIME-TO-EVENT PREDICTIVE MODEL DEVELOPMENT ACROSS LOCAL AND HIGH-PERFORMANCE COMPUTING ENVIRONMENTS: PREDICTION OF ALCOHOL USE DISORDER AND OPIOID...
Author(s)
Allen M. Smith, PharmD1, Bradley C. Martin, RPh, PharmD, PhD2, Horacio Gomez-Acevedo, PhD3, Corey J. Hayes, MPH, PharmD, PhD2, Melody Greer, PhD3, Chenghui Li, PhD3
1University of Arkansas for Medical Sciences (UAMS), Post Doctoral Fellow, Little Rock, AR, USA, 2University of Arkansas for Medical Sciences (UAMS), Little Rock, AR, USA, 3University of Arkansas for Medical Sciences, Little Rock, AR, USA
OBJECTIVES: This study evaluates the computational demands of training landmark supermodels across three computing environments and compares sampling strategies to address severe data imbalance.
METHODS: Arkansas statewide administrative health claims from November 2018-December 2023 were used to construct discrete-time datasets. The benchmarking case study focused on predicting opioid use disorder (OUD) and alcohol use disorder (AUD) among Arkansas medical marijuana cardholders. For each outcome, five classifiers were evaluated: Random Survival Forest, Support Vector Machine Survival (SVMS), Cox Proportional Hazards (CPH), Random Forest, and Logistic Regression. Models were trained using a 50:50 train-test split, and multiple undersampling/oversampling ratios were compared (AUD: 1:1, 1:3, 1:10, 1:25, full data | OUD: 1:1, 1:3, 1:10, 1:25, 1:50, 1:100, full data). The best-performing configurations were identified using mean cumulative sensitivity/dynamic specificity area under the receiver operating characteristic curve (AUC-ROC) and the inverse probability of censoring weighted Brier score (IPCWBS). Computational performance was benchmarked across three environments: local server serial computing, local server using Apache Spark for in-memory parallelization, and the Texas A&M Accelerating Computing for Emerging Sciences (ACES) high-performance research computing (HPRC) environment. For reproducibility, local server environments were replicated within ACES using a Singularity container (16 CPU cores | 160 GB of RAM).
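The case:control undersampling step described above can be sketched as follows. This is a minimal illustrative example, not the authors' code: the function name, interface, and synthetic data are assumptions, showing only how a discrete-time dataset with a rare event could be randomly undersampled to a target ratio (e.g., 1:10) before model training.

```python
import numpy as np

def undersample_controls(y, ratio, rng=None):
    """Randomly undersample the majority (no-event) class to a target
    case:control ratio; ratio=10 retains roughly 1 case per 10 controls.
    Returns the sorted row indices of the retained sample.
    Hypothetical sketch; not the study's actual implementation."""
    rng = np.random.default_rng(rng)
    y = np.asarray(y)
    case_idx = np.flatnonzero(y == 1)   # minority class: observed events
    ctrl_idx = np.flatnonzero(y == 0)   # majority class: no event
    n_keep = min(len(ctrl_idx), ratio * len(case_idx))
    kept_ctrl = rng.choice(ctrl_idx, size=n_keep, replace=False)
    return np.sort(np.concatenate([case_idx, kept_ctrl]))

# Example: severe imbalance -- 50 events among 10,000 person-period rows
y = np.zeros(10_000, dtype=int)
y[:50] = 1
idx = undersample_controls(y, ratio=10, rng=42)
# Retained sample: all 50 cases plus 500 randomly chosen controls (1:10)
```

In a discrete-time survival setting, the same idea would typically be applied within the training split only, with the test split left at its natural prevalence so that evaluation metrics such as AUC-ROC and the IPCW Brier score reflect the true event rate.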
RESULTS: For OUD prediction, the CPH model achieved the strongest performance [AUC-ROC (95% CI)=0.7842 (0.6864, 0.8272); IPCWBS (95% CI)=0.00145 (0.00114, 0.00179)] with a 1:100 undersampling ratio. For AUD, the SVMS model performed best [AUC-ROC (95% CI)=0.7638 (0.7335, 0.7871); IPCWBS (95% CI)=0.00561 (0.00505, 0.00626)] with a 1:25 undersampling ratio. Compared with the standard local server (OUD: 476.57 minutes; AUD: 638.90 minutes), Apache Spark achieved 6.80- to 7.15-fold speedups, while the HPRC environment achieved 45.83- to 65.11-fold speedups.
CONCLUSIONS: Parallelization substantially improves development speed, particularly when performing extensive hyperparameter tuning and training computationally intensive models. Across both outcomes, moderately imbalanced random undersampling (1:10-1:100) outperformed other sampling strategies, although differences in performance across strategies were modest.
Conference/Value in Health Info
2026-05, ISPOR 2026, Philadelphia, PA, USA
Value in Health, Volume 29, Issue S6
Code
MSR157
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics