Data Sampling Methods for Imbalanced Classification: A Random Forest Study for Predicting Treatment Switching in Multiple Sclerosis

Author(s)

Li J, Huang Y, Aparasu RR
University of Houston, College of Pharmacy, Houston, TX, USA

OBJECTIVES: Imbalanced data remain a challenge when applying random forest (RF) algorithms in healthcare research. This study evaluated different sampling methods for RF models predicting treatment switching among patients with multiple sclerosis (MS).

METHODS: This study used electronic medical records from TriNetX (September 2010 to May 2017) for adults with ≥1 disease-modifying agent (DMA) and ≥1 MS diagnosis. The date of the earliest DMA was assigned as the index date, and patients who received a DMA other than their index DMA during follow-up were classified as switched. Patients were also required to have ≥1 outpatient visit and ≥1 prescription in the 12 months pre-index and 24 months post-index. RF models with 72 baseline variables were trained on a random 70% split of the data. Three sampling methods were evaluated: up-sampling, down-sampling, and the synthetic minority over-sampling technique (SMOTE). RF classifiers were fitted and their parameters tuned on the resampled data. Model performance under each sampling method was assessed using the area under the curve (AUC), accuracy, recall, and F1 score.
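The three sampling strategies named in the methods can be illustrated with a minimal, standard-library sketch. This is not the authors' implementation: the function names (`up_sample`, `down_sample`, `smote_like`), the two-feature toy data, and the choice of `k=5` neighbours are all assumptions made for illustration, and `smote_like` is a simplified interpolation in the spirit of SMOTE rather than a reference implementation.

```python
import random

def _split_indices(y, minority):
    """Return (minority_idx, majority_idx) for a list of labels."""
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    majority_idx = [i for i, label in enumerate(y) if label != minority]
    return minority_idx, majority_idx

def up_sample(X, y, minority=1, seed=0):
    """Duplicate random minority rows (with replacement) until classes balance."""
    rng = random.Random(seed)
    minority_idx, majority_idx = _split_indices(y, minority)
    n_extra = len(majority_idx) - len(minority_idx)
    extra = [rng.choice(minority_idx) for _ in range(n_extra)]
    keep = majority_idx + minority_idx + extra
    return [X[i] for i in keep], [y[i] for i in keep]

def down_sample(X, y, minority=1, seed=0):
    """Keep a random subset of majority rows equal in size to the minority."""
    rng = random.Random(seed)
    minority_idx, majority_idx = _split_indices(y, minority)
    keep = rng.sample(majority_idx, len(minority_idx)) + minority_idx
    return [X[i] for i in keep], [y[i] for i in keep]

def smote_like(X, y, minority=1, k=5, seed=0):
    """Add synthetic minority rows by interpolating between a minority row
    and one of its k nearest minority neighbours (simplified SMOTE idea).
    Assumes numeric features and at least two minority rows."""
    rng = random.Random(seed)
    minority_idx, majority_idx = _split_indices(y, minority)
    minority_rows = [X[i] for i in minority_idx]
    synthetic = []
    for _ in range(len(majority_idx) - len(minority_rows)):
        base = rng.choice(minority_rows)
        # Rank the other minority rows by squared Euclidean distance to base.
        neighbours = sorted(
            (row for row in minority_rows if row is not base),
            key=lambda row: sum((a - b) ** 2 for a, b in zip(base, row)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return X + synthetic, y + [minority] * len(synthetic)

# Hypothetical toy data with roughly the 84/16 class ratio reported in the study.
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(84)]
X += [[data_rng.gauss(1, 1), data_rng.gauss(1, 1)] for _ in range(16)]
y = [0] * 84 + [1] * 16

X_up, y_up = up_sample(X, y)
X_down, y_down = down_sample(X, y)
X_sm, y_sm = smote_like(X, y)
```

In a workflow like the one described, resampling would typically be applied to the training split only, so that the held-out data keep the original class distribution.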

RESULTS: The analytical sample consisted of 6,097 (84.0%) unswitched and 1,161 (16.0%) switched patients with MS. The three leading factors associated with treatment switching were age, type of index DMA, and year of the index date. The up-sampling method achieved the best model performance with an AUC of 0.65 (accuracy 61%, recall 60%, and F1 score 72%), followed by down-sampling with an AUC of 0.63 (accuracy 62%, recall 63%, and F1 score 74%), and SMOTE with an AUC of 0.60 (accuracy 80%, recall 94%, and F1 score 89%).
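To make the reported metrics concrete, the sketch below computes accuracy, recall, F1 score, and AUC from scratch on hypothetical data. The helper names (`classification_metrics`, `auc`) and the toy predictions are assumptions for illustration and do not reproduce the study's results. The abstract does not state which class was treated as positive, so the `positive` parameter is exposed to show how much recall and F1 depend on that choice under an 84/16 split.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, recall, f1) treating `positive` as the positive class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    accuracy = (tp + tn) / len(y_true)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, recall, f1

def auc(y_true, scores, positive=1):
    """AUC as the probability that a random positive case receives a higher
    score than a random negative case, counting ties as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical extreme case: a model that never predicts a switch (label 1).
y_true = [0] * 84 + [1] * 16
y_pred = [0] * 100
scores = [0.1] * 100  # constant score, so the model has no ranking ability

acc_sw, rec_sw, f1_sw = classification_metrics(y_true, y_pred, positive=1)
acc_un, rec_un, f1_un = classification_metrics(y_true, y_pred, positive=0)
auc_sw = auc(y_true, scores, positive=1)
```

In this toy case accuracy is identical under both labelings, while recall and F1 are zero when switching is the positive class and high when non-switching is. The AUC reflects that a constant score cannot rank patients at all, which is one reason AUC is a useful companion to accuracy on imbalanced data.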

CONCLUSIONS: All sampling methods alleviated the data imbalance problem. However, up-sampling yielded the highest AUC among the methods for predicting treatment switching in MS. Multiple sampling methods should therefore be evaluated, taking the extent of imbalance into account, to improve the performance of RF models.

Conference/Value in Health Info

2022-05, ISPOR 2022, Washington, DC, USA

Value in Health, Volume 25, Issue 6, S1 (June 2022)

Code

MSR32

Topic

Epidemiology & Public Health, Methodological & Statistical Research, Patient-Centered Research, Study Approaches

Topic Subcategory

Adherence, Persistence, & Compliance, Artificial Intelligence, Machine Learning, Predictive Analytics, Electronic Medical & Health Records, Safety & Pharmacoepidemiology

Disease

Neurological Disorders
