Data Sampling Methods for Imbalanced Classification: A Random Forest Study for Predicting Treatment Switching in Multiple Sclerosis
Author(s)
Li J, Huang Y, Aparasu RR
University of Houston, College of Pharmacy, Houston, TX, USA
OBJECTIVES: Imbalanced data remains a challenge for utilizing random forest (RF) algorithms in healthcare research. This study evaluated different sampling methods in RF models for predicting treatment switching among patients with multiple sclerosis (MS).
METHODS: This study used TriNetX electronic medical records (September 2010 to May 2017) for adults with ≥1 disease-modifying agent (DMA) and ≥1 MS diagnosis. The earliest DMA date was assigned as the index date, and patients who received a DMA other than their index DMA during follow-up were classified as switched. Patients were also required to have ≥1 outpatient visit and ≥1 prescription in the 12 months pre-index and the 24 months post-index. RF models with 72 baseline variables were trained on a randomly selected 70% of the data. Three sampling methods were evaluated: up-sampling, down-sampling, and the synthetic minority over-sampling technique (SMOTE). RF classifiers were fitted and tuned on the resampled data. Model performance under each sampling method was assessed using the area under the curve (AUC), accuracy, recall, and F1 score.
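The three sampling methods named above can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the function names, the toy 84/16 data set, the 0/1 label coding (1 = switched), and the choice of k = 5 neighbours are all assumptions made for illustration. The SMOTE-style function also assumes purely numeric features, and the sketch assumes resampling is applied to the training portion only.

```python
import random

def split_by_class(X, y):
    """Split parallel lists into (minority rows, majority rows); label 1 is assumed to be the minority."""
    mino = [x for x, lab in zip(X, y) if lab == 1]
    majo = [x for x, lab in zip(X, y) if lab == 0]
    return mino, majo

def combine(mino, majo):
    """Rebuild feature and label lists from the two class groups."""
    return mino + majo, [1] * len(mino) + [0] * len(majo)

def sq_dist(a, b):
    """Squared Euclidean distance between two numeric rows."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def up_sample(X, y, seed=0):
    """Up-sampling: duplicate random minority rows until the classes are equal in size."""
    rng = random.Random(seed)
    mino, majo = split_by_class(X, y)
    extra = [rng.choice(mino) for _ in range(len(majo) - len(mino))]
    return combine(mino + extra, majo)

def down_sample(X, y, seed=0):
    """Down-sampling: keep a random subset of majority rows equal in size to the minority class."""
    rng = random.Random(seed)
    mino, majo = split_by_class(X, y)
    return combine(mino, rng.sample(majo, len(mino)))

def smote_like(X, y, k=5, seed=0):
    """SMOTE-style over-sampling: interpolate between a minority row and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    mino, majo = split_by_class(X, y)
    synthetic = []
    for _ in range(len(majo) - len(mino)):
        base = rng.choice(mino)
        neighbours = sorted(
            (m for m in mino if m is not base),
            key=lambda m: sq_dist(base, m),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return combine(mino + synthetic, majo)

# Toy training set with roughly the imbalance reported in the abstract (84% vs 16%).
toy_rng = random.Random(42)
X = [[toy_rng.gauss(0, 1), toy_rng.gauss(0, 1)] for _ in range(84)]
X += [[toy_rng.gauss(1, 1), toy_rng.gauss(1, 1)] for _ in range(16)]
y = [0] * 84 + [1] * 16

X_up, y_up = up_sample(X, y)
X_down, y_down = down_sample(X, y)
X_sm, y_sm = smote_like(X, y)

for name, labels in [("up", y_up), ("down", y_down), ("smote", y_sm)]:
    print(name, "majority:", labels.count(0), "minority:", labels.count(1))
```

In practice a library such as imbalanced-learn (`RandomOverSampler`, `RandomUnderSampler`, `SMOTE`) would typically replace these hand-written helpers; the abstract does not state which software was used.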
RESULTS: The analytical sample consisted of 6,097 (84.0%) unswitched and 1,161 (16.0%) switched patients with MS. The three leading factors associated with treatment switching were age, type of index DMA, and year of the index date. Up-sampling achieved the best model performance, with an AUC of 0.65 (accuracy 61%, recall 60%, F1 score 72%), followed by down-sampling, with an AUC of 0.63 (accuracy 62%, recall 63%, F1 score 74%), and SMOTE, with an AUC of 0.60 (accuracy 80%, recall 94%, F1 score 89%).
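The reported metrics can be made concrete with a short sketch of how accuracy, recall, F1 score, and AUC are commonly computed. The function names are hypothetical, the positive class is assumed to be "switched" (coded 1), and AUC is assumed to mean the area under the ROC curve; the abstract states none of these. The baseline below uses only the class counts given above, applied to the full sample for illustration, to show why accuracy alone can mislead on imbalanced data.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

def roc_auc(y_true, scores, positive=1):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Class counts from the abstract: 6,097 unswitched (0) and 1,161 switched (1).
y_true = [0] * 6097 + [1] * 1161

# A trivial classifier that predicts "unswitched" for every patient.
always_unswitched = [0] * len(y_true)
baseline = classification_metrics(y_true, always_unswitched)
print(baseline)
```

Under these assumptions the trivial classifier reaches roughly 84% accuracy while identifying no switched patients, which is one reason a ranking measure such as AUC is often preferred when comparing models on imbalanced data.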
CONCLUSIONS: All sampling methods alleviated data imbalance problems. However, up-sampling yielded the highest AUC among the methods evaluated for predicting treatment switching in MS. Therefore, multiple sampling methods should be evaluated, taking the extent of imbalance into account, to improve the performance of RF models.
Conference/Value in Health Info
Value in Health, Volume 25, Issue 6, S1 (June 2022)
Code
MSR32
Topic
Epidemiology & Public Health, Methodological & Statistical Research, Patient-Centered Research, Study Approaches
Topic Subcategory
Adherence, Persistence, & Compliance, Artificial Intelligence, Machine Learning, Predictive Analytics, Electronic Medical & Health Records, Safety & Pharmacoepidemiology
Disease
Neurological Disorders