Program
In-person AND virtual! We are pioneering a new conference format that connects in-person and virtual audiences to create a unique experience. Matching the innovation in our members’ work, ISPOR is pushing boundaries to design an event that works in today’s quickly changing environment.
In-person registration includes the full virtual experience. Virtual-only attendees will be able to tune in to live in-person sessions, watch recorded in-person sessions on demand, and attend a variety of virtual-only sessions.
Data Sampling Methods for Imbalanced Classification: A Random Forest Study for Predicting Treatment Switching in Multiple Sclerosis
Speaker(s)
Li J, Huang Y, Aparasu RR
University of Houston, College of Pharmacy, Houston, TX, USA
OBJECTIVES: Imbalanced data remains a challenge for utilizing random forest (RF) algorithms in healthcare research. This study evaluated different sampling methods in RF models for predicting treatment switching among patients with multiple sclerosis (MS).
METHODS: This study used electronic medical records from TriNetX (September 2010 to May 2017) for adults with ≥1 disease-modifying agent (DMA) prescription and ≥1 MS diagnosis. The earliest DMA date was assigned as the index date, and patients who received a DMA other than their index DMA during follow-up were classified as switched. Patients were also required to have ≥1 outpatient visit and ≥1 prescription in the 12 months pre-index and the 24 months post-index. RF models with 72 baseline variables were trained on a random 70% split of the data. Three sampling methods were evaluated: up-sampling, down-sampling, and the synthetic minority over-sampling technique (SMOTE). RF classifiers were fitted, and their parameters tuned, on the resampled data. Model performance under each sampling method was assessed using the area under the curve (AUC), accuracy, recall, and F1 score.
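The three sampling methods named above can be illustrated with a minimal sketch. It uses only the Python standard library and a synthetic toy dataset with an 84/16 class split; the function names (`up_sample`, `down_sample`, `smote_like`, `make_toy_data`) are hypothetical, and `smote_like` is a simplified interpolation in the spirit of SMOTE, not the implementation used in the study, whose software the abstract does not state.

```python
import math
import random
from collections import Counter


def _group_by_class(rows, labels):
    """Collect rows into a dict keyed by class label."""
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    return by_class


def up_sample(rows, labels, rng):
    """Duplicate randomly chosen rows until every class matches the largest class."""
    by_class = _group_by_class(rows, labels)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        out_rows.extend(members + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels


def down_sample(rows, labels, rng):
    """Keep a random subset of each class, sized to the smallest class."""
    by_class = _group_by_class(rows, labels)
    target = min(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        out_rows.extend(rng.sample(members, target))
        out_labels.extend([label] * target)
    return out_rows, out_labels


def smote_like(rows, labels, minority, rng, k=5):
    """Simplified SMOTE for two classes: create synthetic minority rows by
    interpolating between a minority row and one of its k nearest minority
    neighbours, until both classes have the same size."""
    minority_rows = [r for r, lab in zip(rows, labels) if lab == minority]
    n_needed = (len(rows) - len(minority_rows)) - len(minority_rows)
    synthetic = []
    for _ in range(n_needed):
        base = rng.choice(minority_rows)
        others = [r for r in minority_rows if r is not base]
        neighbours = sorted(others, key=lambda r: math.dist(base, r))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return rows + synthetic, labels + [minority] * len(synthetic)


def make_toy_data(rng, n_majority=84, n_minority=16):
    """Synthetic two-feature data; NOT the study data."""
    rows = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(n_majority)]
    rows += [[rng.gauss(1, 1), rng.gauss(1, 1)] for _ in range(n_minority)]
    return rows, [0] * n_majority + [1] * n_minority


rng = random.Random(0)
rows, labels = make_toy_data(rng)
for name, result in [
    ("up-sampling", up_sample(rows, labels, rng)),
    ("down-sampling", down_sample(rows, labels, rng)),
    ("SMOTE-like", smote_like(rows, labels, 1, rng)),
]:
    print(name, dict(Counter(result[1])))
```

Each method returns a balanced set: up-sampling and the SMOTE-like variant grow the minority class to the size of the majority, while down-sampling shrinks the majority to the size of the minority and so discards data.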
RESULTS: The analytical sample consisted of 6,097 (84.0%) unswitched and 1,161 (16.0%) switched patients with MS. The three leading factors associated with treatment switching were age, type of index DMA, and year of the index date. The up-sampling method achieved the best model performance with an AUC of 0.65 (accuracy 61%, recall 60%, and F1 score 72%), followed by down-sampling with an AUC of 0.63 (accuracy 62%, recall 63%, and F1 score 74%), and SMOTE with an AUC of 0.60 (accuracy 80%, recall 94%, and F1 score 89%).
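For readers unfamiliar with the reported metrics, the following minimal sketch shows how accuracy, recall, and F1 score follow from confusion-matrix counts. The function name `classification_metrics` is hypothetical, and treating "switched" as the positive class is an assumption, since the abstract does not say which class was treated as positive. The example applies the formulas to a hypothetical baseline that labels every patient as unswitched, using the class counts reported above.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, recall, precision, and F1 from confusion-matrix counts.

    Ratios with a zero denominator are reported as 0.0.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f1": f1}


# Hypothetical baseline: predict "unswitched" for all 7,258 patients.
# Assumption: "switched" (n = 1,161) is the positive class.
baseline = classification_metrics(tp=0, fp=0, tn=6097, fn=1161)
print({name: round(value, 3) for name, value in baseline.items()})
```

Under these assumptions the baseline reaches about 84% accuracy with zero recall and a zero F1 score, which illustrates why accuracy alone is hard to interpret on imbalanced data.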
CONCLUSIONS: All sampling methods alleviated the data imbalance problem. However, the up-sampling method achieved the highest AUC of the methods compared for predicting treatment switching in MS. Multiple sampling methods should therefore be evaluated, based on the extent of imbalance, to improve the performance of RF models.
Code
MSR32
Topic
Epidemiology & Public Health, Methodological & Statistical Research, Patient-Centered Research, Study Approaches
Topic Subcategory
Adherence, Persistence, & Compliance; Artificial Intelligence, Machine Learning, Predictive Analytics; Electronic Medical & Health Records; Safety & Pharmacoepidemiology
Disease
Neurological Disorders