COMPARING THE PERFORMANCE OF LOGISTIC REGRESSION VERSUS MACHINE LEARNING TREE-BASED CLASSIFICATION METHODS IN PREDICTING RARE OUTCOMES USING REAL-WORLD OBSERVATIONAL DATA
Author(s)
ABSTRACT WITHDRAWN
OBJECTIVES: This study compared the performance of logistic regression with that of tree-based machine learning classification methods in predicting the rare outcome of bowel impaction (BI) among patients with chronic constipation (CC).

METHODS: Adult patients with at least 2 CC diagnoses during 1/1/2012-9/30/2018 (first diagnosis=index) were selected from the IBM® MarketScan® Research Databases. Patients were continuously enrolled for >1 year before (baseline) and after (follow-up) the index date and had no evidence of CC due to IBS, pregnancy, or chronic opioid use. Patients treated with lubiprostone or linaclotide were identified, and the primary outcome was evidence of BI during follow-up in both treated and untreated groups. Demographics were assessed on the index date and clinical characteristics during baseline. Training and test datasets were created using random undersampling (RUS) of patients in the majority class (non-treated CC) to mitigate the effects of severe class imbalance. Logistic regression was compared with three tree-based machine learning techniques: Random Forests (RF), Gradient Boosted Trees (GBT), and Ensemble Gradient Boosted Trees (EGBT). Class ratios in the five training datasets ranged from balanced (1:1) to highly imbalanced (1:9). Each model was run 10 times with 30 variables to predict the risk of BI.

RESULTS: A total of 10,093 treated and 100,930 untreated CC patients were included. Across all comparisons, EGBT models exhibited the best performance, particularly with a 1:9 sampling ratio (AUC=0.820), followed by GBT (AUC=0.802), RF (AUC=0.799), and logistic regression (AUC=0.724). Top predictive characteristics included age (>60), female sex, comorbidity index, treatment (lubiprostone or linaclotide), and chronic obstructive pulmonary disease.

CONCLUSIONS: Machine learning combined with RUS can be an effective approach for predicting rare outcomes. The class ratio applied to the training datasets affects both outcome predictions and overall model performance.
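The two mechanics the abstract relies on, random undersampling to a fixed class ratio and the AUC metric, can be illustrated with a minimal sketch. This is not the authors' code. The function names `undersample` and `auc`, the synthetic labels, and the intermediate ratios 1:3, 1:5, and 1:7 are assumptions made for illustration; the abstract states only that five ratios between 1:1 and 1:9 were used.

```python
import random


def undersample(labels, ratio, seed=0):
    """Keep every minority row (label 1) and a random subset of
    majority rows (label 0) so the classes stand at 1:ratio.
    Returns the sorted indices of the retained rows."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    # Never request more majority rows than actually exist.
    n_keep = min(len(majority), ratio * len(minority))
    kept = rng.sample(majority, n_keep)
    return sorted(minority + kept)


def auc(labels, scores):
    """Rank-based AUC: the probability that a randomly chosen positive
    scores higher than a randomly chosen negative (ties count as half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (not study data): 20 minority rows among 1,000.
labels = [1] * 20 + [0] * 980

# Assumed ratio grid; only the endpoints 1:1 and 1:9 come from the abstract.
for ratio in (1, 3, 5, 7, 9):
    idx = undersample(labels, ratio)
    n_pos = sum(labels[i] for i in idx)
    print(f"1:{ratio} -> {n_pos} minority rows, {len(idx) - n_pos} majority rows")
```

In a full comparison, each undersampled training set would be passed to each classifier and `auc` evaluated on a held-out test set, so that differences in AUC across ratios can be attributed to the sampling ratio rather than to the evaluation data.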
Conference/Value in Health Info
2020-05, ISPOR 2020, Orlando, FL, USA
Value in Health, Volume 23, Issue 5, S1 (May 2020)
Code
PGI32
Topic
Epidemiology & Public Health, Methodological & Statistical Research, Organizational Practices
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics, Best Research Practices
Disease
Gastrointestinal Disorders