Improving Efficiency in Analysis of Real-World Data with an Automated Machine Learning Tool

Speaker(s)

Zhang Y1, Lo-Ciganic WH2, Xie H3, Iyer R1, Snyder D1, Lineman P1, Tian MY4
1Teva Branded Pharmaceutical Products R&D, West Chester, PA, USA, 2University of Pittsburgh, Pittsburgh, PA, USA, 3Teva Branded Pharmaceutical Products R&D, Inc., Blue Bell, PA, USA, 4Teva Branded Pharmaceutical Products R&D, Skillman, NJ, USA

OBJECTIVES: Machine learning (ML) has demonstrated advantages in handling large, complex healthcare real-world data (RWD) to extract patterns and insights and to inform decisions. However, applying ML requires sophisticated analytical and programming skills; tools that automatically implement different ML methods can greatly accelerate real-world evidence (RWE) generation. We evaluated the validity and efficiency of an automated ML tool compared with a traditional ML approach in an RWD analysis.

METHODS: AutoML is a point-and-click ML tool on a cloud-based platform by Databricks Inc. that automates the modeling process, including identifying feature types, fine-tuning hyperparameters, and training and validating ML algorithms. We used this tool in a case study to build predictive models of treatment instability in patients with schizophrenia initiating oral antipsychotics, using the 2012-2022 Merative™ MarketScan® claims databases. We trained and validated three ML methods (elastic net, random forest, and XGBoost) with AutoML and with a conventional self-coding ML approach in Python 3.8.10, and compared model performance and processing time between the two approaches.
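The self-coding arm described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the data are synthetic (generated with scikit-learn's make_classification) rather than MarketScan claims, scikit-learn's GradientBoostingClassifier stands in for XGBoost, and all hyperparameters and variable names are assumptions.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the claims-based feature matrix (assumption):
# roughly 80% of rows carry the positive label.
X, y = make_classification(
    n_samples=1000, n_features=30, n_informative=8,
    weights=[0.2, 0.8], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Three model families; gradient boosting is a stand-in for XGBoost.
models = {
    "elastic_net": make_pipeline(
        StandardScaler(),
        LogisticRegression(
            penalty="elasticnet", solver="saga", l1_ratio=0.5, max_iter=5000
        ),
    ),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Fit each model, record the held-out C-statistic (ROC AUC) and fit time.
results = {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_train, y_train)
    elapsed = time.perf_counter() - start
    scores = model.predict_proba(X_test)[:, 1]
    results[name] = {
        "c_statistic": roc_auc_score(y_test, scores),
        "fit_seconds": elapsed,
    }

# Pick the model with the highest C-statistic, mirroring the selection rule.
best = max(results, key=lambda n: results[n]["c_statistic"])
for name, r in results.items():
    print(f"{name}: C-statistic={r['c_statistic']:.3f}, fit={r['fit_seconds']:.2f}s")
print("best:", best)
```

A full analysis would also tune hyperparameters (for example with cross-validation), which is where most of the manual coding effort and run time in a self-coding workflow tends to go.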

RESULTS: The analysis cohort included 4,671 adults; 80.9% of patients had treatment instability. With 1,549 claims-based features included in model development, AutoML selected XGBoost as the best model based on the highest C-statistic (0.64 vs. 0.58-0.62 for the other methods), with high precision (0.87). The self-coding ML approach yielded similar prediction performance (C-statistic 0.61-0.64). The two approaches identified similar important features based on SHAP values (e.g., emergency room visits). AutoML required only 16% of the computational time of the self-coding ML approach (2 vs. 12 hours).
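For readers less familiar with the two performance metrics reported above, the following minimal sketch computes a C-statistic (the share of positive/negative pairs that the model ranks correctly, with ties counted as half) and precision on a toy example. The labels, scores, and the 0.5 classification threshold are illustrative assumptions, not study data.

```python
def c_statistic(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    concordant = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5
    return concordant / (len(pos) * len(neg))


def precision(y_true, y_pred):
    """True positives divided by all predicted positives."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    if tp + fp == 0:
        raise ValueError("no positive predictions")
    return tp / (tp + fp)


# Toy data (assumption): 1 = treatment instability, 0 = no instability.
y_true = [1, 1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # hypothetical 0.5 threshold

print(c_statistic(y_true, scores))  # 0.75
print(precision(y_true, y_pred))    # 0.75
```

Because precision depends on outcome prevalence, a classifier that labeled every patient as unstable in a cohort with 80.9% prevalence would already reach a precision of 0.809; the reported precision is best read against that baseline.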

CONCLUSIONS: Our case study showed that the automated ML tool has the potential to democratize and augment ML applications in RWD analysis. It can generate predictive models similar to those from conventional self-coding ML approaches, but with greater efficiency. Furthermore, with transparent source code and explanations of results, it facilitates subsequent optimization of ML analyses with RWD.

Code

MSR63

Topic

Methodological & Statistical Research

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics

Disease

Neurological Disorders, No Additional Disease & Conditions/Specialized Treatment Areas