Machine Learning Vs Traditional Statistics: Developing a Novel Proxy for HPV-Associated LA SCCHN
Speaker(s)
Shane O1, Schuldt R2, Patel A3, Fox D4, Schrader D1, Harun R5
1Genentech, South San Francisco, CA, USA, 2Genentech, Inc, San Francisco, CA, USA, 3Genentech, Inc., Chapel Hill, NC, USA, 4Genentech, San Francisco, CA, USA, 5Genentech, Inc., San Mateo, CA, USA
OBJECTIVES: Previously published real-world data (RWD) analyses in human papillomavirus (HPV)-related locally advanced squamous cell carcinoma of the head and neck (LA SCCHN) have relied on inaccurate and unreliable proxy measures of HPV status based on race, age, and tumor site. HPV status is a strong prognostic factor for overall survival and response to treatment. This analysis aimed to develop a model that predicts HPV status and can serve as a proxy for it, to improve risk stratification in RWD. The predictive performance of traditional statistical models and machine learning (ML) methods was also compared.
METHODS: Patients newly diagnosed with LA SCCHN of the oropharynx from 2010 to 2017 were identified from the SEER Head and Neck with HPV Status database. Patients were excluded if they had missing HPV status or missing data on any other covariate. The probability of being HPV-positive was modeled using logistic regression, stepwise logistic regression, LASSO, elastic net, stepwise elastic net, random forest, gradient boosting machine (GBM), and XGBoost, and the models were compared on predictive performance.
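The exclusion rule and the simplest of the listed models can be illustrated with a minimal sketch. It is not the authors' code: the field names (`age_under_65`, `male`, `hpv_positive`), the synthetic records, and the gradient-descent fitting routine are all assumptions made for illustration, and the abstract does not state which software or settings were used.

```python
import math
import random


def complete_cases(records, fields):
    """Keep only records with a non-missing value for every listed field."""
    return [r for r in records if all(r.get(f) is not None for f in fields)]


def fit_logistic(X, y, lr=0.5, epochs=300):
    """Fit unpenalized logistic regression by batch gradient descent."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted minus observed
            for j in range(p):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(w, b, x):
    """Predicted probability of being HPV-positive for one covariate vector."""
    z = b + sum(wj * xj for wj, xj in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))


# Synthetic, hypothetical records (NOT SEER data), with some HPV statuses missing.
random.seed(0)
records = []
for _ in range(400):
    young = random.random() < 0.6
    male = random.random() < 0.8
    p_hpv = 0.3 + 0.3 * young + 0.1 * male
    rec = {
        "age_under_65": int(young),
        "male": int(male),
        "hpv_positive": 1 if random.random() < p_hpv else 0,
    }
    if random.random() < 0.1:
        rec["hpv_positive"] = None  # simulate missing HPV status
    records.append(rec)

covariates = ["age_under_65", "male"]
cohort = complete_cases(records, covariates + ["hpv_positive"])
X = [[r[c] for c in covariates] for r in cohort]
y = [r["hpv_positive"] for r in cohort]
w, b = fit_logistic(X, y)
p_young_male = predict_proba(w, b, [1, 1])
p_older_male = predict_proba(w, b, [0, 1])
```

In this toy setup the fitted probabilities play the role of the proxy: each patient receives a predicted probability of being HPV-positive rather than a fixed rule-based label. The penalized and tree-based models named above would replace `fit_logistic` while keeping the same complete-case cohort.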
RESULTS: Our final analysis cohort included 13,645 patients, most of whom were male, white, younger than 65 years, and diagnosed with stage IVA cancer. AUC scores ranged from 0.556 to 0.723. GBM and XGBoost were the highest-performing models, with AUC scores of 0.722 and 0.723, respectively. All models predicted HPV status better than the two existing proxy methods. SHAP analysis showed how each predictor contributed to the model’s predictions. Tumor score, year of diagnosis, socioeconomic status, marital status, age, race, and node score all significantly contributed to HPV status predictions.
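As a reminder of what the AUC comparison measures, the sketch below computes AUC as the probability that a randomly chosen HPV-positive patient receives a higher score than a randomly chosen HPV-negative patient, with ties counted as one half. The labels, the binary rule-based proxy, and the model scores are invented toy values, not results from this study.

```python
def auc(labels, scores):
    """AUC via pairwise comparison of positives and negatives (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 = HPV-positive, 0 = HPV-negative.
labels = [1, 1, 1, 0, 0, 0, 1, 0]
# A binary rule-based proxy can only output 0 or 1, which produces many ties.
proxy_rule = [1, 0, 1, 1, 0, 0, 1, 1]
# A model outputs a continuous predicted probability.
model_scores = [0.9, 0.6, 0.7, 0.65, 0.2, 0.3, 0.8, 0.4]

proxy_auc = auc(labels, proxy_rule)    # 0.625
model_auc = auc(labels, model_scores)  # 0.9375
```

The pairwise form is quadratic in cohort size and is meant only to make the definition concrete; a rank-based implementation would be preferable for a cohort of thousands of patients.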
CONCLUSIONS: Logistic regression performed similarly to ML models. Despite being limited to 10 covariates, we were able to develop a better-performing model than existing proxy methods. Future predictive modeling studies should compare ML and traditional statistical methods in datasets with more covariates to improve predictive performance.
Code
PT15
Topic
Epidemiology & Public Health, Methodological & Statistical Research, Real World Data & Information Systems, Study Approaches
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics, Disease Classification & Coding, Registries, Reproducibility & Replicability
Disease
Infectious Disease (non-vaccine), Oncology, Reproductive & Sexual Health