Machine Learning Vs Traditional Statistics: Developing a Novel Proxy for HPV-Associated LA SCCHN

Speaker(s)

Shane O1, Schuldt R2, Patel A3, Fox D4, Schrader D1, Harun R5
1Genentech, South San Francisco, CA, USA; 2Genentech, Inc., San Francisco, CA, USA; 3Genentech, Inc., Chapel Hill, NC, USA; 4Genentech, San Francisco, CA, USA; 5Genentech, Inc., San Mateo, CA, USA

OBJECTIVES: Previously published real-world data (RWD) analyses in human papillomavirus (HPV)-related locally advanced squamous cell carcinoma of the head and neck (LA SCCHN) have relied on inaccurate and unreliable proxy measures of HPV status based on race, age, and tumor site. HPV status is a strong prognostic factor for overall survival and response to treatment. This analysis aimed to develop a model that predicts HPV status and can serve as its proxy to improve risk stratification in RWD. The predictive performance of traditional statistical models and machine learning (ML) methods was also compared.

METHODS: Patients newly diagnosed with LA SCCHN of the oropharynx between 2010 and 2017 were identified from the SEER Head and Neck with HPV Status database. Patients were excluded if they had missing HPV status or missing data on other covariates. The probability of being HPV-positive was modeled using logistic regression, stepwise logistic regression, LASSO, elastic net, stepwise elastic net, random forest, gradient boosting machine (GBM), and XGBoost, and the models were compared.
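
As an illustration of the kind of comparison described in the methods, the following is a minimal sketch that fits a traditional model and two ML models to the same data and scores each by AUC on a held-out split. It assumes scikit-learn, uses synthetic stand-in data rather than the SEER cohort, and covers only three of the eight model types; the sample size, split, and hyperparameters are illustrative assumptions, not the authors' settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with 10 covariates and a binary outcome.
# This is NOT the SEER cohort; it only mimics the shape of the problem.
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# One traditional model and two ML models (hyperparameters are assumptions).
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "gbm": GradientBoostingClassifier(random_state=0),
}

aucs = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Predicted probability of the positive class (here: "HPV-positive").
    prob_positive = model.predict_proba(X_test)[:, 1]
    aucs[name] = roc_auc_score(y_test, prob_positive)

# Report models from lowest to highest held-out AUC.
for name, value in sorted(aucs.items(), key=lambda kv: kv[1]):
    print(f"{name}: AUC = {value:.3f}")
```

Scoring every model on the same held-out rows keeps the comparison like-for-like; a cross-validated estimate would be a reasonable alternative.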

RESULTS: Our final analysis cohort included 13,645 patients, the majority of whom were male, white, younger than 65 years, and had stage IVA disease. AUC scores ranged from 0.556 to 0.723. GBM and XGBoost were the highest-performing models, with AUC scores of 0.722 and 0.723, respectively. All models predicted HPV status better than the two existing proxy methods. SHAP analysis showed how each predictor contributed to the model’s predictions. Tumor score, year of diagnosis, socioeconomic status, marital status, age, race, and node score all significantly contributed to HPV status predictions.
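
To make the AUC figures above easier to interpret, here is a small sketch of the metric itself: the probability that a randomly chosen HPV-positive patient receives a higher score than a randomly chosen HPV-negative patient, with ties counted as one half. The labels and scores below are toy values invented for illustration, and `auc` is a hypothetical helper, not code from the study.

```python
def auc(labels, scores):
    """Rank-based AUC: share of (positive, negative) pairs ordered correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    # A correctly ordered pair counts 1, a tie counts 0.5, a reversed pair 0.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data (assumed for illustration only): 1 = HPV-positive, 0 = HPV-negative.
labels = [1, 1, 1, 0, 0, 0]
binary_proxy = [1, 0, 1, 1, 0, 0]                 # rule-based yes/no proxy
model_score = [0.9, 0.6, 0.7, 0.65, 0.2, 0.3]     # predicted probability

print(f"binary proxy AUC: {auc(labels, binary_proxy):.3f}")
print(f"model score AUC:  {auc(labels, model_score):.3f}")
```

A yes/no proxy produces many tied pairs, which is one reason a continuous predicted probability can discriminate better than a rule-based proxy on the same patients.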

CONCLUSIONS: Logistic regression performed similarly to ML models. Despite being limited to 10 covariates, we were able to develop a better-performing model than existing proxy methods. Future work should compare ML and traditional statistical methods in datasets with more covariates, which may improve predictive performance.

Code

PT15

Topic

Epidemiology & Public Health, Methodological & Statistical Research, Real World Data & Information Systems, Study Approaches

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics, Disease Classification & Coding, Registries, Reproducibility & Replicability

Disease

Infectious Disease (non-vaccine), Oncology, Reproductive & Sexual Health