Understanding Predictors of Early-Onset Colorectal Cancer in Personal Health Data with Machine Learning
Author(s)
Ashis K. Das, PhD, MD, MPH, Melinda Rossi, MPH, Michael Broder, MD, MSHS, Caitlin Sheetz, MPH
ADVI Health, Washington, DC, USA
OBJECTIVES: Colorectal cancer (CRC) is the second leading cause of cancer-related mortality in the United States (US), and although its overall incidence is stabilizing, the incidence of early-onset CRC (EOCRC, diagnosed before age 50) is increasing. Previous studies that predicted EOCRC with machine learning (ML) primarily used electronic health record data and were limited to single centers or small geographic areas. This study aims to identify key factors that predict the likelihood of EOCRC using ML and self-reported personal health data in a nationally representative US sample.
METHODS: We conducted a retrospective analysis of National Health Interview Survey (NHIS) data from 2019-2023 for adults younger than 50 years. The binary outcome was whether an individual reported a CRC diagnosis. Predictors included self-reported sociodemographic factors (age, sex, race, ethnicity, urban residence, and household income-to-poverty ratio), body mass index (BMI), smoking, and health history (hypertension, diabetes, and high cholesterol). We constructed four ML models, each with and without a balanced bagging classifier, to distinguish individuals with EOCRC from those without: balanced random forest, gradient boosting, logistic regression, and support vector machine.
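The abstract does not describe how the resampling was implemented. As a minimal sketch of the idea behind balanced bagging, the following draws bootstrap samples in which every class is resampled down to the size of the rarest class, so each base model would see equal numbers of EOCRC and cancer-free rows. The function name `balanced_bootstrap`, the choice to sample with replacement, the number of bags, and the toy label vector (built only from the class counts reported in the results) are assumptions for illustration, not the authors' code.

```python
import random
from collections import Counter


def balanced_bootstrap(labels, rng):
    """Return row indices for one class-balanced bootstrap sample.

    Each class is sampled with replacement to the size of the rarest class
    (an illustrative choice; real libraries expose this as a setting).
    """
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    n_minority = min(len(rows) for rows in by_class.values())

    sample = []
    for rows in by_class.values():
        sample.extend(rng.choices(rows, k=n_minority))
    rng.shuffle(sample)
    return sample


# Toy labels using only the reported class counts (1 = EOCRC, 0 = cancer-free).
labels = [1] * 173 + [0] * 63033
rng = random.Random(0)

# Ten bags is an arbitrary number chosen for the illustration.
bags = [balanced_bootstrap(labels, rng) for _ in range(10)]
print(Counter(labels[i] for i in bags[0]))
```

Each bag holds 173 rows of each class, which is why a classifier trained on it is not dominated by the cancer-free majority; an ensemble would then average predictions across the bags.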
RESULTS: We identified 173 individuals diagnosed with EOCRC and 63,033 cancer-free individuals. The balanced random forest with the balanced bagging classifier had the best discriminative ability in predicting EOCRC (area under the receiver operating characteristic curve [AUC] = 0.80). AUCs for the other models were 0.79 (logistic regression) and 0.77 (support vector machine and gradient boosting). Based on the feature importance ranking from the balanced random forest model, the top predictors were smoking, BMI, hypertension, age, and household income-to-poverty ratio.
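For readers less familiar with the metric, AUC can be read as the probability that a randomly chosen individual with the outcome receives a higher predicted score than a randomly chosen individual without it, with ties counted as one half. The sketch below computes it directly from that definition. The function name `roc_auc` and the five-row toy data are invented for illustration and are not study data.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up example: two positives, three negatives.
y_true = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.1]
print(round(roc_auc(y_true, scores), 3))  # 5 of 6 pairs are ordered correctly
```

Because the metric depends only on ranking, it is unaffected by the class imbalance in the sample size itself, which is one reason it is commonly reported for rare outcomes.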
CONCLUSIONS: This study demonstrates the potential of using ML techniques and publicly available data to predict EOCRC, which may inform early detection of and intervention for EOCRC in large US populations.
Conference/Value in Health Info
2025-05, ISPOR 2025, Montréal, Quebec, CA
Value in Health, Volume 28, Issue S1
Code
CO92
Topic
Clinical Outcomes
Disease
No Additional Disease & Conditions/Specialized Treatment Areas, SDC: Oncology