Development and Validation of a Machine Learning-Based Screening Algorithm to Predict High Risk of Hepatitis C Infection
Author(s)
Suk-Chan Jang, PharmD, PhD1, Pilar Hernández Con, MD1, Chanakan Jenjai, PharmD1, Ashley Stultz, BS1, Shunhua Yan, MEd1, Debbie L Wilson, PhD1, Weihsuan Jenny Lo-Ciganic, MS, PhD2, James Huang, PhD1, Ashley Norse, MD3, Faheem Guirgis, MD1, Robert L. Cook, MD, MPH1, Christine Gage, DO3, Khoa Nguyen, PharmD1, David R. Nelson, MD1, Haesuk Park, PhD1
1University of Florida, Gainesville, FL, USA, 2University of Pittsburgh, Pittsburgh, PA, USA, 3University of Florida, Jacksonville, FL, USA
OBJECTIVES: Hepatitis C virus (HCV) infections are rising sharply in the United States amid the opioid epidemic. Because HCV infection is often asymptomatic, nearly half of HCV-infected individuals are unaware of their infection. This study aimed to develop and validate a machine learning-based screening tool to identify individuals at high risk of HCV infection.
METHODS: We conducted prognostic modeling with retrospective cohort data from the 2016-2023 OneFlorida+ database, an all-payer electronic health records system covering approximately 75% of Floridians. The study included individuals tested for HCV (antibody, RNA, or genotype) and evaluated 275 candidate predictors measured during a 6-month baseline period. These predictors included sociodemographic and clinical characteristics (e.g., comorbidities, procedures, medications). Four machine learning algorithms - elastic net (EN), random forest (RF), gradient boosting machine (GBM), and deep neural network (DNN) - were developed and validated to predict HCV infection. Individuals were stratified into risk groups by deciles of the predicted risk score.
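The decile-based risk stratification described above can be illustrated with a minimal sketch. It assumes rank-based deciles with decile 1 as the highest predicted risk; the abstract does not state how deciles were numbered or how ties were handled, and the function names and toy data here are hypothetical, not the study's code or data.

```python
from typing import List, Sequence


def assign_deciles(scores: Sequence[float]) -> List[int]:
    """Return one decile label per score: 1 = highest predicted risk, 10 = lowest.

    Assumption: deciles are formed from ranks, so each decile holds an
    (almost) equal number of individuals; ties are broken by input order.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: -scores[i])  # highest risk first
    deciles = [0] * n
    for rank, idx in enumerate(order):
        deciles[idx] = rank * 10 // n + 1
    return deciles


def share_of_positives_in_top(deciles: Sequence[int],
                              labels: Sequence[int],
                              k: int) -> float:
    """Fraction of all positive cases that fall in the top k risk deciles."""
    total = sum(labels)
    captured = sum(y for d, y in zip(deciles, labels) if d <= k)
    return captured / total


# Toy data invented for illustration only (not study data).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10,
          0.09, 0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02, 0.01, 0.00]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 1,
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

deciles = assign_deciles(scores)
top3_share = share_of_positives_in_top(deciles, labels, 3)
print(f"Share of positives in top 3 deciles: {top3_share:.2f}")
```

In this toy example, three of the four positive cases fall in the top three deciles, so testing only those deciles would test 30% of individuals while capturing 75% of positives.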
RESULTS: Among 445,624 individuals tested for HCV, 11,834 individuals (2.65%) tested positive. The training (75%) and validation (25%) samples had similar characteristics (mean age 45 years; 37% male; 54% White; 19% Medicaid). The GBM model demonstrated the best performance (C statistic [95% CI]: 0.92 [0.91-0.92]), outperforming EN (0.89 [0.88-0.89]), RF (0.85 [0.85-0.86]), and DNN (0.91 [0.90-0.91]). At the threshold selected by the Youden index, the GBM model achieved 79.4% sensitivity and 89.1% specificity, with a testing yield of one positive HCV case per six tests. Over 90% of HCV-positive patients were classified in the top three deciles, suggesting that restricting testing to these deciles could reduce testing by 70%. Key risk predictors included non-Hispanic ethnicity, White race, older age, smoking, a history of HIV testing and prothrombin time testing, and fewer outpatient visits, while commercial insurance was associated with lower risk.
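As a rough arithmetic check on the figures above, the sketch below computes the Youden index (J = sensitivity + specificity - 1) and converts the reported sensitivity and specificity into a positive predictive value and an approximate number of tests per positive case. It assumes the prevalence in the validation sample equals the overall rate of 11,834 / 445,624, which the abstract does not state explicitly. The threshold scan is one simple way to pick a Youden-optimal cut-off and is not necessarily the authors' procedure; all function names and toy data are hypothetical.

```python
from typing import Sequence, Tuple


def youden_j(sensitivity: float, specificity: float) -> float:
    """Youden index: J = sensitivity + specificity - 1."""
    return sensitivity + specificity - 1.0


def best_youden_threshold(scores: Sequence[float],
                          labels: Sequence[int]) -> Tuple[float, float]:
    """Scan each observed score as a cut-off (positive if score >= cut-off).

    Returns (cut-off, J) with the largest J; ties keep the lowest cut-off.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    best_cut, best_j = float("nan"), -1.0
    for cut in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cut and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < cut and y == 0)
        j = youden_j(tp / pos, tn / neg)
        if j > best_j:
            best_cut, best_j = cut, j
    return best_cut, best_j


def positive_predictive_value(sens: float, spec: float, prev: float) -> float:
    """PPV from sensitivity, specificity, and prevalence (Bayes' rule)."""
    true_pos = sens * prev
    false_pos = (1.0 - spec) * (1.0 - prev)
    return true_pos / (true_pos + false_pos)


# Figures reported in the abstract.
sensitivity, specificity = 0.794, 0.891
prevalence = 11_834 / 445_624  # assumption: same rate in the validation sample

j_reported = youden_j(sensitivity, specificity)
ppv = positive_predictive_value(sensitivity, specificity, prevalence)
tests_per_positive = 1.0 / ppv
print(f"J = {j_reported:.3f}, PPV = {ppv:.3f}, "
      f"tests per positive = {tests_per_positive:.1f}")

# Toy scores and labels invented for illustration only (not study data).
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
toy_labels = [0, 0, 0, 1, 0, 1, 1, 1]
toy_cut, toy_j = best_youden_threshold(toy_scores, toy_labels)
```

Under these assumptions the positive predictive value is roughly 17%, which corresponds to about six tests per positive case and is consistent with the reported testing yield.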
CONCLUSIONS: Machine learning algorithms can effectively predict and stratify HCV infection risk, offering a promising targeted screening tool in clinical settings.
Conference/Value in Health Info
2025-05, ISPOR 2025, Montréal, Quebec, Canada
Value in Health, Volume 28, Issue S1
Code
EPH75
Topic
Epidemiology & Public Health
Topic Subcategory
Public Health
Disease
SDC: Gastrointestinal Disorders, SDC: Infectious Disease (non-vaccine)