Machine Learning-Based Prediction of Race and Ethnicity From Name and Location in Walmart Pharmacy Data to Support Patient Engagement and Enable Clinical Trial Diversity
Speaker(s)
ABSTRACT WITHDRAWN
OBJECTIVES: Race and ethnicity are commonly missing from healthcare databases; among Walmart pharmacy patients, about 85% of values are missing, which limits the ability to address disparities in patient engagement and trial recruitment programs. Bayesian improved surname geocoding (BISG) predicts self-reported race/ethnicity from surname and residential location using census data. It is used across industries, yet machine learning (ML) techniques have shown improved performance over BISG. Two ML models were created to predict missing race/ethnicity in Walmart pharmacy patients.
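The Bayesian update at the core of BISG can be sketched as follows. This is a minimal illustration rather than the implementation used in the study: the function name `bisg_posterior`, the probability tables, and the `None` return for missing evidence are all hypothetical, and the numbers are made up rather than taken from census files.

```python
def bisg_posterior(p_race_given_surname, p_geo_given_race):
    """Combine surname and geography evidence with Bayes' rule.

    posterior(r) is proportional to P(r | surname) * P(geo | r).
    """
    unnormalized = {
        race: p_race_given_surname[race] * p_geo_given_race[race]
        for race in p_race_given_surname
    }
    total = sum(unnormalized.values())
    if total == 0:
        # Hypothetical handling: no usable evidence, so no prediction.
        return None
    return {race: value / total for race, value in unnormalized.items()}


# Made-up illustrative probabilities, not census figures.
surname_prior = {"Asian": 0.05, "Black": 0.10, "Hispanic": 0.70, "White": 0.15}
geo_likelihood = {"Asian": 0.001, "Black": 0.002, "Hispanic": 0.004, "White": 0.001}

posterior = bisg_posterior(surname_prior, geo_likelihood)
predicted = max(posterior, key=posterior.get)  # category with the highest posterior
```

Because the update relies only on the two lookup tables, a surname or location absent from the reference data leaves the method without a usable posterior.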
METHODS: Since 2020, the Walmart pharmacy database has recorded self-reported race/ethnicity (the reference standard) for 11.8 million patients, comprising Asian (6.3%), Black (10.2%), Hispanic (23.1%), and White (60.4%). Patients with known race/ethnicity were split 75:25 into training and testing datasets using stratified sampling. Model inputs were first and last name, zip code, and linked zip code data. Race/ethnicity predictions were generated using gradient boosting (XGBoost), deep learning (DL), and BISG. Overall AUROC and F1 scores by race/ethnicity category were used to identify the best-performing model.
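The stratified 75:25 split and the per-category F1 score described above can be illustrated with a small standard-library sketch. It is not the study's code: the helper names `stratified_split` and `f1_by_class` are hypothetical, the labels are a toy sample scaled to the reported class mix, and AUROC is omitted because it requires predicted probabilities.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx) keeping class proportions in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set.
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def f1_by_class(y_true, y_pred):
    """Per-class F1 = 2 * TP / (2 * TP + FP + FN)."""
    scores = {}
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[cls] = 2 * tp / denom if denom else 0.0
    return scores


# Toy labels mirroring the reported class mix (per 1,000 patients).
labels = ["Asian"] * 63 + ["Black"] * 102 + ["Hispanic"] * 231 + ["White"] * 604
train_idx, test_idx = stratified_split(labels)

# Tiny hand-made example of per-class F1 on four predictions.
toy_scores = f1_by_class(["A", "A", "B", "B"], ["A", "B", "B", "B"])
```

Splitting within each category keeps small groups, such as the 6.3% Asian share, represented in both the training and testing data.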
RESULTS: XGBoost resulted in an AUROC of 0.92 and the following F1 scores: Asian 0.87, Black 0.78, Hispanic 0.84, White 0.85, and overall 0.83. DL resulted in an AUROC of 0.93 and the following F1 scores: Asian 0.82, Black 0.62, Hispanic 0.81, White 0.89, and overall 0.85. BISG resulted in an AUROC of 0.54 and the following F1 scores: Asian 0.72, Black 0.55, Hispanic 0.73, White 0.96, and overall 0.55. BISG failed to predict race/ethnicity for 16% of patients (n=1,915,579). XGBoost and DL had similar AUROC and overall F1 scores, but XGBoost had higher F1 scores in three of the four categories (Asian, Black, and Hispanic). DL performed well in every category except Black (F1 0.62). BISG predicted only the White category well (F1 0.96).
CONCLUSIONS: The tested ML models, with AUROCs of 0.92 (XGBoost) and 0.93 (DL), can be reliably used to impute missing race/ethnicity in healthcare databases. The imputed values can then be used to improve diversity and representation within patient engagement efforts.
Code
MSR48
Topic
Health Policy & Regulatory, Methodological & Statistical Research, Patient-Centered Research
Topic Subcategory
Health Disparities & Equity, Missing Data, Patient Engagement
Disease
No Additional Disease & Conditions/Specialized Treatment Areas