Machine Learning-Based Prediction of Race and Ethnicity From Name and Location in Walmart Pharmacy Data to Support Patient Engagement and Enable Clinical Trial Diversity

Speaker(s)

ABSTRACT WITHDRAWN

OBJECTIVES: Race/ethnicity is commonly missing from healthcare databases; it is missing for about 85% of Walmart pharmacy patients, which limits the ability to address disparities in patient engagement and trial recruitment programs. Bayesian Improved Surname Geocoding (BISG) predicts race/ethnicity from surname and residential location using census data. It is used across industries, yet machine learning (ML) techniques have shown improved performance over BISG. Two ML models were created to predict missing race/ethnicity in Walmart pharmacy patients.
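The BISG idea can be illustrated with a minimal Python sketch of the Bayes update that combines surname-based and location-based evidence. This is only an illustration under stated assumptions, not the implementation evaluated in this abstract: the function name `bisg_posterior` is hypothetical, and the probabilities are made-up placeholders rather than census values.

```python
def bisg_posterior(p_race_given_surname, p_geo_given_race):
    """Combine surname and location evidence with Bayes' rule.

    posterior(race) is proportional to P(race | surname) * P(geo | race).
    Both inputs are dicts keyed by race/ethnicity category.
    """
    unnormalized = {
        race: p_race_given_surname[race] * p_geo_given_race[race]
        for race in p_race_given_surname
    }
    total = sum(unnormalized.values())
    if total == 0:
        # No usable evidence, e.g. the surname or area is absent from the tables.
        return None
    return {race: value / total for race, value in unnormalized.items()}


# Made-up illustrative numbers (assumption), not real census figures.
surname_prior = {"Asian": 0.05, "Black": 0.10, "Hispanic": 0.70, "White": 0.15}
geo_likelihood = {"Asian": 0.001, "Black": 0.002, "Hispanic": 0.004, "White": 0.001}

posterior = bisg_posterior(surname_prior, geo_likelihood)
predicted = max(posterior, key=posterior.get)
```

In this toy example the surname evidence and the location evidence both favor the same category, so the posterior is more concentrated than either input alone.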

METHODS: Since 2020, the Walmart pharmacy database has included self-reported race/ethnicity (the reference standard) for 11.8 million patients: Asian (6.3%), Black (10.2%), Hispanic (23.1%), and White (60.4%). Records for patients with known race/ethnicity were split 75:25 into training and testing datasets using stratified sampling. Model inputs were first and last name, ZIP code, and linked ZIP code data. Race/ethnicity was predicted using gradient boosting (XGBoost), deep learning (DL), and BISG. Overall AUROC and F1 scores by race/ethnicity category were used to identify the best-performing model.
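A 75:25 stratified split can be sketched in plain Python as follows. This is a minimal illustration, assuming each record carries a single category label; the helper `stratified_split`, the seed, and the toy label counts are hypothetical and are not taken from the study.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_indices, test_indices) with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within each class before cutting
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Toy labels (assumption): small counts chosen only to keep the example readable.
labels = ["Asian"] * 8 + ["Black"] * 12 + ["Hispanic"] * 24 + ["White"] * 60
train_idx, test_idx = stratified_split(labels)
```

Splitting within each category keeps smaller groups represented in the test set in the same proportion as in the training set, which matters when per-category F1 scores are compared.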

RESULTS: XGBoost resulted in an AUROC of 0.92 and the following F1 scores: Asian 0.87, Black 0.78, Hispanic 0.84, White 0.85, and overall 0.83. DL resulted in an AUROC of 0.93 and the following F1 scores: Asian 0.82, Black 0.62, Hispanic 0.81, White 0.89, and overall 0.85. BISG resulted in an AUROC of 0.54 and the following F1 scores: Asian 0.72, Black 0.55, Hispanic 0.73, White 0.96, and overall 0.55. BISG failed to predict race/ethnicity for 16% of patients (n=1,915,579). XGBoost and DL had similar AUROC and overall F1 scores, but XGBoost had higher F1 scores than DL for three of the four categories (Asian, Black, and Hispanic). DL performed well for all categories except Black. BISG predicted only the White category well.
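For readers less familiar with the metric, per-category F1 can be computed from predicted and reference labels as in the sketch below. The function `f1_by_class` and the toy labels are hypothetical. The abstract does not state how the overall F1 was aggregated, so the unweighted (macro) average shown here is an assumption.

```python
def f1_by_class(y_true, y_pred):
    """Compute a one-vs-rest F1 score for each category."""
    scores = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == cls and p == cls for t, p in pairs)
        fp = sum(t != cls and p == cls for t, p in pairs)
        fn = sum(t == cls and p != cls for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the class never appears.
        scores[cls] = 2 * tp / denom if denom else 0.0
    return scores


# Toy labels (assumption), not study data.
y_true = ["White", "White", "Black", "Hispanic", "Asian", "Black"]
y_pred = ["White", "White", "White", "Hispanic", "Asian", "Black"]

scores = f1_by_class(y_true, y_pred)
macro_f1 = sum(scores.values()) / len(scores)  # macro averaging is an assumption
```

Because each category is scored separately, a model can have a high overall score while still performing poorly for one group.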

CONCLUSIONS: The tested ML models, with AUROCs of 0.92 (XGBoost) and 0.93 (DL), can be reliably used to impute race/ethnicity in healthcare databases, and the imputed values can be used to improve diversity and representation within patient engagement efforts.

Code

MSR48

Topic

Health Policy & Regulatory, Methodological & Statistical Research, Patient-Centered Research

Topic Subcategory

Health Disparities & Equity, Missing Data, Patient Engagement

Disease

No Additional Disease & Conditions/Specialized Treatment Areas