Augmenting Race and Ethnicity in a Real-World Oncology Cohort Using the Bayesian Improved Surname Geocoding Methodology

Author(s)

Gene G. Ho, Philani Mpofu, PhD, Olive M. Mbah, PhD, Cleo Ryals, PhD;
Flatiron Health, New York, NY, USA

OBJECTIVES: Race/ethnicity data missingness is a common challenge across real-world data sources and a barrier to health equity monitoring and improvement efforts. The Bayesian Improved Surname Geocoding (BISG) method is a validated approach for imputing missing race/ethnicity data; however, use of BISG in diverse, nationwide oncology cohorts has been limited. We examined the validity of BISG in an electronic health record (EHR)-derived cohort of patients with cancer, assessing race/ethnicity concordance (EHR-documented vs BISG-imputed) and associations between race/ethnicity and patient outcomes.
METHODS: This retrospective study used the nationwide Flatiron Health EHR-derived, de-identified database to analyze a US cohort of patients diagnosed with cancer between January 1, 2011, and October 31, 2024. We imputed patient-level race/ethnicity by applying BISG to names and census geography, and assessed performance using calibration and classification metrics (eg, area under the precision-recall curve [PRAUC] and kappa statistic). We estimated multivariable Cox models assessing associations between EHR-documented race/ethnicity and several outcomes, including real-world overall survival, time to initiation, and clinical trial participation. All models were adjusted for demographic and clinical factors and were replicated using BISG-imputed race/ethnicity, incorporating the imputed probabilities as weights.
RESULTS: Our calibration cohort included 2 250 391 patients with a known surname and census geography. BISG increased the proportions of Latinx (+53.09%), non-Latinx (NL-)Asian (+38.14%), NL-Black (+34.69%), and NL-White (+17.45%). For patients with self-reported race/ethnicity, the PRAUC was 0.94 for NL-White, 0.76 for NL-Black and Latinx, and 0.67 for NL-Asian. Classification yielded 86% accuracy and a kappa of 0.66. Hazard ratios for outcomes were similar across BISG-imputed and EHR-documented race/ethnicity, with differences generally not exceeding 5%.
CONCLUSIONS: BISG is an acceptable tool for augmenting missing race/ethnicity data, with important utility in oncology and health equity-related real-world data (RWD) use cases and diversity planning.
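
ILLUSTRATION: The abstract does not give implementation details, so the following is only a minimal sketch of the Bayesian update that BISG is generally understood to perform: a surname-based prior over race/ethnicity is multiplied by the likelihood of the patient's census geography given each group, then renormalized. The function name `bisg_posterior`, the category labels, and all probabilities below are hypothetical and are not taken from the study or from any census table.

```python
from typing import Dict


def bisg_posterior(
    p_race_given_surname: Dict[str, float],
    p_geo_given_race: Dict[str, float],
) -> Dict[str, float]:
    """Combine a surname prior with a geography likelihood via Bayes' rule.

    posterior(r) is proportional to P(r | surname) * P(geography | r).
    Both inputs are assumed to share the same set of category keys.
    """
    if p_race_given_surname.keys() != p_geo_given_race.keys():
        raise ValueError("prior and likelihood must cover the same categories")

    # Unnormalized posterior for each category.
    unnormalized = {
        race: p_race_given_surname[race] * p_geo_given_race[race]
        for race in p_race_given_surname
    }
    total = sum(unnormalized.values())
    if total <= 0:
        raise ValueError("no category has positive posterior mass")

    # Renormalize so the probabilities sum to 1.
    return {race: value / total for race, value in unnormalized.items()}


# Hypothetical inputs for a single patient (illustrative values only).
surname_prior = {
    "Latinx": 0.10,
    "NL-Asian": 0.05,
    "NL-Black": 0.25,
    "NL-White": 0.60,
}
geography_likelihood = {
    "Latinx": 0.002,
    "NL-Asian": 0.001,
    "NL-Black": 0.004,
    "NL-White": 0.001,
}

posterior = bisg_posterior(surname_prior, geography_likelihood)
most_likely = max(posterior, key=posterior.get)
```

In this made-up example the geography likelihood shifts the most probable category away from the surname prior's leading group. The full posterior vector, rather than only the most likely category, is what a probability-weighted analysis of the kind described in METHODS would carry forward.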

Conference/Value in Health Info

2025-05, ISPOR 2025, Montréal, Quebec, CA

Value in Health, Volume 28, Issue S1

Code

MSR136

Topic

Methodological & Statistical Research

Topic Subcategory

Missing Data

Disease

No Additional Disease & Conditions/Specialized Treatment Areas, SDC: Oncology
