Assessing Bias in LLM-Extracted Real-World Data: A Health Equity Analysis of Access to Care and Outcomes in Metastatic Breast Cancer
Author(s)
Olive M. Mbah, PhD, MHS, Gene G. Ho, MPH, Catherine Keane, BSN, MSN, Qianyu Yuan, PhD, Cleo Ryals, PhD.
Flatiron Health, New York, NY, USA.
OBJECTIVES: As large language models (LLMs) are increasingly used to extract clinical data from electronic health records, assessing model fairness is critical. We evaluated whether scientific conclusions are reproducible when LLM-extracted data are used in place of human-abstracted data to examine inequities in biomarker testing and overall survival among patients with HR+/HER2- metastatic breast cancer (mBC).
METHODS: We used the US-based Flatiron Health Research Database to select women diagnosed with HR+/HER2- mBC between January 2018 and March 2025. LLMs and human abstraction were used to curate clinical variables from unstructured clinical documents. Cox proportional hazards models were used to assess associations of race/ethnicity and social determinants of health (SDOH) with two outcomes: (1) biomarker testing and (2) overall survival. We compared hazard ratios (HR) and the overlap of 95% CIs between the LLM-extracted and human-abstracted cohorts.
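For readers less familiar with the model, the standard Cox proportional hazards form and the hazard ratio derived from it can be written as below. This is the textbook formulation, not a description of the authors' exact specification: the covariate vector x, the coefficient index j, and the Wald-type interval with the 1.96 multiplier are assumptions, since the abstract does not report covariate sets or how the CIs were computed.

```latex
% Textbook Cox model (assumed form; the abstract does not give the exact specification)
h(t \mid x) = h_0(t)\,\exp\!\left(\beta^{\top} x\right)

% Hazard ratio for covariate j, holding the others fixed
\mathrm{HR}_j = \exp(\beta_j)

% Wald-type 95% confidence interval (assumed)
\left[\exp\!\big(\hat\beta_j - 1.96\,\widehat{\mathrm{SE}}(\hat\beta_j)\big),\;
      \exp\!\big(\hat\beta_j + 1.96\,\widehat{\mathrm{SE}}(\hat\beta_j)\big)\right]
```

Under this form, an HR below 1 for the biomarker testing outcome corresponds to a lower rate of testing relative to the reference group, and an HR above 1 for overall survival corresponds to a higher hazard of death.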
RESULTS: Patients in the LLM-extracted (N = 25,055) and human-abstracted (N = 8,530) cohorts exhibited similar sociodemographic and clinical characteristics. Across both cohorts, Latinx, Black, and Asian patients were generally less likely to undergo biomarker testing than White patients (e.g., Black versus White, LLM: HR=0.89; 95% CI: 0.84-0.93 versus human: HR=0.91; 95% CI: 0.83-0.99). SDOH estimates were also similar across cohorts (e.g., patients from the most affluent neighborhoods were more likely to receive biomarker testing than patients from the least affluent neighborhoods; LLM: HR=1.23; 95% CI: 1.17-1.29 versus human: HR=1.28; 95% CI: 1.17-1.40). Survival estimates were similar across cohorts, with worse survival among Black patients (LLM: HR=1.27; 95% CI: 1.19-1.36 versus human: HR=1.34; 95% CI: 1.21-1.48) and among those living in low-income (highest versus lowest income, LLM: HR=0.75; 95% CI: 0.70-0.80 versus human: HR=0.77; 95% CI: 0.69-0.86), rural (LLM: HR=1.14; 95% CI: 1.08-1.21 versus human: HR=1.13; 95% CI: 1.02-1.26), and predominantly Black neighborhoods (LLM: HR=1.33; 95% CI: 1.24-1.43 versus human: HR=1.38; 95% CI: 1.21-1.57).
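The comparison of hazard ratios and 95% CI overlap across cohorts can be illustrated with a minimal Python sketch using only the estimates reported above. The names `Estimate`, `intervals_overlap`, `same_direction`, and `summarize` are hypothetical helpers introduced for illustration; the abstract does not describe the authors' actual comparison procedure beyond "overlap of 95% CIs", so the direction-of-effect check is an added assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Estimate:
    """A hazard ratio with its 95% confidence interval."""
    hr: float
    lo: float
    hi: float


def intervals_overlap(a: Estimate, b: Estimate) -> bool:
    """True if the two confidence intervals share at least one point."""
    return a.lo <= b.hi and b.lo <= a.hi


def same_direction(a: Estimate, b: Estimate) -> bool:
    """True if both hazard ratios fall on the same side of the null (HR = 1)."""
    return (a.hr - 1.0) * (b.hr - 1.0) > 0


# (LLM-extracted, human-abstracted) pairs copied from the RESULTS text.
REPORTED = {
    "testing: Black vs White": (
        Estimate(0.89, 0.84, 0.93), Estimate(0.91, 0.83, 0.99)),
    "testing: most vs least affluent": (
        Estimate(1.23, 1.17, 1.29), Estimate(1.28, 1.17, 1.40)),
    "survival: Black patients": (
        Estimate(1.27, 1.19, 1.36), Estimate(1.34, 1.21, 1.48)),
    "survival: highest vs lowest income": (
        Estimate(0.75, 0.70, 0.80), Estimate(0.77, 0.69, 0.86)),
    "survival: rural": (
        Estimate(1.14, 1.08, 1.21), Estimate(1.13, 1.02, 1.26)),
    "survival: predominantly Black neighborhoods": (
        Estimate(1.33, 1.24, 1.43), Estimate(1.38, 1.21, 1.57)),
}


def summarize(pairs):
    """Map each comparison label to (CIs overlap, same direction of effect)."""
    return {
        label: (intervals_overlap(llm, human), same_direction(llm, human))
        for label, (llm, human) in pairs.items()
    }


if __name__ == "__main__":
    for label, (overlap, direction) in summarize(REPORTED).items():
        print(f"{label}: overlap={overlap}, same_direction={direction}")
```

Overlapping intervals are a descriptive agreement check rather than a formal equivalence test, and the narrower LLM-cohort intervals are consistent with that cohort's larger sample size.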
CONCLUSIONS: Health equity analyses using LLM-derived data mirrored findings from analyses using abstracted data, indicating model fairness and appropriateness for use in equity-focused cancer research. With appropriate validation, LLMs offer a scalable and algorithmically fair alternative to manual abstraction.
Conference/Value in Health Info
2025-11, ISPOR Europe 2025, Glasgow, Scotland
Value in Health, Volume 28, Issue S2
Code
MSR39
Topic
Methodological & Statistical Research, Real World Data & Information Systems
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics, Confounding, Selection Bias Correction, Causal Inference
Disease
Oncology