OncoLlama: Quantifying Cancer Inequalities at Population Scale Using Large Language Models for Clinical Text Abstraction

Author(s)

Joe Zhang, MD1, Emily Jin, MD, MSc2, Lawrence Adams, MD2, Martin Chapman, PhD2.
1London, United Kingdom, 2Artificial Intelligence Centre for Value-Based Healthcare, London, United Kingdom.
OBJECTIVES: Socioeconomic disparities in cancer outcomes are well documented, but measuring late diagnosis requires detailed registry data that are unavailable for many cancers and for cohesive regional populations. We fine-tuned Llama 3.1 8B (OncoLlama) to extract comprehensive cancer information (diagnosis date, topography, morphology, biomarkers, staging, metastases, treatment/toxicity/progression events, physical/functional findings) from cancer documents. We investigated associations between socioeconomic deprivation and stage at diagnosis in four cancer types that lacked registry coverage.
METHODS: OncoLlama was fine-tuned on 3,500 validated samples using an expert-designed Pydantic schema. F1-scores exceeded 0.98 for all key variables in local documents. Cancer documents dated between 2021 and 2025 from a London cancer centre serving 2.2 million people were processed with the LLM. Patients with ovarian, lung, or colon cancer, or melanoma, were included where staging was confirmed within 6 months of any diagnosis after 1 January 2021. Staging was derived from LLM-extracted TNM classifications, numeric stages, and anatomical spread. Age-adjusted ordered regression assessed the association between deprivation and cancer stage (I-IV), and binary logistic regression assessed the association between deprivation and non-metastatic versus metastatic disease.
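The stage derivation described above can be illustrated with a minimal sketch. The abstract does not give the authors' rules, so everything here is an assumption: the function name `stage_group`, the field names, and the simplification that any M1 category maps to stage IV. It uses only the Python standard library rather than the Pydantic schema mentioned in the methods.

```python
import re
from typing import Optional

# Hypothetical lookup; not taken from the OncoLlama schema.
_STAGE_VALUES = {"I": 1, "II": 2, "III": 3, "IV": 4,
                 "1": 1, "2": 2, "3": 3, "4": 4}


def stage_group(numeric_stage: Optional[str],
                m_category: Optional[str]) -> Optional[int]:
    """Return an overall stage from 1 to 4, or None if undetermined.

    Assumption: a distant-metastasis category (M1, pM1b, cM1a, ...)
    is treated as stage IV regardless of the recorded numeric stage.
    """
    if m_category:
        m = m_category.strip().upper()
        if re.fullmatch(r"[CP]?M1[A-D]?", m):
            return 4
    if numeric_stage:
        s = numeric_stage.strip().upper()
        # Accept "IIIA", "Stage 2", "iv", etc.; keep only the main group.
        match = re.match(r"^(?:STAGE\s*)?(IV|III|II|I|[1-4])", s)
        if match:
            return _STAGE_VALUES[match.group(1)]
    return None


def is_late_stage(stage: int) -> bool:
    """Late-stage disease as defined in the abstract: stage III or IV."""
    return stage >= 3


def is_metastatic(stage: int) -> bool:
    """Binary outcome for the logistic model, assuming stage IV means metastatic."""
    return stage == 4
```

The two boolean helpers mirror the two outcomes analysed: ordered stage (I-IV) and non-metastatic versus metastatic disease. A real pipeline would also need the anatomical-spread evidence mentioned in the methods, which this sketch omits.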
RESULTS: In-study LLM validation showed F1-scores of 0.989 (topography+date of diagnosis) and 0.992 (stage+date of staging). Inclusion criteria were met by 4,544 patients: lung (55.5%), colon (21.7%), ovarian (13.2%), melanoma (9.6%). Overall, 58.7% presented with late-stage disease (III/IV). Greater deprivation (per decile) was associated with 3.7% higher odds of advanced stage (OR: 1.037, 95% CI: 1.016-1.058, p<0.001) and 4.4% higher odds of metastatic disease (OR: 1.044, 95% CI: 1.017-1.072, p=0.0014) at diagnosis.
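To put the per-decile odds ratios in context, the sketch below compounds them across the nine-decile gap between the least and most deprived deciles. This is illustrative arithmetic, not an analysis reported by the authors: it assumes the fitted effect is linear on the log-odds scale across all deciles, and the helper `scale_odds_ratio` is a hypothetical name.

```python
def scale_odds_ratio(per_unit_or: float, units: int) -> float:
    """Compound a per-unit odds ratio over `units` steps.

    Valid when the model is linear in the predictor on the log-odds
    scale, so that OR(k units) = OR(1 unit) ** k.
    """
    return per_unit_or ** units


# Nine steps separate decile 1 from decile 10.
DECILE_GAP = 9

# Point estimates and 95% CI bounds as reported in the abstract.
reported = {
    "advanced stage": (1.037, 1.016, 1.058),
    "metastatic disease": (1.044, 1.017, 1.072),
}

scaled = {
    outcome: tuple(scale_odds_ratio(v, DECILE_GAP) for v in values)
    for outcome, values in reported.items()
}

for outcome, (est, low, high) in scaled.items():
    print(f"{outcome}: OR {est:.2f} (95% CI {low:.2f}-{high:.2f})")
```

Under that linearity assumption, the extreme-decile comparison corresponds to odds ratios of roughly 1.39 for advanced stage and 1.47 for metastatic disease. The confidence bounds are transformed the same way because raising to a power is monotone on the odds-ratio scale.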
CONCLUSIONS: LLM-extracted staging data revealed significant associations between deprivation and advanced stage at diagnosis, indicating pathway disparities before hospital presentation. This approach demonstrates the potential of LLMs to accelerate cancer research by extracting complex clinical variables at population scale. Extracted variables not used in this analysis include genomic biomarkers, detailed treatment timelines, and toxicity and disease progression events.

Conference/Value in Health Info

2025-11, ISPOR Europe 2025, Glasgow, Scotland

Value in Health, Volume 28, Issue S2

Code

CO173

Topic

Clinical Outcomes, Epidemiology & Public Health, Real World Data & Information Systems

Disease

Oncology
