OncoLlama: Quantifying Cancer Inequalities at Population Scale Using Large Language Models for Clinical Text Abstraction
Author(s)
Joe Zhang, MD1, Emily Jin, MD, MSc2, Lawrence Adams, MD2, Martin Chapman, PhD2.
1London, United Kingdom, 2Artificial Intelligence Centre for Value-Based Healthcare, London, United Kingdom.
OBJECTIVES: Socioeconomic disparities in cancer outcomes are well-documented, but measuring late diagnosis requires detailed registry data unavailable for many cancers and cohesive regional populations. We fine-tuned Llama 3.1 8B (OncoLlama) to extract comprehensive cancer information (diagnosis date, topography, morphology, biomarkers, staging, metastases, treatment/toxicity/progression events, physical/functional findings) from cancer documents. We investigated associations between socioeconomic deprivation and stage at diagnosis using four cancer types that lacked registry coverage.
METHODS: OncoLlama was fine-tuned on 3,500 validated samples using an expert-designed Pydantic schema. F1-scores exceeded 0.98 for all key variables in local documents. Cancer documents dated between 2021 and 2025 from a London cancer centre serving 2.2 million people were processed with the LLM. Patients with ovarian, lung, or colon cancer or melanoma were included where staging was confirmed within 6 months of any diagnosis made after 1 January 2021. Staging was derived from LLM-extracted TNM classifications, numeric stages, and anatomical spread. Age-adjusted ordered regression assessed the association between deprivation and cancer stage (I-IV), and binary logistic regression assessed the association between deprivation and metastatic versus non-metastatic disease.
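The inclusion rule and outcome coding described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `ExtractedCase` record and its field names are hypothetical (a plain dataclass stands in for the Pydantic schema), "6 months" is approximated as 183 days, "after 1 January 2021" is read as strictly after, and metastatic disease is assumed to mean an M1 category or, failing that, stage IV.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

STUDY_START = date(2021, 1, 1)
STAGING_WINDOW = timedelta(days=183)  # assumption: "6 months" taken as 183 days
STAGE_ORDER = {"I": 1, "II": 2, "III": 3, "IV": 4}


@dataclass
class ExtractedCase:
    """Hypothetical per-patient record built from LLM-extracted fields."""
    diagnosis_date: date
    staging_date: date
    stage_group: str                  # one of "I", "II", "III", "IV"
    m_category: Optional[str] = None  # e.g. "M0", "M1", or None if not extracted


def is_eligible(case: ExtractedCase) -> bool:
    """Diagnosis after the study start and staging within the window."""
    if case.diagnosis_date <= STUDY_START:  # assumption: strictly after
        return False
    gap = abs(case.staging_date - case.diagnosis_date)
    return gap <= STAGING_WINDOW and case.stage_group in STAGE_ORDER


def code_outcomes(case: ExtractedCase) -> dict:
    """Return the ordinal stage, late-stage flag, and metastatic flag."""
    ordinal = STAGE_ORDER[case.stage_group]
    if case.m_category is not None:
        metastatic = case.m_category.upper().startswith("M1")
    else:
        metastatic = ordinal == 4  # assumption: stage IV as a proxy for M1
    return {
        "stage_ordinal": ordinal,      # outcome for the ordered regression
        "late_stage": ordinal >= 3,    # stage III/IV
        "metastatic": metastatic,      # outcome for the binary regression
    }
```

The ordinal value would feed the ordered regression and the metastatic flag the binary logistic regression, each with age and deprivation decile as covariates.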
RESULTS: In-study LLM validation showed F1-scores of 0.989 (topography+date of diagnosis) and 0.992 (stage+date of staging). Inclusion criteria were met by 4,544 patients: lung (55.5%), colon (21.7%), ovarian (13.2%), melanoma (9.6%). Overall, 58.7% presented with late-stage disease (III/IV). Greater deprivation (per decile) was associated with 3.7% higher odds of advanced stage (OR: 1.037, 95% CI: 1.016-1.058, p<0.001) and 4.4% higher odds of metastatic disease (OR: 1.044, 95% CI: 1.017-1.072, p=0.0014) at diagnosis.
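The per-decile odds ratios above can be read on the log-odds scale. The sketch below recovers an approximate coefficient, standard error, and Wald p-value from a reported OR and 95% CI, and compounds a per-decile OR across several deciles. It assumes a Wald-type interval and an effect that is linear in log-odds across deciles; the function names are illustrative, and because the inputs are rounded the results are only approximate.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def wald_summary(odds_ratio: float, ci_low: float, ci_high: float):
    """Approximate log-odds coefficient, standard error, and two-sided p-value."""
    beta = math.log(odds_ratio)
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * Z_95)
    z = beta / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return beta, se, p_value


def odds_ratio_across(or_per_decile: float, n_deciles: int) -> float:
    """Compound a per-decile OR over n deciles (assumes log-linearity)."""
    return or_per_decile ** n_deciles


# Reported estimates from the RESULTS section
stage_beta, stage_se, stage_p = wald_summary(1.037, 1.016, 1.058)
mets_beta, mets_se, mets_p = wald_summary(1.044, 1.017, 1.072)

# Implied OR between the most and least deprived deciles (9 steps apart)
stage_or_extremes = odds_ratio_across(1.037, 9)
```

Recomputing the p-values this way is a quick consistency check on the reported intervals; the extreme-decile comparison is an extrapolation that the abstract itself does not report.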
CONCLUSIONS: LLM-extracted staging data revealed significant associations between deprivation and advanced stage at diagnosis, indicating pathway disparities before hospital presentation. This approach demonstrates the potential of LLMs to accelerate cancer research by extracting complex clinical variables at population scale. Variables extracted but not used in this analysis include genomic biomarkers, detailed treatment timelines, and toxicity / disease progression events.
Conference/Value in Health Info
2025-11, ISPOR Europe 2025, Glasgow, Scotland
Value in Health, Volume 28, Issue S2
Code
CO173
Topic
Clinical Outcomes, Epidemiology & Public Health, Real World Data & Information Systems
Disease
Oncology