Comparing Traditional and Large Language Models for Extracting Breast and Endometrial Related Clinical Features From Electronic Health Records

Speaker(s)

Tait K1, Cronin J2, Wallis J2, Dürichen R3
1Arcturis Data Ltd, Thirsk, NYK, UK, 2Arcturis Data Ltd, Oxford, UK, 3Arcturis Data Ltd, Kidlington, OXF, UK

OBJECTIVES: Advances in general-purpose large language models (LLMs) offer promising opportunities to extract oncology clinical markers at scale from unstructured real-world data (RWD). However, their accuracy compared with smaller, domain-specific natural language processing (NLP) models is uncertain. Key clinical features for endometrial and breast cancer are documented in unstructured text, making their extraction challenging. This study evaluates the ability of LLMs versus ArcTEX, a smaller, task-specific NLP model, to accurately extract clinical markers from unstructured RWD.

METHODS: We analyzed 77,693 anonymized English pathology reports from Oxford University Hospitals NHS Foundation Trust, annotating a subset of 2,151 randomly selected reports covering 13 clinical markers: for endometrial cancer, FIGO score, grade, p53, MMR, MLH1, MSH2, MSH6, PMS2, myometrial invasion, and lymphovascular invasion; and for breast cancer, HER2, ER, and PR. The objective was to extract each marker's value or to report its absence. We investigated open-source LLMs (Llama-2-7B, Llama-2-7B-Chat, Llama-3-8B, Llama-3-8B-Instruct) using various prompting techniques (zero-shot, few-shot, role-based) and fine-tuning on 3,568 annotated samples. The LLMs were instructed to return answers in JSON format to facilitate post-processing. The comparison model, ArcTEX, is a question-answering model based on BioBERT and fine-tuned on the same training data, with outputs classified by a domain-adapted SetFit classifier optimized using an unsupervised denoising auto-encoder technique.
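A minimal sketch of the role-based, JSON-constrained prompting setup described above is shown here, assuming the Hugging Face transformers library and the publicly released Llama-3-8B-Instruct checkpoint. The system prompt wording, the extract_marker helper, and the example report snippet are illustrative assumptions, not the study's actual implementation.

```python
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# One of the evaluated open-source LLMs (gated checkpoint; requires access approval).
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

# Hypothetical role-based prompt; the exact wording used in the study is not published.
SYSTEM_PROMPT = (
    "You are a clinical NLP assistant. Extract the requested biomarker from the "
    "pathology report. Respond only with JSON of the form "
    '{"marker": "<name>", "value": "<value or \'not reported\'>"}.'
)

def extract_marker(report_text: str, marker: str) -> dict:
    """Query the LLM for a single clinical marker and parse its JSON answer."""
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Marker: {marker}\n\nReport:\n{report_text}"},
    ]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output_ids = model.generate(input_ids, max_new_tokens=64, do_sample=False)
    answer = tokenizer.decode(
        output_ids[0, input_ids.shape[-1]:], skip_special_tokens=True
    )
    try:
        return json.loads(answer)
    except json.JSONDecodeError:
        # LLMs without fine-tuning often returned malformed JSON; flag for review.
        return {"marker": marker, "value": None, "error": "unparsable output"}

# Usage with a synthetic snippet (not a real patient report):
# print(extract_marker("ER: positive (Allred 7/8). PR: negative. HER2: 1+.", "ER"))
```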

RESULTS: The best-performing open-source LLM was Llama-3-8B-Instruct with role-based prompts, achieving a mean (standard deviation) accuracy of 91.38% (6.7%) across clinical markers after fine-tuning. The ArcTEX model achieved a higher mean accuracy of 97.62% with a lower standard deviation (1.5%).
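As a small clarification of how the reported figures aggregate, the sketch below computes a mean and standard deviation of per-marker accuracy across clinical markers. Whether the published figures use the sample or population standard deviation is not stated; the sketch assumes the sample standard deviation, and the example values are placeholders, not the study's per-marker results.

```python
import statistics

def summarize_accuracy(per_marker_accuracy: dict[str, float]) -> tuple[float, float]:
    """Aggregate per-marker extraction accuracy into a mean and (sample)
    standard deviation across the evaluated clinical markers."""
    values = list(per_marker_accuracy.values())
    return statistics.mean(values), statistics.stdev(values)

# Placeholder values only; the per-marker accuracies behind the reported
# 91.38% (6.7%) and 97.62% (1.5%) figures are not published in the abstract.
example = {"ER": 0.95, "PR": 0.93, "HER2": 0.90, "FIGO score": 0.88}
mean_acc, sd_acc = summarize_accuracy(example)
print(f"mean accuracy = {mean_acc:.2%}, standard deviation = {sd_acc:.2%}")
```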

CONCLUSIONS: ArcTEX outperforms general-purpose LLMs, even after fine-tuning, in extracting oncology-related clinical markers from unstructured RWD. LLMs require extensive fine-tuning to approach the accuracy of domain-specific models; zero- and few-shot prompting alone is not sufficient. LLMs without fine-tuning frequently produced incorrectly formatted output. Together with the higher variance in accuracy across clinical markers, this indicates that smaller, task-specific NLP models, such as ArcTEX, are preferable for this application.

Code

RWD31

Topic

Methodological & Statistical Research, Study Approaches

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics, Electronic Medical & Health Records

Disease

Oncology