A Comparative Assessment of LLM Agreement for Clinical Data Extraction Tasks
Author(s)
Kristian Eaves, MSc1, Viktoriia Zaitceva, MSc1, Ryan Lin, BSc, MSc1, Mackenzie Mills, PhD1, Panos Kanavos, BSc, MSc, PhD2
1HTA Hive, London, United Kingdom, 2London School of Economics, London, United Kingdom
OBJECTIVES: AI tools show promise for data extraction in systematic literature reviews (SLRs), meta-analyses, and related evidence-synthesis tasks. However, LLMs are known to “hallucinate”, and different models have distinct strengths and weaknesses. In this study, we evaluate the comparative output of different LLMs when extracting clinical data from health technology assessment (HTA) reports.
METHODS: Text was extracted in Python from HTA reports published in Canada and Scotland between January 2016 and August 2024 (n=258). We compared GPT-4o-Mini, Llama-3.1-8B, and Gemini-Flash-1.5-002, selected for their popularity and cost. JSON was specified as the output format, including a breakdown of clinical trials, real-world evidence (RWE), and indirect treatment comparisons (ITCs). All models received the same prompt and parameters, including an example result. A similarity scoring system was devised: clearly defined outputs were compared directly, descriptive outputs were compared using SentenceTransformers, and each match contributed to a higher score. Output length was measured as a proxy for the detail of model responses. A subset of reports was manually inspected.
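As an illustration of the scoring approach, the sketch below pairs exact matching of clearly defined fields with embedding-based similarity for descriptive fields via SentenceTransformers. The embedding model (all-MiniLM-L6-v2) and all field names are assumptions chosen for illustration, not the authors' actual schema or code.

from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def score_pair(rec_a: dict, rec_b: dict) -> float:
    """Agreement score between two models' extractions of one record."""
    score = 0.0
    # Clearly defined fields: direct (exact) comparison.
    for field in ("trial_name", "phase", "sample_size"):  # hypothetical fields
        if rec_a.get(field) is not None and rec_a.get(field) == rec_b.get(field):
            score += 1.0
    # Descriptive fields: cosine similarity between sentence embeddings.
    for field in ("endpoint_results", "population"):  # hypothetical fields
        a, b = rec_a.get(field), rec_b.get(field)
        if a and b:
            emb = embedder.encode([a, b], convert_to_tensor=True)
            score += float(util.cos_sim(emb[0], emb[1]))
    return score

Under this scheme, higher scores indicate closer agreement between two models on the same report, which is how the pairwise figures in the results can be read.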
RESULTS: 20.5% of Llama and 4.7% of Gemini outputs (24.4% in total) were not valid JSON and could not be analysed. For trials, the GPT/Gemini pair had the highest similarity score (5.1) and the Llama/Gemini pair the lowest (3.8); ITCs followed the same pattern (3.2 and 2.4, respectively). GPT failed to capture RWE, so it could not be compared on that category. Gemini produced the longest mean output (2,539 characters), versus 1,430 for GPT and 1,270 for Llama. Top- and bottom-scoring documents were identified for manual inspection, which showed GPT and Gemini to be effective at extracting endpoint results, although both were inconsistent in reporting structure. Llama was highly inconsistent in returning valid JSON and struggled overall, with hallucinations detected.
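The invalid-JSON rates above imply a screening step in which each raw model response is parsed before analysis. A minimal sketch of such a screen (our illustration, not the authors' code):

import json

def screen_output(raw: str):
    """Parse a raw model response; invalid JSON is excluded from scoring."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return None, len(raw)  # unparseable output: counted, not analysed
    return parsed, len(raw)    # output length in characters, as reported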
CONCLUSIONS: Output quality varies across models. Further work could include prompt engineering, fine-tuning, additional pre-processing (e.g., semantic chunking), or modifications to the output schema. We should remain cautious about the trustworthiness of data extracted through these methods; stakeholders must stay aware of the underlying limitations of these tools and adjust accordingly.
Conference/Value in Health Info
2025-05, ISPOR 2025, Montréal, Quebec, CA
Value in Health, Volume 28, Issue S1
Code
MSR17
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas