A Comparative Assessment of LLM Agreement for Clinical Data Extraction Tasks
Author(s)
Kristian Eaves, MSc1, Viktoriia Zaitceva, MSc1, Ryan Lin, BSc, MSc1, Mackenzie Mills, PhD1, Panos Kanavos, BSc, MSc, PhD2
1HTA Hive, London, United Kingdom, 2London School of Economics, London, United Kingdom
OBJECTIVES: AI tools show promise for data extraction in systematic literature reviews (SLRs), meta-analyses, and related evidence-synthesis tasks. However, LLMs are known to “hallucinate”, and different models have distinct strengths and weaknesses. In this study, we evaluate the comparative output of different LLMs when extracting clinical data from health technology assessment (HTA) reports.
METHODS: Text was extracted in Python from HTA reports published in Canada and Scotland between January 2016 and August 2024 (n=258). We compared GPT-4o-Mini, Llama-3.1-8B, and Gemini-Flash-1.5-002, selected for their popularity and cost. JSON was specified as the output format, including a breakdown of clinical trials, real-world evidence (RWE), and indirect treatment comparisons (ITCs). All models received the same prompt and parameters, including an example result. A similarity scoring system was devised: clearly defined outputs were compared directly, descriptive outputs were compared using SentenceTransformers, and each match contributed to a higher score. Output length was measured as a proxy for the detail of model responses. A subset of reports was manually inspected.
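As an illustration of the scoring approach, the sketch below pairs exact matching of clearly defined fields with embedding-based similarity for descriptive fields via SentenceTransformers. The embedding model (all-MiniLM-L6-v2) and all field names are assumptions chosen for illustration, not the authors' actual schema or code.

from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def score_pair(rec_a: dict, rec_b: dict) -> float:
    """Agreement score between two models' extractions of one record."""
    score = 0.0
    # Clearly defined fields: direct (exact) comparison.
    for field in ("trial_name", "phase", "sample_size"):  # hypothetical fields
        if rec_a.get(field) is not None and rec_a.get(field) == rec_b.get(field):
            score += 1.0
    # Descriptive fields: cosine similarity between sentence embeddings.
    for field in ("endpoint_results", "population"):  # hypothetical fields
        a, b = rec_a.get(field), rec_b.get(field)
        if a and b:
            emb = embedder.encode([a, b], convert_to_tensor=True)
            score += float(util.cos_sim(emb[0], emb[1]))
    return score

Under this scheme, higher scores indicate closer agreement between two models on the same report, which is how the pairwise figures in the results can be read.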
RESULTS: 20.5% of Llama and 4.7% of Gemini outputs (24.4% in total) were not valid JSON and could not be analysed. For trials, the GPT/Gemini pair had the highest similarity score (5.1) and the Llama/Gemini pair the lowest (3.8); ITCs followed the same pattern (3.2 and 2.4, respectively). GPT failed to capture RWE, so it could not be compared on that category. Gemini produced the longest mean output (2,539 characters), versus 1,430 for GPT and 1,270 for Llama. Top- and bottom-scoring documents were identified for manual inspection, which showed GPT and Gemini to be effective at extracting endpoint results, although both were inconsistent in reporting structure. Llama was highly inconsistent in returning valid JSON and struggled overall, with hallucinations detected.
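The invalid-JSON rates above imply a screening step in which each raw model response is parsed before analysis. A minimal sketch of such a screen (our illustration, not the authors' code):

import json

def screen_output(raw: str):
    """Parse a raw model response; invalid JSON is excluded from scoring."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return None, len(raw)  # unparseable output: counted, not analysed
    return parsed, len(raw)    # output length in characters, as reported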
CONCLUSIONS: Output quality varies across models. Further work could include prompt engineering, fine-tuning, additional pre-processing (e.g., semantic chunking), or modifications to the output schema. We should remain cautious about the trustworthiness of data extracted through these methods; stakeholders must stay aware of the underlying limitations of these tools and adjust accordingly.
Conference/Value in Health Info
2025-05, ISPOR 2025, Montréal, Quebec, CA
Value in Health, Volume 28, Issue S1
Code
MSR17
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas