LLM Engineering in HEOR: Approaches to Improving Accuracy in Clinical Data Extraction

Author(s)

Kristian Eaves, MSc1, Mackenzie Mills, PhD2, Panos Kanavos, BSc, MSc, PhD3, Fiona Tolkmitt, BSc4, Ahmad Hecham Alani, PharmD5.
1Data Analyst, Hive Health Optimum Ltd., Pimlico, United Kingdom, 2Hive Health Optimum Ltd., London, United Kingdom, 3London School of Economics and Political Science, London, United Kingdom, 4Hive Health Optimum Ltd., Pimlico, United Kingdom, 5Hive Health Optimum Ltd., London, United Kingdom.
OBJECTIVES: Advances in large language models (LLMs) have offered new opportunities for data synthesis in health economics. However, LLM accuracy is limited by the complexity of documents and the need to understand clinical context. This research aims to assess the extent to which different LLM engineering techniques improve accuracy in the extraction of clinical data.
METHODS: Official websites of HAS, G-BA, NICE, and PBAC were screened for HTA reports published in the past 5 years assessing drugs used for solid tumours (n=471). A series of structured LLM extractions was run to evaluate performance in extracting data on indirect treatment comparisons (ITCs). Variables included ITC inclusion, adjustment, anchoring, matching, and overall sentiment. Gemini-2.0-Flash was used as a baseline, and all results were compared against a reference dataset. The methods tested were: adding a system prompt, advanced prompt engineering, additional HEOR context in the prompt (RAG), Gemini-2.5-Flash, Gemini-2.5-Pro, different temperature values, LLM-as-a-judge, and finally, taking the mode of multiple high-temperature results.
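For illustration, the final approach (taking the mode of multiple high-temperature results) can be sketched in Python as below. This is a minimal sketch only: the call_gemini wrapper, label names, and number of runs are hypothetical stand-ins, since the abstract does not describe the actual extraction pipeline or API calls.

    from collections import Counter

    def call_gemini(prompt: str, temperature: float) -> dict:
        # Hypothetical stand-in for a structured Gemini extraction call
        # returning one value per ITC label.
        raise NotImplementedError("stand-in for the actual Gemini API call")

    LABELS = ["itc_inclusion", "adjustment", "anchoring", "matching", "overall_sentiment"]
    N_RUNS = 5  # number of high-temperature samples per report (assumed value)

    def extract_by_mode(report_text: str) -> dict:
        """Run several high-temperature extractions and keep the modal value per label."""
        prompt = ("Extract the defined information about indirect treatment "
                  "comparisons from the following report: " + report_text)
        runs = [call_gemini(prompt, temperature=1.0) for _ in range(N_RUNS)]
        consensus = {}
        for label in LABELS:
            values = [run.get(label) for run in runs if run.get(label) is not None]
            consensus[label] = Counter(values).most_common(1)[0][0] if values else None
        return consensus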
RESULTS: The baseline approach, which used no system prompt, default settings, and a minimal context prompt ("Extract the defined information about indirect treatment comparisons from the following report: "), produced an average accuracy of 0.755 across the labels. The highest average accuracy (0.809) was achieved by adding HEOR publications on the use of ITCs in HTA to the prompt as context. The worst result came from the attempt to improve the baseline prompt, which reduced accuracy to 0.648. Gemini-2.5-Pro achieved an average accuracy of 0.788.
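Accuracy here is assumed to mean per-label agreement with the reference dataset, averaged across labels; the exact metric is not defined in the abstract. A minimal sketch of that comparison, with hypothetical data structures, is:

    def average_accuracy(predictions: list[dict], reference: list[dict], labels: list[str]) -> float:
        """Mean per-label agreement between LLM extractions and the reference dataset."""
        per_label = []
        for label in labels:
            correct = sum(1 for pred, ref in zip(predictions, reference)
                          if pred.get(label) == ref.get(label))
            per_label.append(correct / len(reference))
        return sum(per_label) / len(per_label)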
CONCLUSIONS: Results support using smaller models such as Gemini-2.0-Flash with additional context about the task and HEOR. Overall accuracy across models remains moderate at best, highlighting the continued importance of a human-in-the-loop for LLM tasks. Small changes to prompts or additional task-specific detail seem to reduce the quality of the output and may add little value.

Conference/Value in Health Info

2025-11, ISPOR Europe 2025, Glasgow, Scotland

Value in Health, Volume 28, Issue S2

Code

HTA225

Topic

Health Technology Assessment, Medical Technologies, Methodological & Statistical Research

Disease

No Additional Disease & Conditions/Specialized Treatment Areas, Oncology
