LLM Engineering in HEOR: Approaches to Improving Accuracy in Clinical Data Extraction
Author(s)
Kristian Eaves, MSc1, Mackenzie Mills, PhD2, Panos Kanavos, BSc, MSc, PhD3, Fiona Tolkmitt, BSc4, Ahmad Hecham Alani, PharmD5.
1Data Analyst, Hive Health Optimum Ltd., Pimlico, United Kingdom, 2Hive Health Optimum Ltd., London, United Kingdom, 3London School of Economics and Political Science, London, United Kingdom, 4Hive Health Optimum Ltd., Pimlico, United Kingdom, 5Hive Health Optimum Ltd., London, United Kingdom.
OBJECTIVES: Advances in LLMs have offered new opportunities for data synthesis in health economics. However, LLM accuracy is limited by document complexity and by the models' understanding of clinical context. This research aims to assess the extent to which different LLM engineering techniques improve accuracy in the extraction of clinical data.
METHODS: Official websites from HAS, G-BA, NICE, and PBAC were screened for HTA reports from the past 5 years assessing drugs used for solid tumours (n=471). A series of structured LLM extractions was run to evaluate performance at extracting data on indirect treatment comparisons (ITCs). Variables included ITC inclusion, adjustment, anchoring, matching, and overall sentiment. Gemini-2.0-Flash was used as the baseline, and all results were compared against a reference data set. The methods tested were: adding a system prompt, advanced prompt engineering, additional HEOR context in the prompt (RAG), Gemini-2.5-Flash, Gemini-2.5-Pro, different temperature values, LLM-as-a-judge, and finally, taking the mode of multiple high-temperature results.
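To illustrate the kind of structured extraction and mode-of-multiple-runs aggregation described above, the sketch below assumes the google-genai Python SDK with a Pydantic response schema; the field names, prompt wrapper, and helper functions are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only; schema fields and helpers are assumptions,
# not the authors' actual pipeline.
from collections import Counter
from pydantic import BaseModel
from google import genai
from google.genai import types


class ITCExtraction(BaseModel):
    itc_included: bool       # does the report include an indirect treatment comparison?
    adjusted: bool           # was the ITC adjusted?
    anchored: bool           # was the comparison anchored via a common comparator?
    matched: bool            # were populations matched?
    overall_sentiment: str   # e.g. "positive", "neutral", "negative"


client = genai.Client()  # reads the API key from the environment


def extract_itc(report_text: str, temperature: float = 0.0) -> ITCExtraction:
    """Run one structured extraction over a single HTA report."""
    response = client.models.generate_content(
        model="gemini-2.0-flash",
        contents=(
            "Extract the defined information about indirect treatment "
            "comparisons from the following report: " + report_text
        ),
        config=types.GenerateContentConfig(
            temperature=temperature,
            response_mime_type="application/json",
            response_schema=ITCExtraction,
        ),
    )
    return response.parsed


def extract_with_consensus(report_text: str, n: int = 5) -> ITCExtraction:
    """Sketch of the 'mode of multiple high-temperature results' idea:
    repeat the extraction at high temperature and keep the most common
    value per field."""
    runs = [extract_itc(report_text, temperature=1.0).model_dump() for _ in range(n)]
    return ITCExtraction(**{
        field: Counter(run[field] for run in runs).most_common(1)[0][0]
        for field in ITCExtraction.model_fields
    })
```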
RESULTS: The baseline approach produced an average accuracy of 0.755 across the labels; this used no system prompt, all other default settings, and a minimal-context prompt: "Extract the defined information about indirect treatment comparisons from the following report: ". The highest average accuracy (0.809) was achieved by adding HEOR publications on the use of ITCs in HTA to the prompt as context. The worst result came from the attempt to improve the baseline prompt, which reduced accuracy to 0.648. Gemini-2.5-Pro had an average accuracy of 0.788.
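A minimal sketch of the label-level accuracy comparison implied here, assuming extractions and the reference data set are held as lists of dicts keyed by label name; the label names and function names are illustrative, not the authors' code.

```python
# Illustrative sketch: per-label accuracy against a reference data set,
# averaged across labels to give the figures reported per method.
from statistics import mean

LABELS = ["itc_included", "adjusted", "anchored", "matched", "overall_sentiment"]


def label_accuracy(predictions: list[dict], reference: list[dict]) -> dict[str, float]:
    """Accuracy of extracted values against the reference data set, per label."""
    return {
        label: mean(
            pred[label] == ref[label] for pred, ref in zip(predictions, reference)
        )
        for label in LABELS
    }


def average_accuracy(predictions: list[dict], reference: list[dict]) -> float:
    """Average accuracy across all labels for one extraction method."""
    return mean(label_accuracy(predictions, reference).values())
```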
CONCLUSIONS: Results support using smaller models such as Gemini-2.0-Flash with additional context about the task and HEOR. Overall accuracy across models remains moderate at best, highlighting the continued importance of a human in the loop for LLM tasks. Small changes to prompts or additional task-specific detail seem to reduce output quality and may add little value.
Conference/Value in Health Info
2025-11, ISPOR Europe 2025, Glasgow, Scotland
Value in Health, Volume 28, Issue S2
Code
HTA225
Topic
Health Technology Assessment, Medical Technologies, Methodological & Statistical Research
Disease
No Additional Disease & Conditions/Specialized Treatment Areas, Oncology