Thus SPOKE the TA: Querying NICE Technology Assessments Using Generative AI and RAG
Author(s)
Cale Harrison, MSc1, Eddy Tye, Ug1, Will Rhodes, Ug1, Smruti Prajnya Panigrahi, MPH, MSc1, Jack Said, MSc2.
1Pfizer UK, Walton Oaks, United Kingdom, 2Pfizer UK, Manchester, United Kingdom.
OBJECTIVES: This project aimed to assess the accuracy of several generative AI (GenAI) approaches in extracting and analysing key information from published NICE technology appraisals (TAs) across disease areas.
METHODS: Two AI tools were used to extract key data from NICE TAs: Pfizer’s VOX, which uses OpenAI’s GPT-4o, and Microsoft Copilot, which also uses GPT-4o but with a different context window. Data extracted from previously published NICE TA documents included incidence, prevalence, decision problem, comparator information, and reimbursement outcomes. Extraction used retrieval-augmented generation (RAG): recent, relevant TAs were uploaded to the tools before questions were asked. The analysis was run for four disease areas: obesity, breast cancer, prostate cancer, and lung cancer. For these preliminary results, each response was rated from 1 (low accuracy) to 3 (high accuracy). The key scenarios used five TAs to compare the tools (VOX 5 TA RAG and Copilot 5 TA RAG). As additional validation of optimal TA throughput, another scenario used one TA (VOX 1 TA RAG) and a further scenario used no TAs (VOX non-RAG).
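The rating step described above can be summarised per scenario as a percentage distribution over the three rating levels. The following is a minimal sketch of that tally, not the authors' analysis code: the function name `rating_distribution`, the label wording for rating 2, and the example ratings are assumptions made for illustration, and the ratings shown are placeholders rather than study data.

```python
from collections import Counter

# Rating scale from the abstract: 1 = low accuracy, 3 = high accuracy.
# The label for 2 is an assumption borrowed from the wording of the results.
RATING_LABELS = {3: "accurate", 2: "non-hallucination", 1: "inaccurate"}


def rating_distribution(ratings):
    """Return the percentage of responses at each rating level."""
    if not ratings:
        raise ValueError("no ratings supplied")
    invalid = [r for r in ratings if r not in RATING_LABELS]
    if invalid:
        raise ValueError(f"ratings must be 1, 2 or 3; got {invalid}")
    counts = Counter(ratings)
    total = len(ratings)
    return {
        label: 100 * counts[score] / total
        for score, label in RATING_LABELS.items()
    }


# Hypothetical ratings for a single scenario (placeholder values only).
example_ratings = [3, 3, 2, 1, 3, 1, 2, 3]
distribution = rating_distribution(example_ratings)

for label, pct in distribution.items():
    print(f"{label}: {pct:.1f}%")
```

Running the same function once per scenario (for example VOX 5 TA RAG, Copilot 5 TA RAG, VOX 1 TA RAG, and VOX non-RAG) would give distributions that can be compared side by side.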
RESULTS: The most accurate results were obtained with Microsoft Copilot using five TAs: 54% of responses were accurate (3), 14% were non-hallucination responses (2), and 32% were inaccurate (1). The least accurate results were obtained with VOX using five TAs: 11% accurate (3), 46% non-hallucination (2), and 43% inaccurate (1). A plausible explanation is a smaller context window in VOX.
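To make the size of the difference between the two five-TA scenarios explicit, the reported percentages can be compared directly. This is a short illustrative sketch using only the figures stated in the results; the dictionary layout and the helper name `gap` are assumptions, and the abstract does not report response counts, so no statistical test is attempted.

```python
# Percentages as reported in the abstract for the two five-TA scenarios.
reported = {
    "Copilot 5 TA RAG": {"accurate": 54, "non-hallucination": 14, "inaccurate": 32},
    "VOX 5 TA RAG": {"accurate": 11, "non-hallucination": 46, "inaccurate": 43},
}

# Sanity check: each distribution should account for all responses.
for scenario, dist in reported.items():
    assert sum(dist.values()) == 100, scenario


def gap(category):
    """Copilot minus VOX, in percentage points, for one rating category."""
    return reported["Copilot 5 TA RAG"][category] - reported["VOX 5 TA RAG"][category]


gaps = {c: gap(c) for c in ("accurate", "non-hallucination", "inaccurate")}

for category, points in gaps.items():
    print(f"{category}: {points:+d} percentage points (Copilot vs VOX)")
```

On these figures, Copilot's share of accurate responses is 43 percentage points higher than VOX's, while its shares of non-hallucination and inaccurate responses are 32 and 11 points lower.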
CONCLUSIONS: Generative AI tools can extract and analyse key parameters from NICE TAs, but performance is sensitive to the tool used and the number of documents provided. As these results show, GenAI still hallucinates, so a human-in-the-loop (HITL) approach remains recommended.
Conference/Value in Health Info
2025-11, ISPOR Europe 2025, Glasgow, Scotland
Value in Health, Volume 28, Issue S2
Code
P54
Topic
Health Technology Assessment, Methodological & Statistical Research, Real World Data & Information Systems
Topic Subcategory
Decision & Deliberative Processes
Disease
Diabetes/Endocrine/Metabolic Disorders (including obesity), Oncology