Evaluating the Use of Large Language Models (LLMs) in Summarizing Treatment Guidelines: A Pilot Study
Author(s)
Charlotte Heminsley, PhD, Natasha Hopwood, MSc, Anna Karakusevic, MSc.
RJW&partners, Filleigh, United Kingdom.
OBJECTIVES: Under the EU’s Health Technology Assessment (HTA) Regulation, submissions must describe current clinical management, including the care pathway and variations across European-level clinical guidelines. This study evaluated the accuracy, readability, and time-efficiency of LLM-generated treatment guideline summaries for inclusion in submission dossiers, compared with manual compilation.
METHODS: We used ChatGPT-4o, Gemini 2.5 Flash, and Microsoft 365 Copilot to identify and summarize treatment guidelines for acute lymphoblastic leukaemia (ALL), chronic hand eczema (CHE), and type 1/2 diabetes in May 2025. Using a two-step process, we first iteratively designed a prompt to identify relevant guidelines; the identified guidelines were then uploaded into the LLM-powered applications with instructions for data extraction and formatting. For comparison, an experienced Medical Writer manually identified and summarized each guideline.
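To illustrate the two-step workflow described above, the following is a minimal sketch assuming programmatic access via the OpenAI Python SDK rather than the chat applications used in the pilot; the prompt wording, function names, and output format are illustrative assumptions, not the prompts developed in the study.

# Sketch of the two-step guideline workflow (assumed API access, illustrative prompts).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
MODEL = "gpt-4o"

def identify_guidelines(indication: str) -> str:
    """Step 1: ask the model to list current European-level treatment guidelines."""
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system",
             "content": "You are a medical writer preparing an HTA submission."},
            {"role": "user",
             "content": (
                 f"List the current European-level clinical treatment guidelines for "
                 f"{indication}. For each, give the issuing body, title, and year."
             )},
        ],
    )
    return response.choices[0].message.content

def summarise_guideline(guideline_text: str) -> str:
    """Step 2: extract and format treatment recommendations from one uploaded guideline."""
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "user",
             "content": (
                 "Summarise the treatment recommendations in the guideline below as a "
                 "table with columns: treatment line, patient population (age, disease "
                 "severity), and recommended treatments. Where the guideline gives no "
                 "specific recommendation for a stage, state this rather than inferring one.\n\n"
                 + guideline_text
             )},
        ],
    )
    return response.choices[0].message.content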
RESULTS: LLMs demonstrated accuracy comparable to manual compilation in identifying relevant guidelines, although they identified country-specific consensus-based guidance less frequently. Additional prompting improved completeness. Outputs for diabetes guidelines were more complete than those for CHE and ALL. Error rates were higher for CHE and ALL because treatments were omitted and treatments were not differentiated by patient characteristics (e.g., age or disease severity) unless explicitly prompted. Hallucinations were also observed, with LLMs populating treatment stages in the absence of specific guidance. Readability was high for ChatGPT and Copilot, with well-formatted tables and consistent spelling, abbreviations, and punctuation. However, variation in how treatment descriptions were presented compromised consistency across summaries. LLMs delivered only modest time savings (~10-15%) compared with manual summarization, largely because of the time needed to iteratively develop and refine a usable prompt.
CONCLUSIONS: LLMs provide a foundation for identifying and summarizing treatment guidelines to support dossier development. However, review by a Medical Writer remains essential to ensure outputs are suitable for inclusion in HTA and value dossiers. Learnings from this pilot will reduce the number of prompt-design iterations required, saving time in future projects.
Conference/Value in Health Info
2025-11, ISPOR Europe 2025, Glasgow, Scotland
Value in Health, Volume 28, Issue S2
Code
MSR97
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
Diabetes/Endocrine/Metabolic Disorders (including obesity), Oncology, Sensory System Disorders (Ear, Eye, Dental, Skin)