FROM TEXT TO STRUCTURE: EVALUATING LLMS FOR THE EXTRACTION OF COMPLEX EVIDENCE AND UNCERTAINTY VARIABLES FROM HEALTH TECHNOLOGY ASSESSMENT REPORTS
Author(s)
Finlay McIntyre, PhD, David Kirchheimer, PhD, Sandhya Alagan, MSc, Mackenzie Mills, PhD, Panos Kanavos, PhD;
HTA-Hive, London, United Kingdom
OBJECTIVES: This study evaluated a hybrid framework employing large language models (LLMs) for the extraction of structured clinical and economic variables from Health Technology Assessment (HTA) reports and investigated the use of an "LLM-as-a-Judge" as a novel, scalable method to assess extraction accuracy.
METHODS: A sample of 150 HTA reports from multiple agencies (e.g., NICE, CADTH, TLV) was processed. The target schema included core identifiers, population criteria, and detailed evidence/uncertainty variables (categorising clinical/economic evidence, model uncertainties, real-world evidence role, and social value judgments). A hybrid extraction pipeline was implemented, using rule-based patterns for high-confidence fields and zero/few-shot prompting of a state-of-the-art LLM (e.g., GPT-5) for complex, free-text variables. To evaluate accuracy, a separate LLM-as-a-Judge was prompted to assess the congruence between source text and extracted output for each variable. These automated scores were validated against a subset of 30 human-annotated gold-standard reports.
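The hybrid design described above can be illustrated with a minimal sketch: regex patterns handle high-confidence structured fields, while complex free-text variables would be routed to an LLM prompt. All field names, patterns, and the `llm` callable below are illustrative assumptions, not the authors' implementation.

```python
import re

# Illustrative rule-based patterns for high-confidence structured fields
# (field names and formats are assumptions for demonstration only).
STRUCTURED_PATTERNS = {
    "molecule": re.compile(r"Molecule:\s*(?P<value>[A-Za-z0-9\- ]+)"),
    "recommendation": re.compile(
        r"Recommendation:\s*(?P<value>Recommended|Not recommended|Restricted)",
        re.IGNORECASE,
    ),
}

def extract_structured(report_text: str) -> dict:
    """Rule-based extraction for high-confidence fields."""
    out = {}
    for field, pattern in STRUCTURED_PATTERNS.items():
        m = pattern.search(report_text)
        out[field] = m.group("value").strip() if m else None
    return out

def extract_free_text(report_text: str, variable: str, llm=None):
    """Zero-shot extraction of a complex variable via an LLM (stubbed here).

    `llm` stands in for whatever model client the pipeline uses; the real
    system would also validate outputs with a separate LLM-as-a-Judge prompt.
    """
    if llm is None:
        return None  # no model client wired in for this sketch
    prompt = f"Extract the variable '{variable}' from this HTA report:\n{report_text}"
    return llm(prompt)

# Hypothetical report snippet to exercise the rule-based path.
report = "Molecule: Examplemab\nRecommendation: Restricted\nPopulation: adults..."
fields = extract_structured(report)
```

In a full pipeline, the rule-based and LLM-extracted fields would be merged into one record per report, with judge scores attached per variable to flag candidates for human review.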
RESULTS: The hybrid pipeline successfully populated the complex schema, with performance varying significantly by variable type. High accuracy (F1 >0.85) was achieved for structured fields (e.g., molecule, recommendation). Extraction of nuanced evidence and uncertainty variables (e.g., "type of economic model uncertainty") proved more challenging, with F1 scores ranging from 0.65 to 0.80. The LLM-as-a-Judge's accuracy assessments showed strong correlation (r > 0.75) with human judgment for factual variables but lower agreement for subjective classifications. Error analysis revealed that ambiguity in source text phrasing and the synthesis of scattered information were primary failure modes.
CONCLUSIONS: LLMs present a powerful but imperfect tool for structuring complex HTA data. A hybrid rules/LLM approach can effectively build comprehensive databases, with the LLM-as-a-Judge offering a scalable first-pass quality check. The findings provide a framework for prioritising human-in-the-loop review, focusing expert effort on the most semantically challenging evidence and uncertainty variables. This methodology enables the systematic analysis of HTA rationales and evidentiary requirements across jurisdictions.
Conference/Value in Health Info
2026-05, ISPOR 2026, Philadelphia, PA, USA
Value in Health, Volume 29, Issue S6
Code
P18
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas