FROM TEXT TO STRUCTURE: EVALUATING LLMS FOR THE EXTRACTION OF COMPLEX EVIDENCE AND UNCERTAINTY VARIABLES FROM HEALTH TECHNOLOGY ASSESSMENT REPORTS
Author(s)
Finlay McIntyre, PhD, David Kirchheimer, PhD, Sandhya Alagan, MSc, Mackenzie Mills, PhD, Panos Kanavos, PhD;
HTA-Hive, London, United Kingdom
OBJECTIVES: This study evaluated a hybrid framework employing large language models (LLMs) for the extraction of structured clinical and economic variables from Health Technology Assessment (HTA) reports and investigated the use of an "LLM-as-a-Judge" as a novel, scalable method to assess extraction accuracy.
METHODS: A sample of 150 HTA reports from multiple agencies (e.g., NICE, CADTH, TLV) was processed. The target schema included core identifiers, population criteria, and detailed evidence/uncertainty variables (categorising clinical/economic evidence, model uncertainties, real-world evidence role, and social value judgments). A hybrid extraction pipeline was implemented, using rule-based patterns for high-confidence fields and zero/few-shot prompting of a state-of-the-art LLM (e.g., GPT-5) for complex, free-text variables. To evaluate accuracy, a separate LLM-as-a-Judge was prompted to assess the congruence between source text and extracted output for each variable. These automated scores were validated against a subset of 30 human-annotated gold-standard reports.
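The hybrid design described above can be illustrated with a minimal sketch: regex patterns handle high-confidence structured fields, while complex free-text variables would be routed to an LLM prompt. All field names, patterns, and the `llm` callable below are illustrative assumptions, not the authors' implementation.

```python
import re

# Illustrative rule-based patterns for high-confidence structured fields
# (field names and formats are assumptions for demonstration only).
STRUCTURED_PATTERNS = {
    "molecule": re.compile(r"Molecule:\s*(?P<value>[A-Za-z0-9\- ]+)"),
    "recommendation": re.compile(
        r"Recommendation:\s*(?P<value>Recommended|Not recommended|Restricted)",
        re.IGNORECASE,
    ),
}

def extract_structured(report_text: str) -> dict:
    """Rule-based extraction for high-confidence fields."""
    out = {}
    for field, pattern in STRUCTURED_PATTERNS.items():
        m = pattern.search(report_text)
        out[field] = m.group("value").strip() if m else None
    return out

def extract_free_text(report_text: str, variable: str, llm=None):
    """Zero-shot extraction of a complex variable via an LLM (stubbed here).

    `llm` stands in for whatever model client the pipeline uses; the real
    system would also validate outputs with a separate LLM-as-a-Judge prompt.
    """
    if llm is None:
        return None  # no model client wired in for this sketch
    prompt = f"Extract the variable '{variable}' from this HTA report:\n{report_text}"
    return llm(prompt)

# Hypothetical report snippet to exercise the rule-based path.
report = "Molecule: Examplemab\nRecommendation: Restricted\nPopulation: adults..."
fields = extract_structured(report)
```

In a full pipeline, the rule-based and LLM-extracted fields would be merged into one record per report, with judge scores attached per variable to flag candidates for human review.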
RESULTS: The hybrid pipeline successfully populated the complex schema, with performance varying significantly by variable type. High accuracy (F1 >0.85) was achieved for structured fields (e.g., molecule, recommendation). Extraction of nuanced evidence and uncertainty variables (e.g., "type of economic model uncertainty") proved more challenging, with F1 scores ranging from 0.65 to 0.80. The LLM-as-a-Judge's accuracy assessments showed strong correlation (r > 0.75) with human judgment for factual variables but lower agreement for subjective classifications. Error analysis revealed that ambiguity in source text phrasing and the synthesis of scattered information were primary failure modes.
CONCLUSIONS: LLMs present a powerful but imperfect tool for structuring complex HTA data. A hybrid rules/LLM approach can effectively build comprehensive databases, with the LLM-as-a-Judge offering a scalable first-pass quality check. The findings provide a framework for prioritising human-in-the-loop review, focusing expert effort on the most semantically challenging evidence and uncertainty variables. This methodology enables the systematic analysis of HTA rationales and evidentiary requirements across jurisdictions.
Conference/Value in Health Info
2026-05, ISPOR 2026, Philadelphia, PA, USA
Value in Health, Volume 29, Issue S6
Code
P18
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas