FROM RETRIEVAL TO VERDICT: A HYBRID LLM PIPELINE FOR EVALUATING MEDICAL AND ECONOMIC CLAIMS

Author(s)

Achilleas Livieratos, PhD1, Maria Kudela, PhD2, Yuxi Zhao, PhD2, All-shine Chen, PhD2, Junjing Lin, PhD3, Di Zhang, PhD4, Xin Luo, PhD2, Paula Angelica Ramos, MSc2, Chinyu Su, MD2, Margaret Gamalo, PhD2.
1SPAIML Scientific Working Group, New York, NY, USA, 2Pfizer, New York, NY, USA, 3Takeda Pharmaceuticals, Cambridge, MA, USA, 4Teva Pharmaceuticals, New York, NY, USA.
OBJECTIVES: Verifying clinical and economic claims in health technology assessments (HTAs) and systematic reviews is manual and time-consuming. Conventional LLMs such as GPT-4 have been effective at filtering, but they suffer from factual noise and citation hallucinations. In this work, we propose a hybrid AI pipeline comprising retrieval-augmented generation (RAG), LLM-based abstract re-ranking, and iterative critique using TextGrad to assist claim adjudication. Our objective was to design a transparent, evidence-based system that generates structured verdicts (TRUE, PARTLY TRUE or FALSE) accompanied by PubMed references for HEOR and regulatory decision support.
METHODS: The pipeline comprised four stages: (1) Iterative retrieval with query expansion reformulated search queries dynamically to capture pivotal RCTs and real-world studies; (2) LLM-based abstract re-ranking (DeepSeek-R1) prioritized clinically relevant evidence (e.g., cost-effectiveness analyses, head-to-head trials); (3) TextGrad iterative critique applied gradient-style optimization, refining verdicts by penalizing unsupported statements and rewarding citation alignment; (4) Structured verdict enforcement constrained outputs to categorical judgments paired with PubMed IDs. The approach was tested on claims relating to ulcerative colitis treatments, encompassing efficacy, safety, and cost-effectiveness comparisons.
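As an illustration of stage (4), the following is a minimal sketch of how structured verdict enforcement might be checked in code. It is not the authors' implementation: the names `Verdict` and `enforce_verdict`, the rule that every cited PubMed ID must appear among the retrieved abstracts, and the 8-digit cap on ID length are all assumptions made for this example, and the IDs shown are made-up placeholders.

```python
import re
from dataclasses import dataclass

# The three categorical judgments named in the abstract.
ALLOWED_VERDICTS = ("TRUE", "PARTLY TRUE", "FALSE")

# PubMed IDs are numeric; the 1-8 digit length cap is an assumption of this sketch.
PMID_PATTERN = re.compile(r"\d{1,8}")


@dataclass(frozen=True)
class Verdict:
    """Hypothetical container for a validated verdict."""
    label: str
    pmids: tuple


def enforce_verdict(raw_label, raw_pmids, retrieved_pmids):
    """Validate a model's raw output against the structured-verdict constraints.

    raw_label:       free-text label produced by the model
    raw_pmids:       iterable of PubMed IDs the model cited
    retrieved_pmids: set of IDs (as strings) that the retrieval stage returned
    """
    # Normalize case and collapse internal whitespace before comparing.
    label = " ".join(raw_label.strip().upper().split())
    if label not in ALLOWED_VERDICTS:
        raise ValueError(f"verdict must be one of {ALLOWED_VERDICTS}, got {raw_label!r}")

    pmids = []
    for item in raw_pmids:
        pmid = str(item).strip()
        if not PMID_PATTERN.fullmatch(pmid):
            raise ValueError(f"malformed PubMed ID: {item!r}")
        # Assumed rule: a citation that was never retrieved may be hallucinated.
        if pmid not in retrieved_pmids:
            raise ValueError(f"cited PubMed ID {pmid} was not among retrieved abstracts")
        if pmid not in pmids:  # drop duplicates, keep first-seen order
            pmids.append(pmid)

    if not pmids:
        raise ValueError("a verdict must cite at least one PubMed ID")
    return Verdict(label=label, pmids=tuple(pmids))


# Usage with placeholder IDs (not real references).
retrieved = {"11111111", "22222222"}
verdict = enforce_verdict(" partly  true ", ["11111111", "11111111"], retrieved)
```

Rejecting an output outright, rather than silently repairing it, is one possible design choice; it lets an upstream critique step regenerate the verdict instead of passing along an unsupported citation.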
RESULTS: The hybrid model consistently turned noisy retrieval results into clear, citation-supported decisions. First-round retrieval pulled ~8% irrelevant abstracts; re-ranking and TextGrad refinement reduced this noise, and the resulting verdicts agreed closely with trial evidence and real-world data. Structured outputs led to higher precision and, together with automatic PubMed referencing, promoted accountability.
CONCLUSIONS: The hybrid pipeline presented here showcases the potential of multi-stage AI architectures to improve the soundness, transparency and scalability of evidence-based analysis in HEOR claim studies. RAG and re-ranking, combined with TextGrad critique, produce structured verdicts in support of HTA and payer negotiation. Beyond ulcerative colitis, the approach is applicable to oncology, rare diseases and other therapeutic areas.

Conference/Value in Health Info

2026-05, ISPOR 2026, Philadelphia, PA, USA

Value in Health, Volume 29, Issue S6

Code

MSR175

Topic

Methodological & Statistical Research

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics

Disease

SDC: Systemic Disorders/Conditions (Anesthesia, Auto-Immune Disorders (n.e.c.), Hematological Disorders (non-oncologic), Pain)
