FROM RETRIEVAL TO VERDICT: A HYBRID LLM PIPELINE FOR EVALUATING MEDICAL AND ECONOMIC CLAIMS

Author(s)

Achilleas Livieratos, PhD¹, Maria Kudela, PhD², Yuxi Zhao, PhD², All-shine Chen, PhD², Junjing Lin, PhD³, Di Zhang, PhD⁴, Xin Luo, PhD², Paula Angelica Ramos, MSc², Chinyu Su, MD², Margaret Gamalo, PhD².
¹SPAIML Scientific Working Group, New York, NY, USA, ²Pfizer, New York, NY, USA, ³Takeda Pharmaceuticals, Cambridge, MA, USA, ⁴Teva Pharmaceuticals, New York, NY, USA.

Presentation Documents

ISPOR26_Livieratos_POSTERB.pdf

OBJECTIVES: Verifying clinical and economic claims in health technology assessments (HTAs) and systematic reviews is manual and time-consuming. Conventional LLMs such as GPT-4 have been effective at filtering, but they suffer from factual noise and citation hallucinations. We propose a hybrid AI pipeline comprising retrieval-augmented generation (RAG), LLM-based abstract re-ranking, and iterative critique using TextGrad to assist claim adjudication. Our objective was to design a transparent, evidence-based system that generates structured verdicts (TRUE, PARTLY TRUE, or FALSE) accompanied by PubMed references for HEOR and regulatory decision support.
METHODS: The pipeline comprised four stages: (1) Iterative retrieval with query expansion reformulated search queries dynamically to capture pivotal RCTs and real-world studies; (2) LLM-based abstract re-ranking (DeepSeek-R1) prioritized clinically relevant evidence (e.g., cost-effectiveness analyses, head-to-head trials); (3) TextGrad iterative critique applied gradient-style optimization, refining verdicts by penalizing unsupported statements and rewarding citation alignment; (4) Structured verdict enforcement constrained outputs to categorical judgments paired with PubMed IDs. The approach was tested on claims relating to ulcerative colitis treatments, encompassing efficacy, safety, and cost-effectiveness comparisons.
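Stage (4) can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `enforce_verdict`, the `Verdict` container, and the assumed output convention (a `VERDICT:` label plus citations written as `PMID: <digits>`) are all hypothetical. It only shows how free-text model output might be constrained to one of the three categorical labels and tied to PubMed IDs that were actually retrieved.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    label: str     # one of TRUE, PARTLY TRUE, FALSE
    pmids: tuple   # cited PubMed IDs that were among the retrieved abstracts


def enforce_verdict(raw_text, retrieved_pmids):
    """Parse free-text model output into a constrained verdict.

    Assumed (hypothetical) output convention: the text contains
    'VERDICT: <label>' and cites evidence as 'PMID: <digits>'.
    Raises ValueError if no allowed label is found or if none of the
    cited IDs belongs to the retrieved set.
    """
    # 'PARTLY TRUE' is listed first so it is not shadowed by plain 'TRUE'.
    match = re.search(r"VERDICT:\s*(PARTLY TRUE|TRUE|FALSE)\b", raw_text.upper())
    if match is None:
        raise ValueError("no allowed verdict label found")

    cited = re.findall(r"PMID:?\s*(\d+)", raw_text, flags=re.IGNORECASE)
    # Keep only citations that were actually retrieved; preserve order, drop repeats.
    supported = tuple(dict.fromkeys(p for p in cited if p in retrieved_pmids))
    if not supported:
        raise ValueError("verdict cites no retrieved PubMed ID")

    return Verdict(match.group(1), supported)
```

For example, `enforce_verdict("Verdict: partly true. See PMID: 12345678.", {"12345678"})` yields a `Verdict` with label `PARTLY TRUE`, while output lacking an allowed label or citing only unretrieved IDs is rejected and could be sent back for another critique round.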
RESULTS: The hybrid model consistently turned noisy retrieval results into clear, citation-supported decisions. First-round retrieval returned ~8% irrelevant abstracts; re-ranking and TextGrad refinement reduced this noise, and the resulting verdicts agreed closely with trial evidence and real-world data. Structured outputs led to higher precision and, together with automatic PubMed referencing, promoted accountability.
CONCLUSIONS: The hybrid pipeline demonstrates the potential of multi-stage AI architectures to improve the soundness, transparency, and scalability of evidence-based analysis of HEOR claims. RAG and re-ranking, combined with TextGrad critique, produce structured verdicts that support HTA and payer negotiation. Beyond ulcerative colitis, the approach is applicable to oncology, rare diseases, and other therapeutic areas.

Conference/Value in Health Info

2026-05, ISPOR 2026, Philadelphia, PA, USA

Value in Health, Volume 29, Issue S6

Code

MSR175

Topic

Methodological & Statistical Research

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics

Disease

SDC: Systemic Disorders/Conditions (Anesthesia, Auto-Immune Disorders (n.e.c.), Hematological Disorders (non-oncologic), Pain)

Presentation (CTI)