FROM RETRIEVAL TO VERDICT: A HYBRID LLM PIPELINE FOR EVALUATING MEDICAL AND ECONOMIC CLAIMS
Author(s)
Achilleas Livieratos, PhD1, Maria Kudela, PhD2, Yuxi Zhao, PhD2, All-shine Chen, PhD2, Junjing Lin, PhD3, Di Zhang, PhD4, Xin Luo, PhD2, Paula Angelica Ramos, MSc2, Chinyu Su, MD2, Margaret Gamalo, PhD2.
1SPAIML Scientific Working Group, New York, NY, USA, 2Pfizer, New York, NY, USA, 3Takeda Pharmaceuticals, Cambridge, MA, USA, 4Teva Pharmaceuticals, New York, NY, USA.
OBJECTIVES: Verifying clinical and economic claims in health technology assessments (HTAs) and systematic reviews is manual and time-consuming. Conventional LLMs such as GPT-4 have been effective at filtering, but they suffer from factual noise and citation hallucinations. In this work, we propose a hybrid AI pipeline comprising retrieval-augmented generation (RAG), LLM-based abstract re-ranking, and iterative critique using TextGrad to assist claim adjudication. Our objective was to design a transparent, evidence-based system that generates structured verdicts (TRUE, PARTLY TRUE, or FALSE) accompanied by PubMed references for HEOR and regulatory decision support.
METHODS: The pipeline comprised four stages: (1) Iterative retrieval with query expansion reformulated search queries dynamically to capture pivotal RCTs and real-world studies; (2) LLM-based abstract re-ranking (DeepSeek-R1) prioritized clinically relevant evidence (e.g., cost-effectiveness analyses, head-to-head trials); (3) TextGrad iterative critique applied gradient-style optimization, refining verdicts by penalizing unsupported statements and rewarding citation alignment; (4) Structured verdict enforcement constrained outputs to categorical judgments paired with PubMed IDs. The approach was tested on claims relating to ulcerative colitis treatments, encompassing efficacy, safety, and cost-effectiveness comparisons.
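As an illustration of stage (4) only, the following is a minimal Python sketch of structured verdict enforcement, not the authors' implementation. The names `Verdict`, `enforce_verdict`, and `retrieved_pmids`, the 1-to-8-digit PubMed ID check, and the rule that every cited ID must come from the retrieved set are all assumptions introduced here for illustration.

```python
import re
from dataclasses import dataclass

# The three categorical judgments named in the abstract.
ALLOWED_VERDICTS = ("TRUE", "PARTLY TRUE", "FALSE")

# Assumption: a PubMed ID is a plain numeric string of 1 to 8 digits.
PMID_PATTERN = re.compile(r"^\d{1,8}$")


@dataclass(frozen=True)
class Verdict:
    """Hypothetical container for a validated verdict and its citations."""
    label: str
    pmids: tuple


def enforce_verdict(raw_label, raw_pmids, retrieved_pmids):
    """Validate a free-text verdict and its citations.

    raw_label: verdict text produced by the model.
    raw_pmids: iterable of PubMed ID strings cited by the model.
    retrieved_pmids: set of IDs actually returned by retrieval
        (assumed guard against hallucinated citations).
    """
    # Normalize case and collapse repeated whitespace.
    label = " ".join(raw_label.upper().split())
    if label not in ALLOWED_VERDICTS:
        raise ValueError(f"verdict must be one of {ALLOWED_VERDICTS}, got {raw_label!r}")

    pmids = tuple(p.strip() for p in raw_pmids if p.strip())
    if not pmids:
        raise ValueError("a verdict must cite at least one PubMed ID")

    for pmid in pmids:
        if not PMID_PATTERN.match(pmid):
            raise ValueError(f"malformed PubMed ID: {pmid!r}")
        if pmid not in retrieved_pmids:
            raise ValueError(f"cited PubMed ID was not retrieved: {pmid!r}")

    return Verdict(label=label, pmids=pmids)
```

In a sketch like this, a rejected output would be sent back for another critique round rather than silently repaired, so that every verdict reaching the reader carries a valid label and only citations that were actually retrieved.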
RESULTS: The hybrid model consistently turned noisy retrieval results into clear, citation-supported decisions. First-round retrieval returned approximately 8% irrelevant abstracts; re-ranking and TextGrad refinement reduced this noise, and the resulting verdicts agreed closely with trial evidence and real-world data. Structured outputs led to higher precision and, together with automatic PubMed referencing, promoted accountability.
CONCLUSIONS: The hybrid pipeline presented here showcases the potential of multi-stage AI architectures to improve soundness, transparency and scalability of evidence-based analysis in HEOR claim studies. RAG and re-ranking, combined with TextGrad critique, produce structured verdicts in support of HTA and payer negotiation. Beyond ulcerative colitis, the model is applicable to oncology, rare diseases and other therapeutic areas.
Conference/Value in Health Info
2026-05, ISPOR 2026, Philadelphia, PA, USA
Value in Health, Volume 29, Issue S6
Code
MSR175
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
SDC: Systemic Disorders/Conditions (Anesthesia, Auto-Immune Disorders (n.e.c.), Hematological Disorders (non-oncologic), Pain)