BEYOND THE HYPE: DO EXISTING VALIDATION METHODS AND GOLD STANDARDS PROVE THE VALUE OF GENERATIVE AI APPLICATIONS FOR HEOR?

Author(s)

Tim Disher, BSc, RN, PhD1, Nicole Ferko, MSc2, Kevin Kallmes, JD, MA3;
1Sandpiper Analytics, West Porters Lake, NS, Canada, 2EVERSANA, Burlington, ON, Canada, 3Nested Knowledge, St. Paul, MN, USA
OBJECTIVES: To evaluate the current state of the art (SOTA) for Generative Artificial Intelligence (GenAI) in Health Economics and Outcomes Research (HEOR) and assess validation rigor and the emerging landscape of tool development.
METHODS: We conducted a targeted review of PubMed, medRxiv, arXiv, ISPOR, and commercial websites for GenAI applications. Screening utilized human-in-the-loop tagging; analysis employed GenAI extraction (Gemini 2.5 Flash-Lite) with human review. We evaluated maturity by application and critically appraised validation methodologies.
RESULTS: A total of 639 sources were included. SOTA comprised: for SLRs, human/superhuman automation of the entire search and analysis pipeline; for medical writing, automated multi-modal retrieval-augmented generation (RAG) pipelines for dossiers; for modelling, structure scoping, semi-automated model replication, and automated coding; and for RWE, conversational design and analysis. Validation quality is strongest for SLRs but varies greatly in depth and breadth. Across domains, studies struggle to identify gold standards and focus on simple topics/tasks. Evidence suggests that specialized vendors report better metrics than independent researchers. Validation is weakest in medical writing, consisting largely of opaque case studies. Most applications focus on human-in-the-loop tools targeting reductions in labor (eg, consistently reported 60%-80% time savings), though a small number of vendors target human replacement for complex tasks. Nearly all sources are based on outdated non-reasoning models.
CONCLUSIONS: GenAI applications exist for nearly all HEOR use cases, but validation rarely reflects real-world project complexity. There is substantial uncertainty regarding reference standards and likely over-estimation of labor savings. No evidence currently evaluates GenAI's reasoning on complex strategic questions. Current applications risk increasing complexity without improving quality and may introduce difficult-to-trace failure modes. Tool development is significantly outpaced by foundation model improvements, suggesting greater success is possible with modern models. Future publications should prioritize capturing real-world complexity, explore model decisions where multiple defensible options exist, align on gold standards, and identify applications where GenAI improves outcomes rather than solely reducing costs.

Conference/Value in Health Info

2026-05, ISPOR 2026, Philadelphia, PA, USA

Value in Health, Volume 29, Issue S6

Code

MSR93

Topic

Methodological & Statistical Research

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics

Disease

No Additional Disease & Conditions/Specialized Treatment Areas
