FAST & FURIOUS HEOR IMPACT: THE AI ASSISTANT PLAYBOOK TO BUILD TRUST AND RESILIENCY IN THE FAST-PACED WORLD OF LARGE LANGUAGE MODELS (LLMS)
Author(s)
Katelyn Keyloun, BS, MS, PharmD1, Gavin J. Outteridge, MA2, Gabriel Bishop, MS, MA3, Anwar Sabir, BS4, Tyler Reinsch, PharmD5, Justin Yu, PharmD, MS6
1Arysana, Director, Product Innovation & Development, Carson City, NV, USA, 2Arysana, London, United Kingdom, 3Arysana, Palo Alto, CA, USA, 4Arysana, Boston, MA, USA, 5Arysana, Springfield, MO, USA, 6Arysana, Jersey City, NJ, USA
OBJECTIVES: While broad use of AI Assistants grows under a ‘trust, then verify’ approach, use for HEOR tasks largely warrants verification first. Leading Generative AI (GenAI) and retrieval-augmented generation (RAG) validation methods for ‘verify, then trust’, such as human-in-the-loop testing, generally lack task specificity, do not account for robustness, and do not estimate the value for HEOR. The objective was therefore to develop and apply an HEOR AI Assistant validation framework.
METHODS: A targeted review of PubMed and an internet search informed the new framework. The framework was applied to a research-tailored HEOR AI Assistant for evidence summarization using 278 articles, applying pre- and post-processing steps, and using a RAG approach with OpenAI’s GPT-5 LLM. Article processing included: PDF-to-text (Markdown) conversion, semantic text chunking, contextual enrichment, 3072-dimensional embedding using the OpenAI text-embedding-3-large model, and a vector database to facilitate RAG. A test set of 12 queries/answers developed a priori was compared against AI Assistant responses, with a passing threshold of ≥90% judged by human testers. Bias was assessed through 3 queries. A robustness test is planned by comparing LLMs. Yield was estimated from reported minutes saved per query, adjusted for performance.
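The retrieval step of the pipeline described above can be sketched minimally as follows. This is an illustrative sketch, not the authors’ implementation: it assumes article chunks have already been embedded (e.g., with OpenAI’s text-embedding-3-large, which produces 3072-dimensional vectors), and it stands in for a vector-database lookup with a plain dictionary; the function names and toy 3-dimensional vectors are hypothetical.

```python
from math import sqrt

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_top_k(query_vec: list[float], chunk_index: dict, k: int = 3) -> list[str]:
    """Return the k chunk ids most similar to the query embedding.

    chunk_index maps chunk_id -> embedding vector, standing in for a
    vector-database similarity search.
    """
    scored = sorted(
        chunk_index.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [chunk_id for chunk_id, _ in scored[:k]]

# Toy 3-dimensional stand-ins for 3072-dimensional embeddings.
index = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.9, 0.1, 0.0],
    "chunk_c": [0.0, 0.0, 1.0],
}
print(retrieve_top_k([1.0, 0.05, 0.0], index, k=2))
```

In a full RAG loop, the retrieved chunks would then be passed as context to the LLM to ground the evidence summary.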
RESULTS: Three relevant frameworks across 46 articles informed the new GenAI-VERIFY framework: ELEVATE-GenAI, CHEERS-AI, and the DEAL-B checklist. Validation and Evaluation includes testing overall performance with predefined queries/answers; Robustness includes comparing LLMs; Integrity and Fairness includes transparency, reproducibility, and bias; Yield estimates the value for HEOR. Applying GenAI-VERIFY, the AI Assistant performance was 91% (30/33 queries). Testing was temporally varied, supporting reproducibility; assessment of bias supported fairness for 3/3 queries. Estimated research time saved was 54.6 minutes per query.
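The performance-adjusted yield figure can be reproduced with a simple calculation: multiply the reported minutes saved per query by the observed pass rate. The 60-minute input below is an assumed value for illustration only; the abstract reports the pass rate (91%, 30/33) and the adjusted result (54.6 minutes), not the raw input.

```python
# Illustrative yield calculation: adjust reported time saved per query
# by the AI Assistant's observed pass rate.
passed, total = 30, 33                  # queries passing the predefined answer check
pass_rate = round(passed / total, 2)    # 0.91, as reported
reported_minutes_saved = 60.0           # ASSUMED reported minutes saved per query
adjusted_yield = reported_minutes_saved * pass_rate
print(round(adjusted_yield, 1))         # performance-adjusted minutes saved per query
```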
CONCLUSIONS: The new GenAI-VERIFY validation framework can be applied to research-tailored AI Assistants for HEOR tasks to support verification, as well as integrity, fairness, and value. Future research includes automated testing approaches for greater scalability.
Conference/Value in Health Info
2026-05, ISPOR 2026, Philadelphia, PA, USA
Value in Health, Volume 29, Issue S6
Code
MSR186
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas