FAST & FURIOUS HEOR IMPACT: THE AI ASSISTANT PLAYBOOK TO BUILD TRUST AND RESILIENCY IN THE FAST-PACED WORLD OF LARGE LANGUAGE MODELS (LLMS)
Author(s)
Katelyn Keyloun, BS, MS, PharmD1, Gavin J. Outteridge, MA2, Gabriel Bishop, MS, MA3, Anwar Sabir, BS4, Tyler Reinsch, PharmD5, Justin Yu, PharmD, MS6
1Arysana, Director, Product Innovation & Development, Carson City, NV, USA, 2Arysana, London, United Kingdom, 3Arysana, Palo Alto, CA, USA, 4Arysana, Boston, MA, USA, 5Arysana, Springfield, MO, USA, 6Arysana, Jersey City, NJ, USA
OBJECTIVES: While broad use of AI Assistants grows under a ‘trust, then verify’ approach, use for HEOR tasks largely warrants verification first. Leading Generative AI (GenAI) and retrieval-augmented generation (RAG) validation methods for ‘verify, then trust’, such as human-in-the-loop testing, generally lack task specificity, do not account for robustness, and do not estimate the value for HEOR. The objective was therefore to develop and apply an HEOR AI Assistant validation framework.
METHODS: A targeted review of PubMed and an internet search informed the new framework. The framework was applied to a research-tailored HEOR AI Assistant for evidence summarization using 278 articles, applying pre- and post-processing steps, and using a RAG approach with OpenAI’s GPT-5 LLM. Article processing included: PDF-to-text (Markdown) conversion, semantic text chunking, contextual enrichment, 3072-dimensional embedding using the OpenAI text-embedding-3-large model, and a vector database to facilitate RAG. A test set of 12 queries/answers developed a priori was compared against AI Assistant responses, with a passing threshold of ≥90% judged by human testers. Bias was assessed through 3 queries. A robustness test is planned by comparing LLMs. Yield was estimated from reported minutes saved per query, adjusted for performance.
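The retrieval step of the pipeline described above can be sketched minimally as follows. This is an illustrative sketch, not the authors’ implementation: it assumes article chunks have already been embedded (e.g., with OpenAI’s text-embedding-3-large, which produces 3072-dimensional vectors), and it stands in for a vector-database lookup with a plain dictionary; the function names and toy 3-dimensional vectors are hypothetical.

```python
from math import sqrt

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_top_k(query_vec: list[float], chunk_index: dict, k: int = 3) -> list[str]:
    """Return the k chunk ids most similar to the query embedding.

    chunk_index maps chunk_id -> embedding vector, standing in for a
    vector-database similarity search.
    """
    scored = sorted(
        chunk_index.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [chunk_id for chunk_id, _ in scored[:k]]

# Toy 3-dimensional stand-ins for 3072-dimensional embeddings.
index = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.9, 0.1, 0.0],
    "chunk_c": [0.0, 0.0, 1.0],
}
print(retrieve_top_k([1.0, 0.05, 0.0], index, k=2))
```

In a full RAG loop, the retrieved chunks would then be passed as context to the LLM to ground the evidence summary.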
RESULTS: Three relevant frameworks across 46 articles informed the new GenAI-VERIFY framework: ELEVATE-GenAI, CHEERS-AI, and the DEAL-B checklist. Validation and Evaluation includes testing overall performance with predefined queries/answers; Robustness includes comparing LLMs; Integrity and Fairness includes transparency, reproducibility, and bias; Yield estimates the value for HEOR. Applying GenAI-VERIFY, the AI Assistant performance was 91% (30/33 queries). Testing was temporally varied, supporting reproducibility; assessment of bias supported fairness for 3/3 queries. Estimated research time saved was 54.6 minutes per query.
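The performance-adjusted yield figure can be reproduced with a simple calculation: multiply the reported minutes saved per query by the observed pass rate. The 60-minute input below is an assumed value for illustration only; the abstract reports the pass rate (91%, 30/33) and the adjusted result (54.6 minutes), not the raw input.

```python
# Illustrative yield calculation: adjust reported time saved per query
# by the AI Assistant's observed pass rate.
passed, total = 30, 33                  # queries passing the predefined answer check
pass_rate = round(passed / total, 2)    # 0.91, as reported
reported_minutes_saved = 60.0           # ASSUMED reported minutes saved per query
adjusted_yield = reported_minutes_saved * pass_rate
print(round(adjusted_yield, 1))         # performance-adjusted minutes saved per query
```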
CONCLUSIONS: The new GenAI-VERIFY validation framework can be applied to research-tailored AI Assistants for HEOR tasks to support verification, as well as integrity, fairness, and value. Future research includes automated testing approaches for greater scalability.
Conference/Value in Health Info
2026-05, ISPOR 2026, Philadelphia, PA, USA
Value in Health, Volume 29, Issue S6
Code
MSR186
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas