Can Generative AI Automate Health Economic Model Verification?
Author(s)
Jag Chhatwal, PhD1, Sumeyye Samur, PhD2, Ismail Fatih Yildirim, MSc3, Jamie Elvidge, BA, MSc4, Kusal Lokuge, PhD5, Steve Sharp, MSc4, Jaykrit Palani, MS3, Akash Ramanarayanan, BS3, Rachael Fleurence, MSc, PhD3, Turgay Ayer, PhD6.
1Harvard Medical School / Massachusetts General Hospital, Boston, MA, USA, 2VP, Head of Value & Access, Value Analytics Labs, Boston, MA, USA, 3Value Analytics Labs, Boston, MA, USA, 4National Institute for Health and Care Excellence, Manchester, United Kingdom, 5National Institute for Health and Care Excellence, Manchester, United Kingdom, 6Georgia Institute of Technology, Atlanta, GA, USA.
OBJECTIVES: Verification is a critical component of health economic model quality assurance (QA) but remains resource-intensive and under-resourced. We evaluated the feasibility of using generative AI (GenAI) to automate the validation checklist used by the National Institute for Health and Care Excellence (NICE) for internal QA of cost-effectiveness models. We tested this approach on an Excel-based cost-effectiveness model developed by NICE's guidelines programme.
METHODS: We applied OpenAI’s O4-Mini large language model (LLM) to automate 13 structured verification tests derived from the checklist, designed to systematically identify technical errors, logical inconsistencies, and input/output discrepancies. We evaluated the approach on a cost-effectiveness model comparing Maintenance and Reliever Therapy versus inhaled corticosteroid/long-acting beta agonist treatment strategies for asthma from the UK NHS perspective over a 5-year horizon. Using Python (xlwings, openpyxl) and LangChain frameworks, model parameters were programmatically modified to simulate null, extreme, and boundary conditions (e.g., adjusting discount rates, utilities, costs, mortality). The LLM was prompted with the structured model outputs and tasked with assessing their alignment with predefined expected outcomes, returning pass/fail determinations with diagnostic explanations.
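The verification loop described above can be sketched in miniature. This is a hypothetical illustration, not the authors' pipeline: the Excel model driven via xlwings is replaced by a stub cohort model so the harness logic is self-contained, and all function and parameter names (`run_model`, `run_verification_test`, `utility`, `discount_rate`) are invented for this example. In the real system, the output record would be passed to the LLM for a pass/fail judgment rather than checked with a lambda.

```python
# Hypothetical sketch of an automated verification harness of the kind
# described in METHODS. A stub discounted-QALY model stands in for the
# Excel model that the real pipeline drives via xlwings.

def run_model(utility=0.8, discount_rate=0.035, horizon_years=5):
    """Stub cost-effectiveness model: total discounted QALYs over the horizon."""
    return sum(utility / (1 + discount_rate) ** t for t in range(horizon_years))

def run_verification_test(name, overrides, check):
    """Apply parameter overrides, run the model, and record pass/fail."""
    output = run_model(**overrides)
    return {"test": name, "output": output, "passed": check(output)}

# Null condition: setting utilities to zero should yield zero discounted QALYs.
t1 = run_verification_test("zero_utilities", {"utility": 0.0},
                           lambda q: q == 0.0)

# Extreme condition: a higher discount rate should reduce discounted QALYs.
base_qalys = run_model()
t2 = run_verification_test("higher_discount", {"discount_rate": 0.10},
                           lambda q: q < base_qalys)

print(t1["passed"], t2["passed"])  # True True for a correctly wired model
```

The design mirrors the two checks reported in RESULTS (zero utilities, increased discounting); each scenario is expressed as an override plus an expected-outcome predicate, which is what makes the battery of 13 tests mechanically repeatable.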
RESULTS: The model had previously passed NICE’s standard QA process. The GenAI system correctly passed 12 of 13 verification tests. For example, setting utilities to zero returned zero quality-adjusted life years (QALYs), and increasing discount rates reduced discounted outcomes, as expected. The one failed test was due to the age- and sex-based default utility values in the lookup tables not having been fully updated by the user, rather than a model error. Once corrected, the LLM passed the test. No hallucinations were observed. The fully automated pipeline executed model verification scenarios rapidly and reproducibly, suggesting efficiency gains over manual verification.
CONCLUSIONS: GenAI-enabled verification shows promise for scalable, transparent, and reproducible QA in health economic modeling. Future work should explore GenAI’s role in automating other validation types recommended by the ISPOR/SMDM guidelines.
Conference/Value in Health Info
2025-11, ISPOR Europe 2025, Glasgow, Scotland
Value in Health, Volume 28, Issue S2
Code
MSR52
Topic
Economic Evaluation, Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas