Can Generative AI Automate Health Economic Model Verification?
Author(s)
Jag Chhatwal, PhD1, Sumeyye Samur, PhD2, Ismail Fatih Yildirim, MSc3, Jamie Elvidge, BA, MSc4, Kusal Lokuge, PhD5, Steve Sharp, MSc4, Jaykrit Palani, MS3, Akash Ramanarayanan, BS3, Rachael Fleurence, MSc, PhD3, Turgay Ayer, PhD6.
1Harvard Medical School / Massachusetts General Hospital, Boston, MA, USA, 2VP, Head of Value & Access, Value Analytics Labs, Boston, MA, USA, 3Value Analytics Labs, Boston, MA, USA, 4National Institute for Health and Care Excellence, Manchester, United Kingdom, 5National Institute for Health and Care Excellence, Manchester, United Kingdom, 6Georgia Institute of Technology, Atlanta, GA, USA.
OBJECTIVES: Verification is a critical component of health economic model quality assurance (QA) but remains resource-intensive and under-resourced. We evaluated the feasibility of using generative AI (GenAI) to automate the validation checklist used by the National Institute for Health and Care Excellence (NICE) for internal QA of cost-effectiveness models. We tested this approach on an Excel-based cost-effectiveness model developed by NICE's guidelines programme.
METHODS: We applied OpenAI’s O4-Mini large language model (LLM) to automate 13 structured verification tests derived from the checklist, designed to systematically identify technical errors, logical inconsistencies, and input/output discrepancies. We evaluated the approach on a cost-effectiveness model comparing Maintenance and Reliever Therapy versus inhaled corticosteroid/long-acting beta agonist treatment strategies for asthma from the UK NHS perspective over a 5-year horizon. Using Python (xlwings, openpyxl) and LangChain frameworks, model parameters were programmatically modified to simulate null, extreme, and boundary conditions (e.g., adjusting discount rates, utilities, costs, mortality). The LLM was prompted with the structured model outputs and tasked with assessing their alignment with predefined expected outcomes, returning pass/fail determinations with diagnostic explanations.
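The verification loop described above can be sketched in miniature. This is a hypothetical illustration, not the authors' pipeline: the Excel model driven via xlwings is replaced by a stub cohort model so the harness logic is self-contained, and all function and parameter names (`run_model`, `run_verification_test`, `utility`, `discount_rate`) are invented for this example. In the real system, the output record would be passed to the LLM for a pass/fail judgment rather than checked with a lambda.

```python
# Hypothetical sketch of an automated verification harness of the kind
# described in METHODS. A stub discounted-QALY model stands in for the
# Excel model that the real pipeline drives via xlwings.

def run_model(utility=0.8, discount_rate=0.035, horizon_years=5):
    """Stub cost-effectiveness model: total discounted QALYs over the horizon."""
    return sum(utility / (1 + discount_rate) ** t for t in range(horizon_years))

def run_verification_test(name, overrides, check):
    """Apply parameter overrides, run the model, and record pass/fail."""
    output = run_model(**overrides)
    return {"test": name, "output": output, "passed": check(output)}

# Null condition: setting utilities to zero should yield zero discounted QALYs.
t1 = run_verification_test("zero_utilities", {"utility": 0.0},
                           lambda q: q == 0.0)

# Extreme condition: a higher discount rate should reduce discounted QALYs.
base_qalys = run_model()
t2 = run_verification_test("higher_discount", {"discount_rate": 0.10},
                           lambda q: q < base_qalys)

print(t1["passed"], t2["passed"])  # True True for a correctly wired model
```

The design mirrors the two checks reported in RESULTS (zero utilities, increased discounting); each scenario is expressed as an override plus an expected-outcome predicate, which is what makes the battery of 13 tests mechanically repeatable.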
RESULTS: The model had previously passed NICE’s standard QA process. The GenAI system correctly passed 12 of 13 verification tests. For example, setting utilities to zero returned zero quality-adjusted life years (QALYs), and increasing discount rates reduced discounted outcomes, as expected. The one failed test was due to the age- and sex-based default utility values in the lookup tables not having been fully updated by the user, rather than a model error. Once corrected, the LLM passed the test. No hallucinations were observed. The fully automated pipeline executed model verification scenarios rapidly and reproducibly, suggesting efficiency gains over manual verification.
CONCLUSIONS: GenAI-enabled verification shows promise for scalable, transparent, and reproducible QA in health economic modeling. Future work should explore GenAI’s role in automating other validation types recommended by the ISPOR/SMDM guidelines.
Conference/Value in Health Info
2025-11, ISPOR Europe 2025, Glasgow, Scotland
Value in Health, Volume 28, Issue S2
Code
MSR52
Topic
Economic Evaluation, Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas