BENCHMARKING GENERATIVE AI TOOLS FOR COST-EFFECTIVENESS ANALYSIS (CEA) MODEL INPUT CURATION: A COMPARATIVE EVALUATION OF LARGE LANGUAGE MODELS

Author(s)

Sourab Ganna, PharmD1, Yinan Wang, PhD2, Jieni Li, MPH, PhD2, Rajender Aparasu, PhD2;
1University of Houston, Student, Houston, TX, USA, 2University of Houston, Houston, TX, USA
OBJECTIVES: Identifying high-quality parameter inputs for cost-effectiveness analysis (CEA) models is critical but time-consuming. However, the application of large language models (LLMs) in this context remains insufficiently evaluated. We aimed to assess the accuracy, transparency, and reproducibility of three LLMs in sourcing and documenting CEA model parameters from published literature.
METHODS: A structured evaluation of generative artificial intelligence tools (ChatGPT, Claude, Gemini) was conducted to assess their accuracy, transparency, and reproducibility in curating inputs for CEA models. Using a previously developed cost-effectiveness model in major depressive disorder as a reference framework, the three LLMs were used to source five predefined model inputs with standardized prompts. Inputs included probabilities, utilities, and cost estimates relevant to the model’s structure and perspective. Outputs were scored independently by two reviewers using a rubric adapted from ELEVATE-GenAI, PALISADE, and CHEERS-AI across eight domains, including accuracy, transparency, and citation validity, with a maximum score of 12.
RESULTS: All three LLMs retrieved literature-based parameter estimates relevant to the predefined CEA framework. ChatGPT and Claude achieved the highest overall performance (10/12 points each), demonstrating strong accuracy, contextual relevance, and citation validity across most parameters. Gemini scored 9/12, with comparable relevance and efficiency but greater limitations in completeness and reproducibility. Across models, minor performance deficits were driven by incomplete uncertainty reporting and occasional ambiguity in citation traceability rather than by incorrect values. All three LLMs produced similar CEA outputs. All tools completed parameter retrieval within the predefined 10-minute efficiency threshold, suggesting meaningful time savings over manual literature search.
CONCLUSIONS: Generative AI tools demonstrated feasible and accurate performance with similar CEA outputs, supporting initial economic model development. While overall accuracy and relevance were high, variability in transparency, completeness, and reproducibility highlights the need for structured prompting and expert verification, particularly for reporting uncertainty and confirming citations.

Conference/Value in Health Info

2026-05, ISPOR 2026, Philadelphia, PA, USA

Value in Health, Volume 29, Issue S6

Code

P1

Topic

Health Technology Assessment

Disease

No Additional Disease & Conditions/Specialized Treatment Areas
