EVALUATING AI MODELS FOR DRUG PRICING ANALYTICS: A COMPARATIVE STUDY FOR MOST-FAVORED-NATION POLICY MODELING

Author(s)

Mark Dranias, PhD1, Rustam Shariq Mujtaba, BSc(Pharm)2, Walter Sze Tung Lam, MBBS, MS3, Cloe Ying Chee Koh, MS, BSc(Pharm)1.
1AureusIQ LLC, Mills River, NC, USA, 2AureusIQ LLC, Singapore, Singapore, 3Department of Occupational and Environmental Medicine, Singapore General Hospital, Singapore, Singapore.
OBJECTIVES: Extracting accurate and reliable drug pricing data from global markets remains a challenge for Most-Favored-Nation (MFN) pricing models. Large Language Models (LLMs) show promise, but their comparative performance across pricing domains is underexplored. This study applies a decision-grade evaluation framework to compare two frontier models, GPT-5 and Claude 4 Opus, across multiple prompting strategies and key quality dimensions, to determine how model choice and prompt structure jointly influence accuracy, consistency, and policy-relevant performance for MFN pricing analytics. The findings carry broader implications for model evaluation in high-stakes decision-making.
METHODS: A multi-model validation study evaluated three high-value drugs across six countries (U.S., U.K., France, Canada, Germany, Japan). Each drug-country pair was queried with three prompt types: ambiguous, structured schema-constrained, and structured with MFN policy context (n=108). Structured outputs (n=72) were validated via a Python-based pipeline, while ambiguous outputs (n=36) underwent dual blinded review with moderate inter-rater agreement (Cohen’s κ = 0.638, p < 0.001). All outputs were scored against human-curated ground truth on a 0-3 rubric covering price accuracy, unit correctness, and citation traceability. Significance was assessed using McNemar’s test for binary variables and Wilcoxon tests for numerical scores.
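The binary accuracy comparisons described above (structured vs. ambiguous prompts on the same drug-country pairs) lend themselves to McNemar's test on discordant pairs. A minimal, stdlib-only sketch of the exact (binomial) form of the test is shown below; the function name and the example pair counts are illustrative, not taken from the study's pipeline.

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Exact two-sided McNemar p-value from discordant pair counts.

    b -- pairs scored correct under prompt A only
    c -- pairs scored correct under prompt B only
    Concordant pairs (both correct or both wrong) do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c)
    # Exact binomial tail with p = 0.5, doubled for a two-sided test;
    # capped at 1.0 since doubling can exceed it when b == c.
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(p, 1.0)

# Illustrative example: 5 pairs where only the structured prompt was correct,
# 0 pairs where only the ambiguous prompt was correct.
print(mcnemar_exact(5, 0))
```

In practice a library routine (e.g. `statsmodels.stats.contingency_tables.mcnemar` with `exact=True`) would typically be used; the sketch only makes the computation on discordant pairs explicit.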
RESULTS: Structured prompts yielded substantial performance gains over ambiguous prompts, increasing citation accuracy by 58% (p = 0.014), currency unit accuracy by 28.5% (p = 0.0002), and overall quality scores by 37.5% (p = 0.002). With structured prompting, performance exceeded the policy-usability threshold (composite score ≈ 2.11). In model comparisons, GPT-5 increased citation scores by 50% over Claude 4 Opus (p = 0.05).
CONCLUSIONS: Structured, schema-constrained prompting significantly enhances LLM performance for MFN pricing analytics. GPT-5 outperforms Claude 4 Opus, particularly in citation traceability, providing more accurate and reliable outputs for decision-grade modeling. This study highlights the growing need for robust model evaluation services across healthcare AI applications, offering actionable insights for organizations seeking reliable, AI-driven solutions.

Conference/Value in Health Info

2026-05, ISPOR 2026, Philadelphia, PA, USA

Value in Health, Volume 29, Issue S6

Code

MSR86

Topic

Methodological & Statistical Research

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics

Disease

No Additional Disease & Conditions/Specialized Treatment Areas
