EVALUATING AI MODELS FOR DRUG PRICING ANALYTICS: A COMPARATIVE STUDY FOR MOST-FAVORED-NATION POLICY MODELING
Author(s)
Mark Dranias, PhD1, Rustam Shariq Mujtaba, BSc(Pharm)2, Walter Sze Tung Lam, MBBS, MS3, Cloe Ying Chee Koh, MS, BSc(Pharm)1.
1AureusIQ LLC, Mills River, NC, USA, 2AureusIQ LLC, Singapore, Singapore, 3Department of Occupational and Environmental Medicine, Singapore General Hospital, Singapore, Singapore.
OBJECTIVES: Extracting accurate and reliable drug pricing data from global markets remains a challenge for Most-Favored-Nation (MFN) pricing models. While Large Language Models (LLMs) show promise, their comparative performance across pricing domains is underexplored. This study applies a decision-grade evaluation framework to compare two frontier models, GPT-5 and Claude 4 Opus, across multiple prompting strategies and key quality dimensions. The aim is to determine how model choice and prompt structure jointly influence accuracy, consistency, and policy-relevant performance for MFN pricing analytics, with broader implications for model evaluation in high-stakes decision-making.
METHODS: A multi-model validation study evaluated three high-value drugs across six countries (U.S., U.K., France, Canada, Germany, Japan). Each drug-country pair was queried with three prompt types: ambiguous, structured schema-constrained, and structured with MFN policy context (n=108). Structured outputs (n=72) were validated via a Python-based pipeline, while ambiguous outputs (n=36) underwent dual-blinded review, with moderate inter-rater agreement (Cohen’s κ = 0.638, p < 0.001). All outputs were scored on a 0-3 rubric (price accuracy, unit correctness, citation traceability) against human-curated ground truth. Significance was computed using McNemar’s test for paired binary variables and Wilcoxon signed-rank tests for numerical scores.
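The statistical comparisons described above (inter-rater agreement via Cohen’s κ, McNemar’s test on paired binary outcomes, Wilcoxon signed-rank tests on paired scores) can be sketched as follows. This is an illustrative sketch only: the data arrays are placeholders, not the study’s actual ratings, and the use of scipy and scikit-learn is an assumption about tooling, not the authors’ actual Python pipeline.

```python
# Illustrative sketch of the abstract's statistical tests on placeholder data.
import numpy as np
from scipy.stats import wilcoxon, binomtest
from sklearn.metrics import cohen_kappa_score

# Cohen's kappa: agreement between two blinded reviewers on 0-3 rubric scores
# (hypothetical ratings, one per ambiguous output).
rater_a = [3, 2, 3, 1, 2, 3, 0, 2, 3, 1, 2, 2]
rater_b = [3, 2, 2, 1, 2, 3, 1, 2, 3, 1, 2, 3]
kappa = cohen_kappa_score(rater_a, rater_b)

# McNemar's exact test: paired binary outcomes (e.g., citation correct yes/no)
# for the same drug-country pairs under ambiguous vs. structured prompting.
ambiguous  = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0])
structured = np.array([1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1])
b = int(np.sum((ambiguous == 1) & (structured == 0)))  # discordant: ambiguous-only correct
c = int(np.sum((ambiguous == 0) & (structured == 1)))  # discordant: structured-only correct
# Exact McNemar p-value is a two-sided binomial test on the discordant pairs.
mcnemar_p = binomtest(min(b, c), b + c, 0.5).pvalue if (b + c) > 0 else 1.0

# Wilcoxon signed-rank test: paired numerical rubric scores per condition.
scores_ambig  = [1.0, 1.5, 2.0, 1.0, 2.5, 1.5, 1.0, 2.0]
scores_struct = [2.0, 2.5, 2.5, 2.0, 3.0, 2.0, 2.5, 2.5]
w_stat, w_p = wilcoxon(scores_ambig, scores_struct)

print(f"kappa={kappa:.3f}, McNemar p={mcnemar_p:.4f}, Wilcoxon p={w_p:.4f}")
```

McNemar’s test conditions only on discordant pairs, which is why it suits the paired design here: the same drug-country query is answered under both prompt conditions, so concordant pairs carry no information about the prompt effect.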
RESULTS: Structured prompts yielded substantial performance gains over ambiguous prompts, increasing citation accuracy by 58% (p = 0.014), currency unit accuracy by 28.5% (p = 0.0002), and overall quality scores by 37.5% (p = 0.002). With structured prompting, performance exceeded policy-usability thresholds (composite score ≈ 2.11). In model comparisons, GPT-5 improved citation scores by 50% over Claude 4 Opus (p = 0.05).
CONCLUSIONS: Structured, schema-constrained prompting significantly enhances LLM performance for MFN pricing analytics. GPT-5 outperforms Claude 4 Opus, particularly in citation traceability, providing more accurate and reliable outputs for decision-grade modeling. This study highlights the growing need for robust model evaluation services across healthcare AI applications, offering actionable insights for organizations seeking reliable, AI-driven solutions.
Conference/Value in Health Info
2026-05, ISPOR 2026, Philadelphia, PA, USA
Value in Health, Volume 29, Issue S6
Code
MSR86
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas