EVALUATING AI MODELS FOR DRUG PRICING ANALYTICS: A COMPARATIVE STUDY FOR MOST-FAVORED-NATION POLICY MODELING
Author(s)
Mark Dranias, PhD1, Rustam Shariq Mujtaba, BSc(Pharm)2, Walter Sze Tung Lam, MBBS, MS3, Cloe Ying Chee Koh, MS, BSc(Pharm)1.
1AureusIQ LLC, Mills River, NC, USA, 2AureusIQ LLC, Singapore, Singapore, 3Department of Occupational and Environmental Medicine, Singapore General Hospital, Singapore, Singapore.
OBJECTIVES: Extracting accurate and reliable drug pricing data from global markets remains a challenge for Most-Favored-Nation (MFN) pricing models. While Large Language Models (LLMs) show promise, their comparative performance across pricing domains is underexplored. This study applies a decision-grade evaluation framework to compare two frontier models, GPT-5 and Claude 4 Opus, across multiple prompting strategies and key quality dimensions. The aim is to determine how model choice and prompt structure jointly influence accuracy, consistency, and policy-relevant performance for MFN pricing analytics, with broader implications for model evaluation in high-stakes decision-making.
METHODS: A multi-model validation study evaluated three high-value drugs across six countries (U.S., U.K., France, Canada, Germany, Japan). Each drug-country pair was queried with three prompt types: ambiguous, structured schema-constrained, and structured with MFN policy context (n=108). Structured outputs (n=72) were validated via a Python-based pipeline, while ambiguous outputs (n=36) underwent dual-blinded review, with moderate inter-rater agreement (Cohen’s κ = 0.638, p < 0.001). All outputs were scored on a 0-3 rubric (price accuracy, unit correctness, citation traceability) against human-curated ground truth. Significance was computed using McNemar’s test for paired binary variables and Wilcoxon signed-rank tests for numerical scores.
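The statistical comparisons described above (inter-rater agreement via Cohen’s κ, McNemar’s test on paired binary outcomes, Wilcoxon signed-rank tests on paired scores) can be sketched as follows. This is an illustrative sketch only: the data arrays are placeholders, not the study’s actual ratings, and the use of scipy and scikit-learn is an assumption about tooling, not the authors’ actual Python pipeline.

```python
# Illustrative sketch of the abstract's statistical tests on placeholder data.
import numpy as np
from scipy.stats import wilcoxon, binomtest
from sklearn.metrics import cohen_kappa_score

# Cohen's kappa: agreement between two blinded reviewers on 0-3 rubric scores
# (hypothetical ratings, one per ambiguous output).
rater_a = [3, 2, 3, 1, 2, 3, 0, 2, 3, 1, 2, 2]
rater_b = [3, 2, 2, 1, 2, 3, 1, 2, 3, 1, 2, 3]
kappa = cohen_kappa_score(rater_a, rater_b)

# McNemar's exact test: paired binary outcomes (e.g., citation correct yes/no)
# for the same drug-country pairs under ambiguous vs. structured prompting.
ambiguous  = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0])
structured = np.array([1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1])
b = int(np.sum((ambiguous == 1) & (structured == 0)))  # discordant: ambiguous-only correct
c = int(np.sum((ambiguous == 0) & (structured == 1)))  # discordant: structured-only correct
# Exact McNemar p-value is a two-sided binomial test on the discordant pairs.
mcnemar_p = binomtest(min(b, c), b + c, 0.5).pvalue if (b + c) > 0 else 1.0

# Wilcoxon signed-rank test: paired numerical rubric scores per condition.
scores_ambig  = [1.0, 1.5, 2.0, 1.0, 2.5, 1.5, 1.0, 2.0]
scores_struct = [2.0, 2.5, 2.5, 2.0, 3.0, 2.0, 2.5, 2.5]
w_stat, w_p = wilcoxon(scores_ambig, scores_struct)

print(f"kappa={kappa:.3f}, McNemar p={mcnemar_p:.4f}, Wilcoxon p={w_p:.4f}")
```

McNemar’s test conditions only on discordant pairs, which is why it suits the paired design here: the same drug-country query is answered under both prompt conditions, so concordant pairs carry no information about the prompt effect.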
RESULTS: Structured prompts yielded substantial performance gains over ambiguous prompts, increasing citation accuracy by 58% (p = 0.014), currency unit accuracy by 28.5% (p = 0.0002), and overall quality scores by 37.5% (p = 0.002). With structured prompting, performance exceeded policy-usability thresholds (composite score ≈ 2.11). In model comparisons, GPT-5 improved citation scores by 50% over Claude 4 Opus (p = 0.05).
CONCLUSIONS: Structured, schema-constrained prompting significantly enhances LLM performance for MFN pricing analytics. GPT-5 outperforms Claude 4 Opus, particularly in citation traceability, providing more accurate and reliable outputs for decision-grade modeling. This study highlights the growing need for robust model evaluation services across healthcare AI applications, offering actionable insights for organizations seeking reliable, AI-driven solutions.
Conference/Value in Health Info
2026-05, ISPOR 2026, Philadelphia, PA, USA
Value in Health, Volume 29, Issue S6
Code
MSR86
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas