What Is the Value of Large Language Models for Improving Extraction and Interpretation of Efficiency Data From French Commission d'Évaluation Économique et de Santé Publique (CEESP) Decisions on Breast Cancer Medications?
Author(s)
Ludovic Lamarsalle, MSc, PharmD1, Martin Prodel, PhD2, Stephane Roze, MSc3.
1Founder, HEALSTRA, Lyon, France, 2DALI, Lyon, France, 3Homax Advisory, Pommiers, France.
OBJECTIVES: Insight into health-economic evidence is pivotal to understanding reimbursement drivers. The economic committee (CEESP) of the French Haute Autorité de Santé conducts cost-effectiveness assessments for products that claim added clinical benefit (ASMR I-III) and are expected to exceed €20 million in annual sales. We assessed whether state-of-the-art Large Language Models (LLMs) can perform data extraction from recent CEESP breast-cancer opinions.
METHODS: Sixteen opinions released between March 2014 and April 2024 were analyzed with three LLMs (ChatGPT-4o, Claude-3.5-Sonnet, Mistral-Large 24.11) using a unified prompt covering 65 data fields: 23 administrative and 42 health-economic variables (e.g., incremental cost-effectiveness ratio, modeling assumptions, utilities). LLM outputs were compared pair-wise and against a manually curated reference created by senior health-economists.
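To illustrate the comparison step, the Python sketch below computes pairwise inter-LLM agreement and per-model similarity against the expert reference for a single data field. It is a minimal sketch only: the similarity metric (difflib ratio on normalized strings), the 0.9 agreement threshold, and the function and model names are assumptions for illustration, not the measures actually used in the study.

from difflib import SequenceMatcher
from itertools import combinations

def normalize(value):
    # Lower-case and collapse whitespace before comparison (assumed normalization).
    return " ".join(str(value).lower().split())

def similarity(a, b):
    # String similarity in [0, 1] between two extracted field values.
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def compare_field(llm_values, reference, threshold=0.9):
    # Pairwise inter-LLM agreement plus per-model similarity to the expert reference.
    pairwise = {
        (m1, m2): similarity(v1, v2)
        for (m1, v1), (m2, v2) in combinations(llm_values.items(), 2)
    }
    full_agreement = all(score >= threshold for score in pairwise.values())
    vs_reference = {model: similarity(v, reference) for model, v in llm_values.items()}
    return {"pairwise": pairwise, "full_agreement": full_agreement, "vs_reference": vs_reference}

# Example call for one of the 65 fields of a single opinion (values elided):
# compare_field({"chatgpt-4o": "...", "claude-3.5-sonnet": "...", "mistral-large": "..."}, reference="...")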
RESULTS: The analysis of 16 CEESP opinions revealed two performance patterns. Inter-LLM similarity analysis showed moderate concordance: perfect agreement among all three LLMs was achieved in 33.8% of fields, with partial agreement (≥2 LLMs) in 70.2% of cases. Structured administrative fields demonstrated high inter-model consistency: intervention type (100% agreement), therapeutic area (100%), and regulatory parameters (>90% agreement). However, when compared against expert-validated reference data, overall performance was lower. ChatGPT-4o achieved the highest similarity score (0.550), followed by Claude-3.5-Sonnet (0.537) and Mistral-Large (0.519). Important discrepancies emerged in medico-economic fields: target population extraction, cost-effectiveness ratio values, and efficiency conditions. Only 31% of fields achieved excellent reference concordance (≥90%), while 52.4% showed moderate performance (<70%). Complex numerical data and free-text clinical descriptions posed the greatest challenges, indicating that LLM consensus does not guarantee accuracy against expert validation.
CONCLUSIONS: LLMs demonstrate promising potential for automating health economic data extraction from regulatory documents. However, limitations persist for complex medico-economic parameters requiring specialized domain knowledge. A multi-LLM consensus approach with automated disagreement detection could trigger expert review only for discordant results, significantly reducing manual workload while maintaining quality standards.
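A minimal sketch of the proposed triage workflow is shown below, assuming a simple majority vote over normalized field values as the disagreement-detection rule; the abstract does not specify how discordance would be detected, and all names are illustrative placeholders.

from collections import Counter

def consensus_or_flag(llm_values):
    # Majority vote over normalized values; full three-way disagreement
    # routes the field to manual expert review (assumed rule).
    counts = Counter(" ".join(str(v).lower().split()) for v in llm_values.values())
    value, votes = counts.most_common(1)[0]
    if votes >= 2:
        return value, False   # consensus reached, accept automatically
    return None, True         # discordant result, trigger expert review

def fields_needing_review(extractions):
    # For one CEESP opinion, list the data fields where the LLMs fully disagree.
    return [field for field, values in extractions.items()
            if consensus_or_flag(values)[1]]

Under such a rule, expert effort would concentrate on the minority of fields with no inter-LLM agreement, consistent with the workload reduction suggested above.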
Conference/Value in Health Info
2025-11, ISPOR Europe 2025, Glasgow, Scotland
Value in Health, Volume 28, Issue S2
Code
MSR221
Topic
Methodological & Statistical Research, Study Approaches
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
Oncology