GENERAL-PURPOSE VERSUS DOMAIN-SPECIFIC AI FOR SYSTEMATIC CONFOUNDER IDENTIFICATION IN MULTIPLE SCLEROSIS: A COMPARATIVE METHODOLOGICAL STUDY USING IQWIG ASSESSMENTS AS GROUND TRUTH
Author(s)
Anton O. Wiehe1, Florian Woeste, MSc2, Pia Ana Cuk, MSc3.
1Head of AI, Pharos Labs GmbH, Hamburg, Germany, 2PHAROS Labs, Ahrensburg, Germany, 3PHAROS Labs GmbH, Hamburg, Germany.
OBJECTIVES: Systematic confounder identification is a mandatory step in HTA benefit assessments (e.g., for IQWiG in Germany), traditionally requiring extensive manual labor. While frontier Large Language Models (LLMs) offer efficiency, their reliability in regulatory contexts remains unproven. This study compares the performance of state-of-the-art general-purpose LLMs versus a domain-specific Regulatory Retrieval-Augmented Generation (RAG) system in identifying confounders for Relapsing-Remitting Multiple Sclerosis (RRMS).
METHODS: We used the IQWiG working paper GA23-02 as the ground truth, which defined 28 distinct consolidated confounders (derived from 160 initial variables) for RRMS therapies. We queried two general-purpose frontier models (Claude Opus 4.5, Gemini 3) and one domain-optimized RAG system (Regulaido) to generate confounder lists for a target trial emulation of dimethyl fumarate vs. glatiramer acetate. Performance was evaluated based on Recall (identification of the 28 ground-truth variables), Precision (avoidance of excluded variables), and Hallucination Rate (fabrication of references).
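The three evaluation metrics can be sketched as simple set operations. This is a minimal illustration with hypothetical confounder and citation lists (the actual GA23-02 variables and model outputs are not reproduced here); Recall is measured against the ground-truth set, Precision against the explicit exclusion list, and the Hallucination Rate as the share of unverifiable references.

```python
# Hypothetical ground-truth confounders and IQWiG-excluded variables
# (illustrative names only, not the actual GA23-02 list).
ground_truth = {"age", "sex", "edss_score", "relapse_rate"}
excluded = {"insurance_status", "time_since_last_relapse"}

# Hypothetical model output for one queried system.
predicted = {"age", "sex", "edss_score", "insurance_status"}

# Recall: share of ground-truth confounders the model identified.
recall = len(predicted & ground_truth) / len(ground_truth)

# Precision (as defined in this study): share of predicted variables
# that do NOT appear on the explicit exclusion list.
precision = len(predicted - excluded) / len(predicted)

# Hallucination rate: share of cited references that could not be
# verified against the source databases.
citations = ["ref_a", "ref_b", "ref_c"]
verified = {"ref_a", "ref_b"}
hallucination_rate = sum(c not in verified for c in citations) / len(citations)
```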
RESULTS: The IQWiG ground truth established 28 specific confounders. General-purpose LLMs achieved an average Recall of 66%. While Claude Opus 4.5 achieved high recall (89%), it suffered from low precision (43%), erroneously including variables explicitly excluded by IQWiG (e.g., Insurance Status, Time since last relapse) due to lack of specific regulatory context. Gemini 3 failed to identify critical biomarkers, achieving only 43% Recall. Crucially, general models exhibited a citation hallucination rate of 22%, frequently inventing study titles or attributing findings to incorrect journals. The domain-specific RAG system achieved a Recall of 89% with 100% Precision regarding IQWiG exclusion criteria and 0% citation hallucination, as it was constrained to verifiable regulatory databases.
CONCLUSIONS: Even frontier models (Claude Opus 4.5) lack the precision and citation integrity required for regulatory HTA submissions. While capable of generating extensive lists, they fail to adhere to specific exclusion criteria and fabricate evidence. Domain-optimized RAG systems significantly outperform general models in precision and evidentiary support.
Conference/Value in Health Info
2026-05, ISPOR 2026, Philadelphia, PA, USA
Value in Health, Volume 29, Issue S6
Code
MSR212
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
SDC: Neurological Disorders