VALIDATION OF THE METASLR ROB MODULE: A MULTI-AGENT GENERATIVE AI SYSTEM FOR AUTOMATING COCHRANE RISK OF BIAS 2.0 ASSESSMENTS
Author(s)
Inderpreet S. Marwaha, MSc, RPh1, Rajdeep Kaur, PhD1, Ritesh Dubey, PharmD1, Shubhram Pandey, MSc1, Barinder Singh, RPh2.
1Pharmacoevidence Pvt. Ltd., SAS Nagar, Mohali, India, 2Pharmacoevidence Pvt. Ltd., SAS Nagar, Mohali, India.
OBJECTIVES: Risk of Bias (RoB) assessment is a critical yet resource-intensive element of systematic literature reviews (SLRs), and its quality is often limited by poor inter-rater consistency. This study validated a Retrieval-Augmented Generation (RAG)-enabled multi-agent Generative AI (GenAI) module within the MetaSLR platform, benchmarking its efficiency, accuracy, reliability, and directional bias against Subject Matter Experts (SMEs) for Cochrane RoB 2.0 assessments.
METHODS: A validation study was conducted using 36 randomized controlled trials drawn from two historical SLRs. The AI system used a multi-agent architecture in which distinct sub-agents autonomously answered signalling questions (SQs) through a dynamic checklist approach, with answers to initial SQs adapting the subsequent logic. Domain-level and overall risk judgements were derived algorithmically from the SQ decisions. Agreement with SME consensus was evaluated using observed agreement, sensitivity, specificity, and F1 score. Inter-rater reliability (both Cohen’s κ and Gwet’s AC1), directional bias [Δ = mean(AI score) - mean(SME score)], and time-to-completion were also analysed.
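As an illustration of the reliability and directional-bias measures named above, the following is a minimal sketch, not the study's analysis code. It computes observed agreement, unweighted Cohen’s κ, unweighted Gwet’s AC1, and Δ for two raters. The function name `agreement_stats`, the toy ratings, and the ordinal scoring (Low = 0, Some concerns = 1, High = 2) are assumptions made for the example; the abstract does not state how judgements were scored, and the weighted variants it reports are not implemented here.

```python
from collections import Counter

# Assumed ordinal coding of RoB 2.0 judgements (hypothetical; not given in the abstract).
LEVELS = ["Low", "Some concerns", "High"]
SCORE = {level: i for i, level in enumerate(LEVELS)}


def agreement_stats(ai, sme, categories=LEVELS):
    """Unweighted agreement statistics for two raters over the same items."""
    if len(ai) != len(sme) or not ai:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ai)
    q = len(categories)

    # Observed agreement: share of items with identical judgements.
    po = sum(a == s for a, s in zip(ai, sme)) / n

    # Marginal proportions per category for each rater.
    ai_counts, sme_counts = Counter(ai), Counter(sme)
    p_ai = {c: ai_counts[c] / n for c in categories}
    p_sme = {c: sme_counts[c] / n for c in categories}

    # Cohen's kappa: chance agreement from the product of the marginals.
    pe_kappa = sum(p_ai[c] * p_sme[c] for c in categories)
    kappa = (po - pe_kappa) / (1 - pe_kappa)

    # Gwet's AC1: chance agreement from the averaged marginals,
    # which is less sensitive to class imbalance than kappa.
    pi = {c: (p_ai[c] + p_sme[c]) / 2 for c in categories}
    pe_ac1 = sum(pi[c] * (1 - pi[c]) for c in categories) / (q - 1)
    ac1 = (po - pe_ac1) / (1 - pe_ac1)

    # Directional bias: positive values mean the AI rated risk higher than SMEs.
    delta = sum(SCORE[a] for a in ai) / n - sum(SCORE[s] for s in sme) / n

    return {"agreement": po, "kappa": kappa, "ac1": ac1, "delta": delta}


# Toy ratings for six hypothetical studies (not data from the validation set).
ai_ratings = ["Low", "Low", "High", "Some concerns", "High", "Low"]
sme_ratings = ["Low", "Low", "Some concerns", "Some concerns", "High", "Low"]
stats = agreement_stats(ai_ratings, sme_ratings)
print({name: round(value, 3) for name, value in stats.items()})
```

Under this assumed coding, a positive Δ corresponds to the AI assigning higher risk than the SMEs on average, which is the sense of "more conservative" used in the abstract. The sketch does not guard against the degenerate case where chance agreement equals 1 (both raters using a single category throughout).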
RESULTS: Strong adherence to decision-logic for SQs was demonstrated with 78.16% agreement (specificity: 91.92%, unweighted κ = 0.637, Gwet’s AC1 = 0.727). For Domains, weighted AC1 (0.776) confirmed high reliability, robustly accounting for the dataset's significant class imbalance. For Overall RoB, the agreement was 55.56% (specificity: 78.06%, weighted AC1 = 0.337). Importantly, directional bias analysis indicated the AI was more conservative than SMEs (Overall RoB Δ=0.19; Domain Δ=0.07), with no evidence of systematic risk underestimation. GenAI reduced total assessment time by 46% (9.1 hours saved, including adjudication), cutting mean per-study time from 15 minutes to 15 seconds.
CONCLUSIONS: The multi-agent RAG-enabled module demonstrates high methodological validity in adhering to Cochrane conditional logic (high specificity). While exact concordance with SMEs was moderate, the GenAI system systematically exhibited a conservative bias in its judgements. These findings support the system's implementation as a reliable, high-fidelity quality appraisal tool for Human-in-the-Loop (HITL) governed evidence synthesis.
Conference/Value in Health Info
2026-05, ISPOR 2026, Philadelphia, PA, USA
Value in Health, Volume 29, Issue S6
Code
MSR134
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas