VALIDATION OF THE METASLR ROB MODULE: A MULTI-AGENT GENERATIVE AI SYSTEM FOR AUTOMATING COCHRANE RISK OF BIAS 2.0 ASSESSMENTS

Author(s)

Inderpreet S. Marwaha, MSc, RPh¹, Rajdeep Kaur, PhD¹, Ritesh Dubey, PharmD¹, Shubhram Pandey, MSc¹, Barinder Singh, RPh².
¹Pharmacoevidence Pvt. Ltd., SAS Nagar, Mohali, India, ²Pharmacoevidence Pvt. Ltd., SAS Nagar, Mohali, India.

OBJECTIVES: Risk of Bias (RoB) assessment is a critical yet resource-intensive element of systematic literature reviews (SLRs), and its quality is often limited by poor inter-rater consistency. This study validated a Retrieval-Augmented Generation (RAG)-enabled multi-agent Generative AI (GenAI) module within the MetaSLR platform, benchmarking its efficiency, accuracy, reliability, and directional bias against Subject Matter Experts (SMEs) for Cochrane RoB 2.0 assessments.
METHODS: A validation study was conducted using 36 randomized controlled trials drawn from two historical SLRs. The AI system used a multi-agent architecture in which distinct sub-agents autonomously answered signalling questions (SQs) through a dynamic checklist, so that answers to earlier SQs determined which subsequent SQs were asked. Domain-level and overall risk judgements were derived algorithmically from the SQ decisions. Agreement with SME consensus was evaluated using observed agreement, sensitivity, specificity, and F1 score. In addition, inter-rater reliability (Cohen’s κ and Gwet’s AC1), directional bias [Δ = mean(AI score) - mean(SME score)], and time-to-completion were analysed.
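As an illustration of how an overall judgement can be derived algorithmically from domain-level judgements, the following is a minimal sketch of the general RoB 2 roll-up rule. It is not the MetaSLR implementation. The function name `overall_rob` and the `escalate_at` parameter are hypothetical: Cochrane guidance allows an overall "High" rating when several domains raise some concerns, but leaves that step to reviewer judgement, so the threshold here is only an assumed stand-in for it.

```python
from typing import Optional, Sequence

LEVELS = ("Low", "Some concerns", "High")


def overall_rob(domains: Sequence[str], escalate_at: Optional[int] = None) -> str:
    """Roll domain-level RoB 2 judgements up into one overall judgement.

    escalate_at is a hypothetical knob: if set, that many "Some concerns"
    domains are treated as overall "High". Leave it as None to skip this step.
    """
    unknown = set(domains) - set(LEVELS)
    if unknown:
        raise ValueError(f"unexpected judgement(s): {sorted(unknown)}")

    # Any single high-risk domain makes the whole trial high risk.
    if "High" in domains:
        return "High"

    n_some = sum(d == "Some concerns" for d in domains)

    # Assumed stand-in for the reviewer's judgement about multiple concerns.
    if escalate_at is not None and n_some >= escalate_at:
        return "High"

    return "Some concerns" if n_some else "Low"


if __name__ == "__main__":
    five_domains = ["Low", "Some concerns", "Low", "Low", "Low"]
    print(overall_rob(five_domains))
```

Keeping the escalation step behind an explicit parameter makes the one judgement-dependent part of the rule visible, which is where an automated system and a human reviewer are most likely to diverge.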
RESULTS: At the SQ level, the AI showed strong adherence to decision logic, with 78.16% agreement (specificity: 91.92%, unweighted κ = 0.637, Gwet’s AC1 = 0.727). At the domain level, weighted AC1 was 0.776, indicating high reliability on a statistic that is less sensitive than κ to the dataset's marked class imbalance. For overall RoB, agreement was 55.56% (specificity: 78.06%, weighted AC1 = 0.337). Importantly, directional bias analysis indicated that the AI was more conservative than SMEs (overall RoB Δ = 0.19; domain Δ = 0.07), with no evidence of systematic risk underestimation. GenAI reduced total assessment time by 46% (9.1 hours saved, including adjudication), cutting mean per-study time from 15 minutes to 15 seconds.
CONCLUSIONS: The multi-agent RAG-enabled module adheres closely to Cochrane conditional logic, as reflected in its high specificity. Exact concordance with SMEs was moderate, and where the GenAI system differed it tended towards more conservative judgements. These findings support the system's implementation as a reliable, high-fidelity quality appraisal tool for Human-in-the-Loop (HITL)-governed evidence synthesis.
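For readers who want to see how the agreement statistics named in the abstract are defined, the sketch below computes observed agreement, unweighted Cohen's κ, unweighted Gwet's AC1, and the directional bias Δ for two raters. It is an illustrative sketch, not the study's analysis code. The function name `agreement_stats` and the example ratings are hypothetical, and the ordinal coding Low = 0, Some concerns = 1, High = 2 is an assumption, since the abstract does not state how judgements were scored. The weighted variants reported for domains and overall RoB are not shown.

```python
from collections import Counter

CATEGORIES = ["Low", "Some concerns", "High"]
# Assumed ordinal coding; the abstract does not specify the scoring scale.
SCORE = {"Low": 0, "Some concerns": 1, "High": 2}


def agreement_stats(ai, sme, categories=CATEGORIES):
    """Unweighted agreement statistics for two raters on the same items."""
    n = len(ai)
    if n == 0 or n != len(sme):
        raise ValueError("ai and sme must be non-empty and of equal length")
    q = len(categories)

    # Observed agreement: share of items with identical judgements.
    p_o = sum(a == s for a, s in zip(ai, sme)) / n

    # Marginal proportions per category for each rater.
    ai_counts, sme_counts = Counter(ai), Counter(sme)
    p_ai = [ai_counts[c] / n for c in categories]
    p_sme = [sme_counts[c] / n for c in categories]

    # Cohen's kappa: chance agreement from the product of the marginals.
    pe_kappa = sum(a * s for a, s in zip(p_ai, p_sme))
    kappa = (p_o - pe_kappa) / (1 - pe_kappa) if pe_kappa < 1 else float("nan")

    # Gwet's AC1: chance agreement from the averaged marginals.
    pi = [(a + s) / 2 for a, s in zip(p_ai, p_sme)]
    pe_ac1 = sum(p * (1 - p) for p in pi) / (q - 1)
    ac1 = (p_o - pe_ac1) / (1 - pe_ac1)

    # Directional bias: positive values mean the AI rated risk higher.
    delta = sum(SCORE[a] for a in ai) / n - sum(SCORE[s] for s in sme) / n

    return {"agreement": p_o, "kappa": kappa, "ac1": ac1, "delta": delta}


if __name__ == "__main__":
    # Hypothetical ratings for four trials, for illustration only.
    ai_ratings = ["High", "Some concerns", "Low", "High"]
    sme_ratings = ["Some concerns", "Some concerns", "Low", "High"]
    print(agreement_stats(ai_ratings, sme_ratings))
```

In this toy example the AI rates one trial a level higher than the SMEs, so Δ is 0.25 under the assumed coding. A positive Δ of this kind is what a "conservative" rater looks like.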

Conference/Value in Health Info

2026-05, ISPOR 2026, Philadelphia, PA, USA

Value in Health, Volume 29, Issue S6

Code

MSR134

Topic

Methodological & Statistical Research

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics

Disease

No Additional Disease & Conditions/Specialized Treatment Areas

Presentation (CTI)