Evaluating the Application of GPT-4o and Retrieval-Augmented Generation (RAG) for Assessing Risk of Bias and Study Quality in Systematic Literature Reviews (SLRs): Preliminary Findings from a Comparative Study

Moderator

Mir Sohail Fazeli, PhD, MD, Evidinno Outcomes Research Inc., Vancouver, BC, Canada

Speakers

Ellen Kasireddy; Cuthbert Chow, Other; Mir-Masoud Pourrahmat, Evidinno Outcomes Research Inc., Vancouver, BC, Canada; Jean-Paul Collet, PhD, MD, Evidinno Outcomes Research Inc., Vancouver, BC, Canada

OBJECTIVES: To assess the performance of GPT-4o and RAG in risk of bias assessment for SLRs.
METHODS: A custom model integrating OpenAI’s GPT-4o, a large language model, with RAG capabilities was developed for quality assessment in SLRs. The model employed a two-stage approach: (1) vector store implementation for RAG, and (2) integration with OpenAI Assistants. Performance was assessed for all items included in the Cochrane Risk of Bias Version 2 (ROB2) for randomized controlled trials, the JBI tool for cross-sectional studies, and the Newcastle-Ottawa Scale (NOS) for cohort studies, using 10 studies each. Error analysis included true positives (both human and model mark satisfactory), true negatives (both mark unsatisfactory), false positives (model marks satisfactory, human does not), and false negatives (model marks unsatisfactory, human marks satisfactory). Metrics included accuracy (ratio of correct predictions to total predictions), sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV).
RESULTS: The ROB2 tool’s accuracy was 68.1%, sensitivity 67.0%, specificity 73.1%, PPV 91.9%, and NPV 32.8%. The model effectively predicted low-risk-of-bias items and minimized misclassification of these items as high risk. The JBI tool's accuracy was 72.5%, sensitivity 62.5%, specificity 87.5%, PPV 88.2%, and NPV 60.9%. The model showed high specificity in identifying high risk of bias but was less effective in low-risk-of-bias predictions. The NOS tool's accuracy was 56.7%, sensitivity 33.3%, specificity 83.3%, PPV 69.6%, and NPV 52.2%. The model demonstrated high specificity but low sensitivity and a high false-negative rate for low-risk-of-bias items.
CONCLUSIONS: The model demonstrated high specificity for identifying high risk of bias across tools, but showed limited sensitivity, especially with NOS, and low NPV with ROB2, indicating a high false-negative rate. While the model is currently useful alongside human oversight, further optimization is needed, with priority given to improving sensitivity so that low-risk-of-bias items are identified reliably and overall performance improves.
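
The metrics named in METHODS follow directly from the four error-analysis counts, with "satisfactory" (low risk of bias) treated as the positive class. The sketch below is a minimal illustration of those standard definitions, not the authors' code. The names `Counts`, `safe_ratio`, and `agreement_metrics` are hypothetical, and the example counts are made up for illustration because the abstract reports percentages only.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Counts:
    """Item-level agreement counts between the model and the human reviewer."""
    tp: int  # both mark satisfactory
    tn: int  # both mark unsatisfactory
    fp: int  # model marks satisfactory, human does not
    fn: int  # model marks unsatisfactory, human marks satisfactory


def safe_ratio(numerator: int, denominator: int) -> Optional[float]:
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def agreement_metrics(c: Counts) -> Dict[str, Optional[float]]:
    """Compute accuracy, sensitivity, specificity, PPV, and NPV from counts."""
    total = c.tp + c.tn + c.fp + c.fn
    return {
        "accuracy": safe_ratio(c.tp + c.tn, total),
        "sensitivity": safe_ratio(c.tp, c.tp + c.fn),
        "specificity": safe_ratio(c.tn, c.tn + c.fp),
        "ppv": safe_ratio(c.tp, c.tp + c.fp),
        "npv": safe_ratio(c.tn, c.tn + c.fn),
    }


# Illustrative counts only (assumption: not taken from the study).
example = Counts(tp=8, tn=6, fp=2, fn=4)
result = agreement_metrics(example)

for name, value in result.items():
    print(f"{name}: {value:.1%}" if value is not None else f"{name}: undefined")
```

Under these definitions, a low NPV means that many items the model marks unsatisfactory were marked satisfactory by the human reviewer, which is the false-negative pattern the conclusions describe.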

Conference/Value in Health Info

2025-05, ISPOR 2025, Montréal, Quebec, CA

Value in Health, Volume 28, Issue S1

Code

MSR22

Topic

Methodological & Statistical Research

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics

Disease

No Additional Disease & Conditions/Specialized Treatment Areas

Presentation (CTI)