Evaluating the Application of GPT-4o and Retrieval-Augmented Generation (RAG) for Assessing Risk of Bias and Study Quality in Systematic Literature Reviews (SLRs): Preliminary Findings from a Comparative Study
Author(s)
Ellen Kasireddy, MHSc, Cuthbert Chow, MDS, Mir-Masoud Pourrahmat, MSc, Jean-Paul Collet, MD, PhD, Mir Sohail Fazeli, MD, PhD
Evidinno Outcomes Research Inc, Vancouver, BC, Canada
OBJECTIVES: To assess the performance of GPT-4o and RAG in risk of bias assessment for SLRs.
METHODS: A custom model integrating OpenAI’s GPT-4o, a large language model, with RAG capabilities was developed for quality assessment in SLRs. The model employed a two-stage approach: (1) vector store implementation for RAG, and (2) integration with OpenAI Assistants. Performance was assessed for all items included in the Cochrane Risk of Bias Version 2 (ROB2) for randomized controlled trials, the JBI tool for cross-sectional studies, and the Newcastle-Ottawa Scale (NOS) for cohort studies, using 10 studies each. Error analysis included true positives (both human and model mark satisfactory), true negatives (both mark unsatisfactory), false positives (model marks satisfactory, human does not), and false negatives (model marks unsatisfactory, human marks satisfactory). Metrics included accuracy (ratio of correct predictions to total predictions), sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV).
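The five metrics named above follow directly from the four error-analysis counts. A minimal Python sketch of that arithmetic is given here for illustration; it is not the authors' code. The names `Confusion`, `tally`, and `metrics` are hypothetical, and the example counts are an assumption chosen only because they reproduce the reported JBI percentages (the abstract does not report raw counts).

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class Confusion:
    tp: int  # human and model both mark the item satisfactory
    tn: int  # human and model both mark the item unsatisfactory
    fp: int  # model marks satisfactory, human does not
    fn: int  # model marks unsatisfactory, human marks satisfactory


def tally(pairs: Iterable[Tuple[bool, bool]]) -> Confusion:
    """Count outcomes from (human_satisfactory, model_satisfactory) pairs."""
    tp = tn = fp = fn = 0
    for human, model in pairs:
        if human and model:
            tp += 1
        elif not human and not model:
            tn += 1
        elif model:  # model satisfactory, human unsatisfactory
            fp += 1
        else:  # model unsatisfactory, human satisfactory
            fn += 1
    return Confusion(tp, tn, fp, fn)


def _ratio(numerator: int, denominator: int) -> float:
    # Return NaN instead of raising when a metric is undefined.
    return numerator / denominator if denominator else float("nan")


def metrics(c: Confusion) -> Dict[str, float]:
    total = c.tp + c.tn + c.fp + c.fn
    return {
        "accuracy": _ratio(c.tp + c.tn, total),
        "sensitivity": _ratio(c.tp, c.tp + c.fn),
        "specificity": _ratio(c.tn, c.tn + c.fp),
        "ppv": _ratio(c.tp, c.tp + c.fp),
        "npv": _ratio(c.tn, c.tn + c.fn),
    }


if __name__ == "__main__":
    # Hypothetical counts (assumption): not taken from the abstract.
    example = Confusion(tp=30, tn=28, fp=4, fn=18)
    for name, value in metrics(example).items():
        print(f"{name}: {value:.1%}")
```

Because "satisfactory" (low risk of bias) is treated as the positive class, sensitivity measures how often the model agrees with a human low-risk rating, and specificity measures how often it agrees with a human high-risk rating.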
RESULTS: With ROB2, the model's accuracy was 68.1%, sensitivity 67.0%, specificity 73.1%, PPV 91.9%, and NPV 32.8%. The model effectively predicted low-risk-of-bias items and minimized misclassification of these items as high risk. With the JBI tool, accuracy was 72.5%, sensitivity 62.5%, specificity 87.5%, PPV 88.2%, and NPV 60.9%. The model showed high specificity in identifying high risk of bias but was less effective in low-risk-of-bias predictions. With NOS, accuracy was 56.7%, sensitivity 33.3%, specificity 83.3%, PPV 69.6%, and NPV 52.2%. The model demonstrated high specificity but low sensitivity and a high false-negative rate for low-risk-of-bias items.
CONCLUSIONS: The model demonstrated high specificity for identifying high risk of bias across tools, but showed limited sensitivity, especially with NOS, and low NPV with ROB2, indicating a high false-negative rate. While the model is currently useful alongside human oversight, further optimization is needed, prioritizing improved sensitivity for identifying low risk of bias and better overall performance.
Conference/Value in Health Info
2025-05, ISPOR 2025, Montréal, Quebec, CA
Value in Health, Volume 28, Issue S1
Code
MSR22
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas