GenAI for Critical Appraisal of Evidence for Systematic Literature Reviews (SLRs): A Face-Off between GenAI and Human Reviewers
Author(s)
Sahil Sharma, M. Pharm.1, Sayyeda Anam, M. Sc.2, Rashi Tomer, M. Pharm.2, Ashish Pandey, M. Pharm.2, Sheetal Sharma, M. Sc.1
1ZS Associates, Gurugram, India, 2ZS Associates, Noida, India
OBJECTIVES: In Health Economics and Outcomes Research (HEOR), critical appraisal of randomized controlled trials (RCTs) using recommended checklists such as the NICE checklist is crucial for identifying high-quality evidence for systematic literature reviews (SLRs). However, the increasing volume of RCTs calls for an automated solution. This study compares the performance of human reviewers with that of a trained GenAI agent in appraising RCTs using the NICE checklist.
METHODS: A trained GenAI agent and two independent human reviewers appraised a set of RCTs focused on allergic rhinitis using the NICE checklist. The GenAI agent was developed and trained through prompt engineering to ensure adherence to the checklist. Inter-rater agreement between the GenAI agent and the human reviewers was assessed using Cohen's Kappa (κ) and percent agreement.
RESULTS: Agreement between the GenAI agent and the human reviewers on the NICE checklist responses (Yes/No/Unclear) was 87.5%, indicating promising accuracy. Domain-wise agreement was as follows: performance bias, 100%; detection bias, 85%; selection bias, 79.17%; and attrition bias, 75%. Cohen's Kappa (κ) was 0.401 (SE: 0.15, 95% CI: 0.108 to 0.695), indicating fair to moderate agreement. The low Kappa value arose because both the GenAI agent and the human reviewers answered "Yes" to many NICE questions, leaving little variation in responses. Because Kappa corrects for agreement expected by chance, and chance agreement is high when one response dominates, this limited Kappa's ability to capture the true level of agreement despite strong alignment.
CONCLUSIONS: The GenAI agent can serve as an initial reviewer for quality assessment of RCT publications using the NICE checklist, adding efficiency to the SLR process. Nevertheless, human quality assurance remains crucial to validate outputs and address complexities beyond AI capabilities. With further optimization, this approach could substantially advance the automation of critical appraisal in HEOR, improving overall productivity in evidence assessment for SLRs.
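ILLUSTRATION: The Results report high percent agreement alongside a modest Kappa and attribute the gap to the predominance of "Yes" responses. The sketch below shows how that can happen. It computes percent agreement and Cohen's Kappa from a rater-by-rater count table. The two tables are hypothetical counts invented for illustration, not the study's data, and the function names are assumptions rather than the authors' implementation.

```python
from typing import List

Table = List[List[int]]  # square table: rows = rater A, columns = rater B


def percent_agreement(table: Table) -> float:
    """Observed agreement p_o: share of items on the diagonal."""
    n = sum(sum(row) for row in table)
    return sum(table[i][i] for i in range(len(table))) / n


def cohen_kappa(table: Table) -> float:
    """Cohen's Kappa: (p_o - p_e) / (1 - p_e)."""
    k = len(table)
    n = sum(sum(row) for row in table)
    p_o = percent_agreement(table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # Chance agreement p_e: product of the two raters' marginal proportions,
    # summed over categories.
    p_e = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    if p_e == 1.0:
        raise ValueError("Kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1 - p_e)


# HYPOTHETICAL counts, category order: Yes, No, Unclear (40 items each).
# "Yes" dominates in the first table; categories are balanced in the second.
skewed = [
    [33, 2, 1],
    [1, 1, 0],
    [1, 0, 1],
]
balanced = [
    [12, 1, 1],
    [1, 12, 0],
    [1, 1, 11],
]

for name, table in (("skewed", skewed), ("balanced", balanced)):
    print(f"{name}: agreement={percent_agreement(table):.3f}, "
          f"kappa={cohen_kappa(table):.3f}")
```

Both hypothetical tables have the same percent agreement, yet Kappa is much lower for the skewed one, because two raters who both answer "Yes" most of the time would agree often by chance alone.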
Conference/Value in Health Info
2025-05, ISPOR 2025, Montréal, Quebec, CA
Value in Health, Volume 28, Issue S1
Code
MSR96
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas