Give Us the Tools and We Will Finish the Job: How Accurate Is AI for Study Selection in Systematic Reviews?
Author(s)
Emma Bishop, BSc, MSc, Alice Sanderson, BSc, MPhil, Katie Reddish, BSc, Emma Carr, BA, Mary Edwards, BA, MA, Rachael McCool, BSc, Lavinia Ferrante di Ruffano, PhD.
York Health Economics Consortium, York, United Kingdom.
OBJECTIVES: The use of artificial intelligence (AI) in systematic reviews (SRs) is becoming more widespread. It is increasingly important to understand whether these systems are sufficiently accurate to replace human reviewers.
METHODS: We retrospectively investigated the accuracy of the AI reviewer in one widely available web-based tool (EasySLR) for study eligibility decisions across three SRs of healthcare interventions. Two reviews evaluated the clinical effectiveness of treatments for multiple sclerosis and adult-onset Still's disease (AOSD); the third explored the economic burden of AOSD. For each review, screening decisions (include/exclude) made by two independent human reviewers were uploaded to EasySLR, together with reasons for full text (FT) exclusion. The accuracy of the AI's decisions and FT exclusion reasons was measured against this human reference standard.
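As a rough illustration of the scoring described above (not the authors' actual analysis code or the EasySLR API), the following Python sketch computes AI-human agreement and the share of human-eligible records the AI also includes; all record data, field names, and helper functions are hypothetical.

    from dataclasses import dataclass

    @dataclass
    class Record:
        record_id: str
        human_decision: str  # consensus of two independent reviewers: "include"/"exclude"
        ai_decision: str     # the AI reviewer's decision: "include"/"exclude"

    # Hypothetical title/abstract decisions; real reviews screen thousands of records.
    records = [
        Record("r1", "include", "include"),
        Record("r2", "exclude", "exclude"),
        Record("r3", "include", "exclude"),  # an AI "false negative"
        Record("r4", "exclude", "include"),  # AI over-inclusion
        Record("r5", "include", "include"),
    ]

    def agreement(records):
        """Proportion of records where the AI matches the human reference standard."""
        return sum(r.ai_decision == r.human_decision for r in records) / len(records)

    def sensitivity(records):
        """Proportion of human-eligible records the AI also included
        (one minus the "false negative" rate discussed in this abstract)."""
        eligible = [r for r in records if r.human_decision == "include"]
        return sum(r.ai_decision == "include" for r in eligible) / len(eligible)

    print(f"AI-human agreement: {agreement(records):.0%}")   # 60% on this toy data
    print(f"AI sensitivity:     {sensitivity(records):.0%}")  # 67% on this toy data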
RESULTS: Across the three reviews, AI-human agreement ranged from 87% to 92% at title and abstract (TA) assessment and from 55% to 62% at FT assessment. The AI's "false negative" rate varied widely: in the clinical reviews, the AI alone identified 59% to 80% of eligible TAs and 79% to 93% of eligible FTs, whereas performance in the economic review was considerably poorer, with the AI excluding 81% of eligible TAs and FTs. Agreement on exclusion reasons ranged from 30% to 40% across all reviews. Differences in performance were multifactorial, reflecting publication format and the AI's difficulty with patient subgroups, complex outcomes, and study design.
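The exclusion-reason agreement can be computed the same way, restricted to records that both the AI and the human reviewers excluded at FT; the sketch below uses invented reason labels purely for illustration.

    # Hypothetical FT exclusion reasons for records excluded by both AI and humans.
    human_reasons = {"r10": "population", "r11": "study design", "r12": "outcomes"}
    ai_reasons    = {"r10": "population", "r11": "outcomes",     "r12": "outcomes"}

    shared = human_reasons.keys() & ai_reasons.keys()  # records both sides excluded
    matches = sum(human_reasons[r] == ai_reasons[r] for r in shared)
    print(f"Exclusion reason agreement: {matches / len(shared):.0%}")  # 67% here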
CONCLUSIONS: EasySLR's AI reviewer is a promising study selection tool for clinical SRs, though it should be used alongside human reviewers to achieve acceptable screening accuracy. Its excessive exclusion of eligible records in the economic review suggests the AI had difficulty interpreting the complex PICO; however, new developments enabling better protocol pre-training may help mitigate this. Future evaluations should assess AI tools prospectively to understand how many eligible studies would be missed when AI makes screening decisions alongside a single human reviewer.
Conference/Value in Health Info
2025-11, ISPOR Europe 2025, Glasgow, Scotland
Value in Health, Volume 28, Issue S2
Code
MSR116
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas