Artificial Intelligence in Systematic Reviews: An Investigation Into the Impact of Eligible Studies Being Excluded by Artificial Intelligence
Author(s)
Emma Bishop, BSc, MSc, Alice Sanderson, BSc, MPhil, Katie Reddish, BSc, Emma Carr, BA (Hons) History, Mary Edwards, BA, MA, Rachael McCool, BSc, Lavinia Ferrante di Ruffano, PhD.
York Health Economics Consortium, York, United Kingdom.
OBJECTIVES: Artificial intelligence (AI) can be used in systematic reviews (SRs) to increase study selection speed. However, this can lead to discrepancies in screening decisions between AI and human reviewers. We investigated the impact of these discrepancies at the full text (FT) study selection stage.
METHODS: The AI reviewer within a web-based tool (EasySLR) was assessed by uploading human FT screening decisions (include/exclude) from three completed SRs of health interventions: R1 investigated the economic burden of Adult-Onset Still’s disease (AOSD); R2 and R3 evaluated the clinical effectiveness of treatments for multiple sclerosis and AOSD, respectively. AI decisions were compared with the final decisions reached after double independent human screening. We assessed the potential impact of AI false-negative exclusions by examining the influence that excluding each eligible record had on the review’s results.
RESULTS: The number of incorrectly excluded records from R1-R3 was 13 (45% of all AI excludes), 2 (1%), and 14 (12%), respectively: 29 records overall. Of these records, 17 primary publications were falsely excluded. We considered that 12 of these 17 records could have impacted the review results because they reported on a unique geographical cohort or unique patient subgroup, reported an outcome or timepoint that was not addressed by other included studies, or reported findings that conflicted with those of other included studies. The remaining 12 records were either linked to included primary studies or reported similar results to other included studies.
CONCLUSIONS: This investigation demonstrated that the number of discrepancies between AI and human reviewer decisions during FT screening varies considerably across reviews, as does the impact of these discrepancies. To mitigate this risk, differences between AI and human screening decisions should be checked to ensure that unique data are not incorrectly excluded. New developments to EasySLR enabling protocol pre-training may also help reduce these discrepancies.
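The discrepancy check recommended above can be illustrated with a minimal sketch. It assumes screening decisions are available as simple mappings from record ID to "include" or "exclude"; the function names, record IDs, and toy decisions are hypothetical and are not taken from EasySLR or from the three reviews. It flags records that human reviewers included but the AI excluded, and expresses them as a share of all AI excludes, the denominator used in the results.

```python
from typing import Dict, List


def false_negative_exclusions(human: Dict[str, str], ai: Dict[str, str]) -> List[str]:
    """Return IDs of records the human reviewers included but the AI excluded."""
    return sorted(
        rec_id
        for rec_id, decision in human.items()
        if decision == "include" and ai.get(rec_id) == "exclude"
    )


def share_of_ai_excludes(human: Dict[str, str], ai: Dict[str, str]) -> float:
    """Proportion of all AI excludes that were eligible according to the humans."""
    ai_excludes = [rec_id for rec_id, decision in ai.items() if decision == "exclude"]
    if not ai_excludes:
        return 0.0
    return len(false_negative_exclusions(human, ai)) / len(ai_excludes)


# Toy decisions for illustration only (hypothetical, not data from the reviews).
human_decisions = {"rec1": "include", "rec2": "exclude", "rec3": "include", "rec4": "exclude"}
ai_decisions = {"rec1": "include", "rec2": "exclude", "rec3": "exclude", "rec4": "exclude"}

to_check = false_negative_exclusions(human_decisions, ai_decisions)
share = share_of_ai_excludes(human_decisions, ai_decisions)
print(to_check)           # records a human should re-examine before exclusion
print(f"{share:.0%} of AI excludes were eligible")
```

In this sketch, every flagged record would go back to a human reviewer, who would judge whether it contributes unique data (for example, a unique cohort, subgroup, outcome, or timepoint) before the exclusion is accepted.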
Conference/Value in Health Info
2025-11, ISPOR Europe 2025, Glasgow, Scotland
Value in Health, Volume 28, Issue S2
Code
MSR36
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas