Artificial Intelligence in Systematic Reviews: An Investigation Into the Impact of Eligible Studies Being Excluded by Artificial Intelligence

Author(s)

Emma Bishop, BSc, MSc, Alice Sanderson, BSc, MPhil, Katie Reddish, BSc, Emma Carr, BA (Hons) History, Mary Edwards, BA, MA, Rachael McCool, BSc, Lavinia Ferrante di Ruffano, PhD.
York Health Economics Consortium, York, United Kingdom.
OBJECTIVES: Artificial intelligence (AI) can be used in systematic reviews (SRs) to increase study selection speed. However, this can lead to discrepancies in screening decisions between AI and human reviewers. We investigated the impact of these discrepancies at the full text (FT) study selection stage.
METHODS: The AI reviewer within a web-based tool (EasySLR) was assessed by uploading human FT screening decisions (include/exclude) from three completed SRs of health interventions: R1 investigated the economic burden of Adult-Onset Still’s disease (AOSD), while R2 and R3 evaluated the clinical effectiveness of treatments for multiple sclerosis and AOSD, respectively. AI decisions were compared with the final decisions reached after double independent human screening. We assessed the potential impact of AI false-negative exclusions by examining the influence that excluding each eligible record had on the review’s results.
RESULTS: The number of incorrectly excluded records from R1-R3 was 13 (45% of all AI excludes), 2 (1%), and 14 (12%), respectively: 29 records overall. Of these records, 17 were primary publications. We considered that 12 of these 17 records could have impacted the review results because they reported on a unique geographical cohort or unique patient subgroup, reported an outcome or timepoint that was not addressed by other included studies, or reported findings that conflicted with those of other included studies. The remaining 12 records were either linked to included primary studies or reported similar results to other included studies.
CONCLUSIONS: This investigation demonstrated that the number of discrepancies between AI and human reviewer decisions during FT screening varies considerably across reviews, as does the impact of these discrepancies. To mitigate this risk, differences between AI and human screening decisions should be checked to ensure that unique data are not incorrectly excluded. New developments to EasySLR enabling protocol pre-training may also help reduce these discrepancies.
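
The comparison described in METHODS (AI decisions versus final human decisions, with false-negative exclusions expressed as a share of all AI excludes) can be illustrated with a minimal Python sketch. This is not the authors' analysis code or any EasySLR interface: the `Decision` class, the function names, and the toy records are hypothetical and exist only to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    """One full-text screening record (hypothetical structure)."""
    record_id: str
    human_include: bool  # final decision after double independent human screening
    ai_include: bool     # AI reviewer's full-text decision


def false_negative_exclusions(decisions):
    """Return records the AI excluded although the human reviewers included them."""
    return [d for d in decisions if d.human_include and not d.ai_include]


def share_of_ai_excludes(decisions):
    """Fraction of all AI excludes that were false-negative exclusions."""
    ai_excludes = [d for d in decisions if not d.ai_include]
    if not ai_excludes:
        return 0.0  # no AI excludes, so no share to report
    return len(false_negative_exclusions(decisions)) / len(ai_excludes)


# Toy data for illustration only; these are not records from R1-R3.
toy = [
    Decision("rec-1", human_include=True, ai_include=True),
    Decision("rec-2", human_include=True, ai_include=False),
    Decision("rec-3", human_include=False, ai_include=False),
    Decision("rec-4", human_include=False, ai_include=True),
]

flagged = false_negative_exclusions(toy)
share = share_of_ai_excludes(toy)
```

Records returned by `false_negative_exclusions` are the ones a human reviewer would then check for unique data before they are dropped from a review.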

Conference/Value in Health Info

2025-11, ISPOR Europe 2025, Glasgow, Scotland

Value in Health, Volume 28, Issue S2

Code

MSR36

Topic

Methodological & Statistical Research

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics

Disease

No Additional Disease & Conditions/Specialized Treatment Areas
