EVALUATION OF MACHINE LEARNING-ASSISTED RISK OF BIAS ASSESSMENTS FOR SUPPORTING SYSTEMATIC REVIEWS. A CASE STUDY OF A REVIEW IN MOVEMENT DISORDERS

Author(s)

Michaela Lunan, PhD, Sathushan Thurairajah, BS Pharmacology, Jose Marcano Belisario, PhD, Louise Hartley, PhD.
RTI Health Solutions, Manchester, United Kingdom.

Presentation Documents

SA22-Lunan-Taylor_OH-Pubs_ISPOR-EU-poster_CAB_05MAY_PRINT.pdf

OBJECTIVES: Machine learning (ML) and artificial intelligence (AI) enable automation of systematic literature review (SLR) components. AI needs evaluation to ensure SLR rigour and to inform its integration into workflows. We aimed to evaluate AI-assisted risk of bias (ROB) assessment using Cochrane’s RoB2 tool in an SLR of randomised controlled trials (RCTs) in movement disorders, considering consistency (measured by accuracy, recall and precision), time spent and implications for SLR researchers.
METHODS: We tested RoB2 in the AI-assisted evidence synthesis tool Nested Knowledge using Adaptive Smart Tags. Quality appraisal prompts were created from the questions within the RoB2 tool. Data informing the RoB2 assessments were extracted from multiple text types, including full-text articles, abstracts and ClinicalTrials.gov records. AI ROB assessments were quality checked and supplemented by human researchers. Assessments for individual references were collated for linked studies to avoid duplicate assessment of a single study. Time spent building and piloting prompts, and quality checking and supplementing AI assessments, was compared with researcher averages for manual quality assessment using RoB2.
RESULTS: The SLR included 39 references corresponding to 24 individual studies. AI assessment had a high recall of 0.94 and acceptable accuracy and precision (0.71 each). The most consistent AI failures concerned blinding and whether participants were analysed in the correct group. Although these errors were consistent, time savings compared with human assessment were noted. Human quality check and error correction took an average of 8 minutes per study, compared with 30 minutes for human assessment alone. When accounting for prompting, average time increased to 23 minutes, which would be reduced through repeated use of the prompts for multiple SLRs.
CONCLUSIONS: Findings support the value of AI-assisted ROB assessment in SLRs, especially through time savings when prompts are reused across multiple SLRs. Consistent errors in AI assessment highlight the need for human oversight and quality review.
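
ILLUSTRATIVE SKETCH: The abstract does not state how accuracy, recall and precision were computed, so the Python below is only a minimal sketch under stated assumptions: the human-checked judgement is treated as the reference standard, each judgement is reduced to a single label, and "high" is treated as the positive class. The second function shows how the reported timings could be amortised, assuming the gap between 23 and 8 minutes per study is entirely a one-off prompt-building cost spread over the 24 included studies. All function names and the toy labels are hypothetical and are not study data.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Confusion-matrix counts for one positive label."""
    tp: int = 0
    fp: int = 0
    fn: int = 0
    tn: int = 0


def tally(ai, human, positive="high"):
    """Compare AI judgements with human reference judgements, item by item."""
    if len(ai) != len(human):
        raise ValueError("ai and human must have the same length")
    c = Counts()
    for a, h in zip(ai, human):
        if a == positive and h == positive:
            c.tp += 1
        elif a == positive:
            c.fp += 1
        elif h == positive:
            c.fn += 1
        else:
            c.tn += 1
    return c


def metrics(c):
    """Return accuracy, precision and recall; None where a denominator is zero."""
    total = c.tp + c.fp + c.fn + c.tn
    return {
        "accuracy": (c.tp + c.tn) / total if total else None,
        "precision": c.tp / (c.tp + c.fp) if (c.tp + c.fp) else None,
        "recall": c.tp / (c.tp + c.fn) if (c.tp + c.fn) else None,
    }


def minutes_per_study(check_minutes, prompt_minutes_total, n_studies, n_reviews=1):
    """Average time per study when a one-off prompt cost is shared across reviews.

    Assumes every review reuses the same prompts and includes n_studies studies.
    """
    return check_minutes + prompt_minutes_total / (n_studies * n_reviews)


if __name__ == "__main__":
    # Toy labels for illustration only (not taken from the review).
    ai = ["high", "high", "low", "low", "high"]
    human = ["high", "low", "low", "high", "high"]
    print(metrics(tally(ai, human)))

    # Assumed one-off prompt cost: (23 - 8) minutes x 24 studies = 360 minutes.
    prompt_total = (23 - 8) * 24
    for reviews in (1, 2, 3):
        print(reviews, minutes_per_study(8, prompt_total, 24, reviews))
```

Under these assumptions, a single review reproduces the reported 23 minutes per study, and the per-study figure approaches the 8-minute checking time as the same prompts are reused across more reviews.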

Conference/Value in Health Info

2026-05, ISPOR 2026, Philadelphia, PA, USA

Value in Health, Volume 29, Issue S6

Code

SA22

Topic

Study Approaches

Topic Subcategory

Literature Review & Synthesis

Disease

No Additional Disease & Conditions/Specialized Treatment Areas

Presentation (CTI)