Development and Validation of an LLM-Based Study Selection Tool for Automating Systematic Literature Reviews: Achieving Time Efficiency and Maintaining Gold Standard Accuracy

Author(s)

Christopher Olsen, BHSc, Jayson Brian Habib, MPH, Niki Srikanth, BSc, Kevin Hou, PhD, Nicole Ferko, MSc.
Value & Evidence Division, EVERSANA, Burlington, ON, Canada.
OBJECTIVES: Systematic literature reviews (SLRs) require substantial resources to meet high methodological standards. Large language models (LLMs) can expedite review stages; however, maintaining quality is of central importance. This study validated a proprietary LLM tool for title and abstract (TIAB) and full-text screening against a standard dual-reviewer process.
METHODS: The screening tool was developed as an R-based application that interacts with LLMs via application programming interface (API) calls to conduct TIAB and full-text screening. Specific study eligibility criteria, screening instructions, and a single clinical dataset were sent to the LLM API. The responses for each criterion, the accompanying justifications, and the resulting eligibility decisions were evaluated for agreement and accuracy against human screening decisions (dual-reviewer process). We also piloted the tool on a full-text clinical dataset, simulating the second-level screening stage.
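The screening flow described above can be sketched as follows. This is an illustrative sketch only, written in Python rather than the R used by the actual tool. All names here (`Criterion`, `build_prompt`, `decide`, `screen_record`, `call_llm`), the JSON reply format, and the rule that only an explicit "no" leads to exclusion are assumptions made for illustration; the abstract does not disclose the tool's prompts, response schema, or decision logic.

```python
import json
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Criterion:
    """One eligibility criterion (hypothetical structure)."""
    name: str
    description: str


def build_prompt(instructions: str, criteria: List[Criterion],
                 title: str, abstract: str) -> str:
    """Combine screening instructions, criteria, and one record into a prompt."""
    lines = [instructions, "", "Eligibility criteria:"]
    lines += [f"- {c.name}: {c.description}" for c in criteria]
    lines += [
        "",
        f"Title: {title}",
        f"Abstract: {abstract}",
        "",
        # Assumed reply format; the real tool's schema is not published.
        'Reply as JSON: {"<criterion name>": '
        '{"met": "yes|no|unclear", "justification": "..."}}',
    ]
    return "\n".join(lines)


def decide(response_text: str,
           criteria: List[Criterion]) -> Tuple[str, Optional[str]]:
    """Turn per-criterion answers into an eligibility decision.

    Assumed recall-first rule: exclude only on an explicit "no";
    missing or "unclear" answers keep the record for human review.
    """
    answers = json.loads(response_text)
    for c in criteria:
        if answers.get(c.name, {}).get("met", "unclear") == "no":
            return "exclude", c.name
    return "include", None


def screen_record(call_llm: Callable[[str], str], instructions: str,
                  criteria: List[Criterion], title: str,
                  abstract: str) -> Tuple[str, Optional[str]]:
    """Screen one record; `call_llm` stands in for the LLM API call."""
    prompt = build_prompt(instructions, criteria, title, abstract)
    return decide(call_llm(prompt), criteria)
```

Passing `call_llm` in as a function keeps the sketch independent of any particular LLM provider and lets the decision logic be checked with a stubbed response.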
RESULTS: TIAB screening was performed on a dataset of 4,899 records. The agreement rate between LLM decisions and those of human reviewers was 95.57% (4682/4899; F1-score=0.8658; F2-score=0.9171). Of the 733 true inclusions, only 33 were erroneously excluded by the screening tool (recall=0.9549). The most common reasons for disagreement were records classified as narrative reviews or protocols. Compared with the dual-reviewer process, a single reviewer using the AI tool can achieve an estimated 45% reduction in screening time. Preliminary full-text screening of 404 records maintained excellent recall (61/62; 0.9838) with a 70.3% agreement rate and was completed in under 40 minutes. The performance of the screening tool at TIAB and full-text screening was robust in sensitivity and scenario analyses based on eligibility criteria, LLM parameters, and document types.
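The reported TIAB metrics can be cross-checked from the summary counts alone. The sketch below is a minimal illustration, assuming that inclusion is the positive class, that human decisions are the reference standard, and that every disagreement is either a false inclusion or a false exclusion. The function names are hypothetical, and the F-beta formula is the standard definition rather than anything stated in the abstract.

```python
def confusion_from_summary(total, agreements, true_inclusions, missed_inclusions):
    """Rebuild the 2x2 confusion matrix from the counts given in the abstract."""
    fn = missed_inclusions                  # true inclusions the tool excluded
    tp = true_inclusions - missed_inclusions
    fp = (total - agreements) - fn          # remaining disagreements
    tn = agreements - tp                    # remaining agreements
    return tp, fp, fn, tn


def f_beta(tp, fp, fn, beta):
    """Standard F-beta score; beta > 1 weights recall more than precision."""
    b2 = beta * beta
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)


# Counts taken from the TIAB results: 4,899 records, 4,682 agreements,
# 733 true inclusions, 33 of them erroneously excluded.
tp, fp, fn, tn = confusion_from_summary(4899, 4682, 733, 33)

recall = tp / (tp + fn)
precision = tp / (tp + fp)
agreement = (tp + tn) / (tp + fp + fn + tn)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"agreement={agreement:.4f} recall={recall:.4f} precision={precision:.4f}")
print(f"F1={f_beta(tp, fp, fn, 1):.4f} F2={f_beta(tp, fp, fn, 2):.4f}")
```

Under these assumptions the reconstruction gives 700 true positives, 184 false positives, 33 false negatives, and 3,982 true negatives, and the recomputed agreement, recall, F1, and F2 agree with the reported values to within rounding or truncation at the fourth decimal place. The implied precision of about 0.79 is a derived figure, not one reported in the abstract.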
CONCLUSIONS: Our proprietary AI tool demonstrated high accuracy in TIAB screening and preliminary success with full-text documents, reflecting the rapid evolution of LLMs in SLR-specific tasks. Our adaptable LLM-application design prioritizes recall to ensure relevant records are captured and minimizes additional records for human review, showing strong promise for more efficient study selection.

Conference/Value in Health Info

2025-05, ISPOR 2025, Montréal, Quebec, CA

Value in Health, Volume 28, Issue S1

Code

MSR53

Topic

Methodological & Statistical Research

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics

Disease

No Additional Disease & Conditions/Specialized Treatment Areas
