Development and Validation of an LLM-Based Study Selection Tool for Automating Systematic Literature Reviews: Achieving Time Efficiency and Maintaining Gold Standard Accuracy
Author(s)
Christopher Olsen, BHSc, Jayson Brian Habib, MPH, Niki Srikanth, BSc, Kevin Hou, PhD, Nicole Ferko, MSc.
Value & Evidence Division, EVERSANA, Burlington, ON, Canada.
OBJECTIVES: Systematic literature reviews (SLRs) require substantial resources to meet high methodological standards. Large language models (LLMs) can expedite review stages; however, maintaining quality is of central importance. This study validated a proprietary LLM tool for title and abstract (TIAB) and full-text screening against the standard dual-reviewer process.
METHODS: The screening tool was developed as an R-based application that interacts with LLMs via application programming interface (API) calls to conduct TIAB and full-text screening. Specific study eligibility criteria, screening instructions, and a single clinical dataset were sent to the LLM API. The LLM's response for each criterion, its justification, and its eligibility decision were evaluated for agreement and accuracy against human screening decisions (dual-reviewer process). We also piloted the tool on a full-text clinical dataset, simulating the second-level screening stage.
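The request-and-response flow described in METHODS can be illustrated with a minimal sketch. The abstract does not disclose the prompt, the response schema, or the decision rule, and the actual tool is written in R, so everything below is an assumption made for illustration: the `Criterion` class, the `build_prompt`, `decide`, and `parse_response` helpers, the JSON layout, and the rule that only a clear "no" leads to exclusion. The LLM call itself is omitted; a hand-written mock response stands in for it.

```python
import json
from dataclasses import dataclass


@dataclass
class Criterion:
    """One eligibility criterion (hypothetical structure)."""
    name: str
    description: str


def build_prompt(criteria, instructions, title, abstract):
    """Assemble the text sent to the LLM API (layout is an assumption)."""
    lines = [instructions, "", "Eligibility criteria:"]
    lines += [f"- {c.name}: {c.description}" for c in criteria]
    lines += ["", f"Title: {title}", f"Abstract: {abstract}", "",
              'Answer as JSON: {"criteria": {name: "yes"|"no"|"unclear"}, '
              '"justification": "..."}']
    return "\n".join(lines)


def decide(answers):
    """Recall-oriented rule (assumption): exclude only on a clear 'no'."""
    if any(answer == "no" for answer in answers.values()):
        return "exclude"
    return "include"


def parse_response(raw, criteria):
    """Turn the raw JSON reply into per-criterion answers and a decision."""
    data = json.loads(raw)
    reported = data.get("criteria", {})
    # A criterion missing from the reply is treated as "unclear", not "no".
    answers = {c.name: str(reported.get(c.name, "unclear")).lower()
               for c in criteria}
    return {
        "answers": answers,
        "justification": data.get("justification", ""),
        "decision": decide(answers),
    }


criteria = [
    Criterion("population", "Adults with the condition of interest"),
    Criterion("study_design", "Randomized controlled trial"),
]
prompt = build_prompt(criteria, "Screen this record against each criterion.",
                      "Example title", "Example abstract text.")

# Hand-written mock reply, not real model output.
mock_raw = json.dumps({
    "criteria": {"population": "yes", "study_design": "unclear"},
    "justification": "Study design is not stated in the abstract.",
})
result = parse_response(mock_raw, criteria)
print(result["decision"], "-", result["justification"])
```

Treating "unclear" as an inclusion is one way a design could favor recall at the TIAB stage, at the cost of passing more records on for human review.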
RESULTS: TIAB screening was performed on a dataset of 4,899 records. The agreement rate between LLM decisions and those of human reviewers was 95.57% (4682/4899; F1-score=0.8658; F2-score=0.9171). Of the 733 total true inclusions, only 33 (recall=0.9549) were erroneously excluded by the screening tool. The most common source of disagreement was records classified as narrative reviews or protocols. Compared to the dual-reviewer process, a single reviewer using the AI tool can achieve an estimated 45% reduction in the time required for screening. Preliminary full-text screening of 404 records maintained excellent recall (61/62; 0.9838) and a 70.3% agreement rate in under 40 minutes. The performance of the screening tool at TIAB and full-text screening was robust in sensitivity and scenario analyses based on eligibility criteria, LLM parameters, and document types.
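The TIAB metrics reported in RESULTS can be cross-checked from the counts given there. The sketch below assumes a binary include/exclude decision with the dual-reviewer outcome as the reference standard, so that every disagreement is either a false negative or a false positive. The false-positive count, true-negative count, and precision are derived under that assumption and are not stated in the abstract; the variable and function names are illustrative.

```python
# Counts reported in the abstract (TIAB screening).
total_records = 4899
agreements = 4682
true_inclusions = 733
false_negatives = 33

# Derived under the stated assumption (not reported in the abstract).
true_positives = true_inclusions - false_negatives
false_positives = (total_records - agreements) - false_negatives
true_negatives = agreements - true_positives


def f_beta(tp, fp, fn, beta):
    """F-beta score from counts; beta > 1 weights recall more than precision."""
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)


agreement_rate = agreements / total_records
recall = true_positives / true_inclusions
precision = true_positives / (true_positives + false_positives)
f1 = f_beta(true_positives, false_positives, false_negatives, beta=1.0)
f2 = f_beta(true_positives, false_positives, false_negatives, beta=2.0)

print(f"agreement={agreement_rate:.4f} recall={recall:.4f} "
      f"precision={precision:.4f} F1={f1:.4f} F2={f2:.4f}")
```

Under these assumptions the recomputed agreement, recall, F1, and F2 values match the reported figures to within the fourth decimal place. F2 exceeds F1 because it weights recall more heavily than precision.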
CONCLUSIONS: Our proprietary AI tool demonstrated high accuracy in TIAB screening and preliminary success with full-text documents, reflecting the rapid evolution of LLMs in SLR-specific tasks. Our adaptable LLM-application design prioritizes recall to ensure relevant records are captured and minimizes additional records for human review, showing strong promise for more efficient study selection.
Conference/Value in Health Info
2025-05, ISPOR 2025, Montréal, Quebec, CA
Value in Health, Volume 28, Issue S1
Code
MSR53
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas