ACCELERATING EVIDENCE GENERATION: LEVERAGING LLMS FOR FULL-TEXT STUDY SELECTION
Author(s)
Christopher Olsen, BHSc, Jayson Brian Habib, MPH, Elizabeth Salvo-Halloran, MSc, Sumeet Singh, BScPhm, MSc, Nicole Ferko, MSc.
Value & Evidence, EVERSANA, Victoria, BC, Canada.
OBJECTIVES: Large language models (LLMs) can expedite the systematic literature review (SLR) process; however, most published applications focus on abstract screening. Full-text screening is a resource-intensive phase of SLRs where efficiencies could be achieved with AI tools; however, maintaining accuracy remains critical, as suboptimal sensitivity at this stage risks excluding relevant studies. The objective of the current research was to validate a proprietary LLM-based tool to support full-text screening using a recently published SLR.
METHODS: The AI-assisted tool was implemented as an R-based application integrating LLMs via API calls for preliminary full-text screening. A recently published PRISMA-compliant SLR of clinical trials in Crohn’s disease was selected, comprising 426 records previously screened by two human reviewers; these were submitted to the LLM (GPT-4o-2024-11-20). The LLM was prompted with the study eligibility criteria, screening instructions, and the full-text articles for evaluation. For each record, the model addressed each individual criterion, provided justifications, and rendered an eligibility decision. Exclusion decisions and rationale were assessed for agreement and accuracy against the original human decisions. A human reviewer was consulted for verification where LLM decisions deviated from the original review.
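The screening workflow described above (prompt with criteria and full text, per-criterion reasoning, final eligibility decision) can be sketched as follows. The published tool is a proprietary R application; this is an illustrative Python sketch in which the criteria text, the `llm_call` stub, and the response format are all hypothetical stand-ins, not the tool's actual interface.

```python
# Illustrative sketch of an LLM-assisted full-text screening loop.
# llm_call() is a stub; a real implementation would POST the prompt to a
# model API (e.g., GPT-4o) and return the model's text response.
from dataclasses import dataclass

# Abbreviated, hypothetical eligibility criteria for illustration only.
ELIGIBILITY_CRITERIA = """\
Population: adults with Crohn's disease
Study design: randomized clinical trials
"""

@dataclass
class Decision:
    record_id: str
    include: bool
    rationale: str

def llm_call(prompt: str) -> str:
    """Stand-in for a real API call. Returns a response ending in a
    'DECISION: include|exclude' verdict plus a rationale line."""
    return "DECISION: exclude\nRATIONALE: single-arm study, no comparator"

def screen_record(record_id: str, full_text: str) -> Decision:
    # Assemble the prompt: criteria, screening instructions, article text.
    prompt = (
        "You are screening full-text articles for a systematic review.\n"
        f"Eligibility criteria:\n{ELIGIBILITY_CRITERIA}\n"
        "Address each criterion, justify your reasoning, then end with\n"
        "'DECISION: include' or 'DECISION: exclude'.\n\n"
        f"Article text:\n{full_text}"
    )
    reply = llm_call(prompt)
    include = "DECISION: include" in reply
    rationale = reply.split("RATIONALE:", 1)[-1].strip()
    return Decision(record_id, include, rationale)

# Records the model does NOT exclude would then pass to a human reviewer,
# which is where the reported time savings arise.
decision = screen_record("rec-001", "Methods: open-label, single-arm...")
```

The key design point is that the model's exclusions are taken at face value only after validation of its negative predictive value; everything it does not exclude still receives human review.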
RESULTS: Of 426 full-text records, 22 (5.2%) problematic records (e.g., unreadable text) were removed. Among 336 true exclusions, the LLM correctly assigned 223 (66.4%), with only one incorrect exclusion (NPV: 0.996). The process completed in under 40 minutes. In a workflow where a human reviewer screens only the records not excluded by the AI tool, time savings of 42.3% could be achieved versus human review of all records.
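The two accuracy figures follow directly from the reported counts, as the arithmetic check below shows. (The 42.3% time-saving estimate additionally depends on per-record screening times not stated in the abstract, so it is not re-derived here.)

```python
# Arithmetic check of the accuracy metrics reported in the abstract.
true_exclusions = 336          # human-confirmed exclusions
llm_correct_exclusions = 223   # true exclusions the LLM also excluded
llm_incorrect_exclusions = 1   # includable record the LLM wrongly excluded

# Share of true exclusions the LLM caught.
exclusion_recall = llm_correct_exclusions / true_exclusions  # ≈ 66.4%

# Negative predictive value: of everything the LLM excluded, the
# fraction that was genuinely excludable.
llm_excluded = llm_correct_exclusions + llm_incorrect_exclusions
npv = llm_correct_exclusions / llm_excluded  # 223 / 224 ≈ 0.996

print(f"{exclusion_recall:.1%}", f"{npv:.3f}")  # → 66.4% 0.996
```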
CONCLUSIONS: This study supports the use of AI for improving full-text screening efficiency. Calibrating tools for high sensitivity, even at the expense of specificity, may provide an optimal balance of accuracy and efficiency. Improved models or refinements to prompting may further improve efficiency gains.
Conference/Value in Health Info
2026-05, ISPOR 2026, Philadelphia, PA, USA
Value in Health, Volume 29, Issue S6
Code
MSR80
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas