ACCELERATING EVIDENCE GENERATION: LEVERAGING LLMS FOR FULL-TEXT STUDY SELECTION
Author(s)
Christopher Olsen, BHSc, Jayson Brian Habib, MPH, Elizabeth Salvo-Halloran, MSc, Sumeet Singh, BScPhm, MSc, Nicole Ferko, MSc.
Value & Evidence, EVERSANA, Victoria, BC, Canada.
OBJECTIVES: Large language models (LLMs) can expedite the systematic literature review (SLR) process; however, most published applications focus on abstract screening. Full-text screening is a resource-intensive phase of SLRs where efficiencies could be achieved with AI tools; however, maintaining accuracy remains critical, as suboptimal sensitivity at this stage risks excluding relevant studies. The objective of the current research was to validate a proprietary LLM-based tool to support full-text screening using a recently published SLR.
METHODS: The AI-assisted tool was implemented as an R-based application integrating LLMs via API calls for preliminary full-text screening. A recently published PRISMA-compliant SLR of clinical trials in Crohn’s disease was selected, comprising 426 records previously screened by two human reviewers; these were submitted to the LLM (GPT-4o-2024-11-20). The LLM was prompted with the study eligibility criteria, screening instructions, and the full-text articles for evaluation. For each record, the model addressed each individual criterion, provided justifications, and rendered an eligibility decision. Exclusion decisions and rationale were assessed for agreement and accuracy against the original human decisions. A human reviewer was consulted for verification where LLM decisions deviated from the original review.
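The screening workflow described above (prompt with criteria and full text, per-criterion reasoning, final eligibility decision) can be sketched as follows. The published tool is a proprietary R application; this is an illustrative Python sketch in which the criteria text, the `llm_call` stub, and the response format are all hypothetical stand-ins, not the tool's actual interface.

```python
# Illustrative sketch of an LLM-assisted full-text screening loop.
# llm_call() is a stub; a real implementation would POST the prompt to a
# model API (e.g., GPT-4o) and return the model's text response.
from dataclasses import dataclass

# Abbreviated, hypothetical eligibility criteria for illustration only.
ELIGIBILITY_CRITERIA = """\
Population: adults with Crohn's disease
Study design: randomized clinical trials
"""

@dataclass
class Decision:
    record_id: str
    include: bool
    rationale: str

def llm_call(prompt: str) -> str:
    """Stand-in for a real API call. Returns a response ending in a
    'DECISION: include|exclude' verdict plus a rationale line."""
    return "DECISION: exclude\nRATIONALE: single-arm study, no comparator"

def screen_record(record_id: str, full_text: str) -> Decision:
    # Assemble the prompt: criteria, screening instructions, article text.
    prompt = (
        "You are screening full-text articles for a systematic review.\n"
        f"Eligibility criteria:\n{ELIGIBILITY_CRITERIA}\n"
        "Address each criterion, justify your reasoning, then end with\n"
        "'DECISION: include' or 'DECISION: exclude'.\n\n"
        f"Article text:\n{full_text}"
    )
    reply = llm_call(prompt)
    include = "DECISION: include" in reply
    rationale = reply.split("RATIONALE:", 1)[-1].strip()
    return Decision(record_id, include, rationale)

# Records the model does NOT exclude would then pass to a human reviewer,
# which is where the reported time savings arise.
decision = screen_record("rec-001", "Methods: open-label, single-arm...")
```

The key design point is that the model's exclusions are taken at face value only after validation of its negative predictive value; everything it does not exclude still receives human review.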
RESULTS: Of 426 full-text records, 22 (5.2%) problematic records (e.g., unreadable text) were removed. Among 336 true exclusions, the LLM correctly assigned 223 (66.4%), with only one incorrect exclusion (NPV: 0.996). The process completed in under 40 minutes. In a workflow where a human reviewer screens only the records not excluded by the AI tool, time savings of 42.3% could be achieved versus human review of all records.
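The two accuracy figures follow directly from the reported counts, as the arithmetic check below shows. (The 42.3% time-saving estimate additionally depends on per-record screening times not stated in the abstract, so it is not re-derived here.)

```python
# Arithmetic check of the accuracy metrics reported in the abstract.
true_exclusions = 336          # human-confirmed exclusions
llm_correct_exclusions = 223   # true exclusions the LLM also excluded
llm_incorrect_exclusions = 1   # includable record the LLM wrongly excluded

# Share of true exclusions the LLM caught.
exclusion_recall = llm_correct_exclusions / true_exclusions  # ≈ 66.4%

# Negative predictive value: of everything the LLM excluded, the
# fraction that was genuinely excludable.
llm_excluded = llm_correct_exclusions + llm_incorrect_exclusions
npv = llm_correct_exclusions / llm_excluded  # 223 / 224 ≈ 0.996

print(f"{exclusion_recall:.1%}", f"{npv:.3f}")  # → 66.4% 0.996
```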
CONCLUSIONS: This study supports the use of AI for improving full-text screening efficiency. Calibrating tools for high sensitivity, even at the expense of specificity, may provide an optimal balance of accuracy and efficiency. Improved models or refinements to prompting may further improve efficiency gains.
Conference/Value in Health Info
2026-05, ISPOR 2026, Philadelphia, PA, USA
Value in Health, Volume 29, Issue S6
Code
MSR80
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas