ASSESSMENT OF AUTONOMOUS LITERATURE REVIEW SYSTEMS FOR UMBRELLA REVIEW SCREENING
Author(s)
Kevin Kallmes, BS, MA, JD1, Vicki Young, MD2, Elizabeth Salvo-Halloran, MS3, Sumeet Singh, BSc, MSc4, Nicole Ferko, MSc3
1Nested Knowledge, CEO, St. Paul, MN, USA, 2Systematic Review Ltd., London, United Kingdom, 3EVERSANA, Burlington, ON, Canada, 4EVERSANA, Nepean, ON, Canada
OBJECTIVES: Leaders in evidence synthesis methodology have recently published guidance on the use of Artificial Intelligence (AI) in systematic reviews, including the Responsible AI for Systematic Evidence (RAISE) guidelines and the PRISMA-Transparent Reporting of Artificial Intelligence in Comprehensive Evidence Synthesis (PRISMA-trAIce) framework. Core principles for responsible integration of AI include transparency, traceability, and maintaining human‑in‑the‑loop approaches. Criteria‑Based Screening (CBS) supports transparent, compliant abstract and full‑text AI screening using human‑defined Yes/No questions that mirror the review’s eligibility criteria. We assessed the performance of CBS‑enabled AI screening through comparison to human decisions from a published umbrella review of systematic reviews.
METHODS: We assessed the accuracy of CBS in the Nested Knowledge platform using an analysis sample comprising records screened in a published umbrella review of the safety of proton pump inhibitors, using human decisions as a gold standard. Population, Interventions/Comparators, Outcomes, and Study Design (PICOS) criteria were converted into dichotomous Yes/No questions. AI inclusion required meeting all four criteria at abstract level and five at full‑text, with automatic PRISMA tracking. Recall, Precision, and overall Accuracy of AI‑generated decisions were calculated against human screening.
RESULTS: Of 775 candidate records, humans advanced 94 to full-text review and included 43. The AI found all criteria satisfied for 148 abstracts and 58 full texts. Recall was 97.7% at both abstract and full-text stages, Precision was 28.4% at abstract stage and 72.4% at full-text, and Accuracy was 86.2% and 81.9%, respectively. Full-text false positives were qualitatively analysed; the majority met all criteria except that they focused on overly narrow disease states.
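The reported metrics follow the standard definitions against the human gold standard. A minimal sketch below computes them from a binary confusion matrix; the abstract-stage counts in the usage line (TP=42, FP=106, TN=626, FN=1) are an illustrative reconstruction consistent with the reported figures (43 human-included records, 148 AI-advanced abstracts, 775 total), not values stated in the abstract.

```python
def screening_metrics(tp, fp, tn, fn):
    """Recall, Precision, and Accuracy for binary screening decisions,
    treating human decisions as the gold standard."""
    recall = tp / (tp + fn)                    # sensitivity to true includes
    precision = tp / (tp + fp)                 # share of AI includes that are correct
    accuracy = (tp + tn) / (tp + fp + tn + fn) # overall agreement with humans
    return recall, precision, accuracy

# Abstract stage, using hypothetically reconstructed counts:
r, p, a = screening_metrics(tp=42, fp=106, tn=626, fn=1)
print(f"Recall {r:.1%}, Precision {p:.1%}, Accuracy {a:.1%}")
# -> Recall 97.7%, Precision 28.4%, Accuracy 86.2%
```

The high Recall/low Precision pattern at abstract stage is the expected trade-off for a screening tool tuned to avoid missing eligible records.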
CONCLUSIONS: Highly traceable, autonomous AI screening on individual criteria was feasible and achieved high Recall and Accuracy in an umbrella review. CBS's full traceability extends beyond abstract-only, black-box methods and may be useful either in fully autonomous targeted reviews or to support human-in-the-loop systematic reviews.
Conference/Value in Health Info
2026-05, ISPOR 2026, Philadelphia, PA, USA
Value in Health, Volume 29, Issue S6
Code
MSR34
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas