ASSESSING AI AND ML TOOL PERFORMANCE IN SLRS: A TARGETED LITERATURE REVIEW AND PERFORMANCE BENCHMARK FRAMEWORK

Author(s)

Raju Gautam, PhD1, Saeed Anwar, MSc2, Ratna Pandey, MSc2, Khushbu Baranwal, MSc2, Tushar Srivastava, MSc1;
1ConnectHEOR, London, United Kingdom, 2ConnectHEOR, Delhi, India
OBJECTIVES: The use of artificial intelligence/machine learning (AI/ML) tools in systematic literature reviews (SLRs) has increased due to their ability to streamline this resource-intensive process. However, despite growing adoption, health technology assessment (HTA) agencies lack clear, evidence-based performance benchmarks for assessing the reliability of AI-assisted SLR workflows. This review aims to summarize reported performance metrics of AI-based SLR tools and propose evidence-informed optimal benchmarks for AI performance.
METHODS: A targeted review was conducted in PubMed, Google Scholar, and the ISPOR database, limited to sources published in the last 5 years. Eligible sources evaluated AI-based SLR tools and reported at least one performance metric (accuracy, sensitivity, or specificity). SLR steps of interest included title/abstract screening, full-text screening, and data extraction. Performance metrics were extracted and synthesized descriptively.
RESULTS: A total of 25 studies were identified: 8 full publications and 17 ISPOR abstracts/posters. Reported performance covered key SLR stages, different AI models, and various disease areas. For title/abstract screening, accuracy (AI-human identical decisions) ranged from 10-100%, sensitivity (correct inclusion by AI) from 14-≥99%, and specificity (correct exclusion by AI) from 19-99%. For full-text screening, AI tools demonstrated consistently high sensitivity (76-99%), while specificity (19-77%) and accuracy (40-98.5%) were more variable, reflecting conservative exclusion approaches designed to minimize missed evidence. Only five studies reported accuracy for data extraction (40-100%). Based on the synthesized evidence, optimal performance benchmarks relevant to HTA standards were proposed: title/abstract screening - accuracy ≥90%, sensitivity ≥95%, specificity ≥85%; full-text screening - accuracy ≥95%, sensitivity ≥98%, specificity ≥90%; data extraction - accuracy ≥95%.
CONCLUSIONS: Current evidence indicates that well-configured and validated AI-based tools can achieve sensitivity and specificity comparable to human reviewers. For HTA agencies, sensitivity should be the primary endpoint, to minimize missed evidence. AI-assisted SLRs should be implemented within transparent, human-in-the-loop workflows. These performance benchmarks provide a practical validation framework to support HTA agencies' confidence in AI-enabled evidence synthesis.
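As a minimal sketch of how the three metrics and the proposed thresholds could be operationalized in a validation exercise, the snippet below computes accuracy, sensitivity, and specificity from a standard confusion matrix of AI vs. human screening decisions and checks them against the title/abstract-screening benchmarks proposed in the abstract (accuracy ≥90%, sensitivity ≥95%, specificity ≥85%). This is illustrative only: the function names, variable names, and example counts are assumptions, not part of the reviewed studies.

```python
def screening_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity from AI vs. human screening decisions.

    tp: records the AI included and the human also included
    fp: records the AI included but the human excluded
    tn: records the AI excluded and the human also excluded
    fn: records the AI excluded but the human included (missed evidence)
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,        # AI-human identical decisions
        "sensitivity": tp / (tp + fn),        # correct inclusion by AI
        "specificity": tn / (tn + fp),        # correct exclusion by AI
    }

# Proposed title/abstract-screening benchmarks from the review
TA_BENCHMARKS = {"accuracy": 0.90, "sensitivity": 0.95, "specificity": 0.85}

def meets_benchmarks(metrics, benchmarks):
    """Return a pass/fail flag per metric against its threshold."""
    return {name: metrics[name] >= threshold
            for name, threshold in benchmarks.items()}

# Hypothetical validation set: 950 true inclusions caught, 30 missed,
# 80 false inclusions, 1940 true exclusions.
m = screening_metrics(tp=950, fp=80, tn=1940, fn=30)
print(meets_benchmarks(m, TA_BENCHMARKS))
# → {'accuracy': True, 'sensitivity': True, 'specificity': True}
```

In line with the conclusion above, a real deployment would weight sensitivity most heavily, since a false exclusion (fn) means evidence is lost from the review entirely, whereas a false inclusion (fp) is caught downstream by the human reviewer.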

Conference/Value in Health Info

2026-05, ISPOR 2026, Philadelphia, PA, USA

Value in Health, Volume 29, Issue S6

Code

MSR88

Topic

Methodological & Statistical Research

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics

Disease

No Additional Disease & Conditions/Specialized Treatment Areas
