PERFORMANCE METRICS FOR AI IN SLRS: A RAPID LITERATURE REVIEW OF GUIDELINE RECOMMENDATIONS FOR ACCEPTABLE THRESHOLDS
Author(s)
Stephan N. Martin, MPH; Alyssa Simon, MPH; Grace E. Fox, PhD
OPEN Health, New York, NY, USA
OBJECTIVES: With artificial intelligence (AI) increasingly used in systematic literature reviews (SLRs), we aimed to identify and synthesize guidance from health technology assessment (HTA) bodies and other industry organizations about measuring AI performance.
METHODS: An AI web crawler was used to review methodological standards published between 2023 and 2025 about using AI in SLRs. Sources included reporting guidelines (e.g., PRISMA), methodological frameworks (RAISE guidelines), and position statements from HTA bodies (NICE, CDA, IQWiG). Information was extracted on performance metrics and acceptable performance thresholds for AI used in title/abstract screening and full-text review.
RESULTS: While HTA bodies have not published numerical thresholds for performance metrics, consensus is emerging among HTA bodies and other sources that AI must have very high sensitivity, must not miss any important evidence, and must be non-inferior to dual-human processes. IQWiG’s 2023 methodology guidance allows validated AI classifiers to support study selection, citing Cochrane’s Study Classifier, which achieves 99% sensitivity, as an example of such a tool. Precision, by contrast, is considered secondary to sensitivity, and modest precision may be deemed acceptable if it safeguards sensitivity. Additionally, although a workload reduction of >30% is typically required for tool viability, CDA’s position statement noted that any efficiency gained with AI must be balanced against potential risks (i.e., missing evidence due to lower sensitivity). In the absence of explicit thresholds, transparency serves as a proxy, with the RAISE guidelines and the PRISMA-trAIce checklist recommending that authors report appropriate performance metrics and the validation methodology for any AI tool(s) used.
CONCLUSIONS: Although our research did not identify specific numerical thresholds for AI performance metrics, we advise those using AI for HTA-grade SLRs to prioritize high sensitivity, even at the cost of lower precision, to align with emerging regulatory standards, provided a meaningful workload reduction is still achieved. Under this approach, AI excludes only clearly irrelevant records while humans verify all potential inclusions.
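As a minimal illustrative sketch (not part of the abstract), the metrics discussed above can be computed from an AI screener's confusion-matrix counts; the function name and the example counts below are hypothetical and chosen only to mirror the 99% sensitivity and >30% workload-reduction figures cited in the results.

```python
def screening_metrics(tp: int, fp: int, fn: int, tn: int):
    """Sensitivity, precision, and workload reduction for an AI
    title/abstract screener versus human review of all records.

    tp = relevant records the AI includes
    fp = irrelevant records the AI includes
    fn = relevant records the AI excludes (missed evidence)
    tn = irrelevant records the AI excludes
    """
    sensitivity = tp / (tp + fn)   # share of relevant records retained
    precision = tp / (tp + fp)     # share of AI inclusions that are relevant
    total = tp + fp + fn + tn
    # Records the AI excludes no longer require human screening
    workload_reduction = (fn + tn) / total
    return sensitivity, precision, workload_reduction

# Hypothetical example: 99 of 100 relevant records retained,
# 600 of 1,000 total records excluded from human review.
sens, prec, wr = screening_metrics(tp=99, fp=301, fn=1, tn=599)
print(f"sensitivity={sens:.2f}, precision={prec:.2f}, workload reduction={wr:.0%}")
```

The example illustrates the trade-off the abstract describes: precision is modest (about 0.25 here), yet sensitivity meets the 0.99 benchmark and the workload reduction clears the >30% viability threshold.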
Conference/Value in Health Info
2026-05, ISPOR 2026, Philadelphia, PA, USA
Value in Health, Volume 29, Issue S6
Code
SA35
Topic
Study Approaches
Topic Subcategory
Literature Review & Synthesis
Disease
No Additional Disease & Conditions/Specialized Treatment Areas