Validating Loon Lens 1.0 for Autonomous Abstract Screening and Confidence-Guided Human-in-the-Loop Workflows in Systematic Reviews

Abstract

Objectives

Title and abstract screening is a labor-intensive step in systematic literature reviews (SLRs). We examine the performance of Loon Lens 1.0, an agentic artificial intelligence platform for autonomous title and abstract screening, and test whether its confidence scores can be used to target minimal human oversight.

Methods

A total of 8 SLRs by Canada’s Drug Agency (3796 citations; 287 includes, 7.6%) were rescreened by dual human reviewers with adjudication and, separately, by Loon Lens, using predefined eligibility criteria. Accuracy, sensitivity, precision, and specificity were calculated, and 95% confidence intervals were generated by bootstrapping. Logistic regression models with (1) confidence alone and (2) confidence plus the Include/Exclude decision were used to predict errors and to inform simulated human-in-the-loop strategies.
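The metric and interval calculations described above can be illustrated with a minimal sketch. It assumes a percentile bootstrap over citations; the abstract does not state which bootstrap variant was used. The function names (`screening_metrics`, `bootstrap_ci`) and the toy labels are hypothetical and are not study data.

```python
import random

def screening_metrics(truth, pred):
    """Accuracy, sensitivity, specificity, and precision for boolean lists
    (True = include). truth is the adjudicated label, pred the screening call."""
    tp = sum(t and p for t, p in zip(truth, pred))
    tn = sum((not t) and (not p) for t, p in zip(truth, pred))
    fp = sum((not t) and p for t, p in zip(truth, pred))
    fn = sum(t and (not p) for t, p in zip(truth, pred))

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, len(truth)),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
    }

def bootstrap_ci(truth, pred, metric, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval (an assumed variant): resample citations
    with replacement and take the alpha/2 and 1 - alpha/2 quantiles."""
    rng = random.Random(seed)
    n = len(truth)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        value = screening_metrics([truth[i] for i in idx],
                                  [pred[i] for i in idx])[metric]
        if value == value:  # drop NaN from resamples with an empty denominator
            stats.append(value)
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lo, hi

# Hypothetical toy data: 20 true includes (18 caught), 180 true excludes (10 flagged).
truth = [True] * 20 + [False] * 180
pred = [True] * 18 + [False] * 2 + [True] * 10 + [False] * 170

point = screening_metrics(truth, pred)
ci_low, ci_high = bootstrap_ci(truth, pred, "accuracy", n_boot=500)
```

Resampling whole citations keeps each record's reference label and screening call paired, so all four metrics can be recomputed from the same resample.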

Results

Loon Lens achieved 95.5% accuracy (95% CI 94.8-96.1), 98.9% sensitivity (97.6-100), 95.2% specificity (94.5-95.9), and 63.0% precision (58.4-67.3). Errors clustered in Low- and Medium-confidence Includes. The extended logistic regression model (confidence + decision; C-index 0.98) estimated a 75% error probability for Low-confidence Includes versus 0.1% for Very-High-confidence Excludes. Simulated human-in-the-loop review of Low- and Medium-confidence Includes only (145 citations, 3.8%) lifted precision to 81.4% and overall accuracy to 98.2% while preserving sensitivity (99.0%). Adding High-confidence Includes (221 citations, 5.8%) pushed precision to 89.9% and accuracy to 99.0%.
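The simulated human-in-the-loop strategy can be sketched as follows. This is a simplified illustration, not the authors' simulation code: it assumes that a human reviewer always reproduces the adjudicated label, and the `Record` type, the confidence labels, and the 100-record toy dataset are hypothetical.

```python
from collections import namedtuple

# truth and decision are booleans (True = include); confidence is a label.
Record = namedtuple("Record", "truth decision confidence")

def metrics(records, final):
    """Accuracy, sensitivity, and precision of final decisions vs. reference."""
    pairs = [(r.truth, f) for r, f in zip(records, final)]
    tp = sum(t and f for t, f in pairs)
    tn = sum((not t) and (not f) for t, f in pairs)
    fp = sum((not t) and f for t, f in pairs)
    fn = sum(t and (not f) for t, f in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "precision": tp / (tp + fp),
    }

def apply_human_review(records, review_levels):
    """Send Include calls at the given confidence levels to a reviewer.
    Assumption: the reviewer's decision equals the reference label."""
    final, reviewed = [], 0
    for r in records:
        if r.decision and r.confidence in review_levels:
            final.append(r.truth)
            reviewed += 1
        else:
            final.append(r.decision)
    return final, reviewed

# Hypothetical toy dataset (not study data): false positives sit mostly
# in lower-confidence Include calls.
records = (
    [Record(True, True, "very_high")] * 10
    + [Record(True, True, "low")] * 2
    + [Record(True, False, "very_high")] * 1
    + [Record(False, True, "low")] * 6
    + [Record(False, True, "medium")] * 3
    + [Record(False, True, "high")] * 2
    + [Record(False, True, "very_high")] * 1
    + [Record(False, False, "very_high")] * 75
)

baseline = metrics(records, [r.decision for r in records])

final_lm, n_lm = apply_human_review(records, {"low", "medium"})
after_lm = metrics(records, final_lm)

final_lmh, n_lmh = apply_human_review(records, {"low", "medium", "high"})
after_lmh = metrics(records, final_lmh)
```

Because only Include calls are routed to review, this policy can remove false positives but cannot recover missed includes, so under the perfect-reviewer assumption sensitivity stays fixed while precision and accuracy rise.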

Conclusions

Across 8 SLRs (3796 citations), Loon Lens 1.0 reproduced adjudicated human screening with 98.9% sensitivity and 95.2% specificity. In simulation, restricting human-in-the-loop review to ≤5.8% of citations by prioritizing Include calls below Very-High confidence reduced false positives and increased precision to 89.9% while maintaining sensitivity and raising overall accuracy to 99.0%. These findings indicate that confidence-guided oversight can concentrate reviewer effort on a small subset of records.

Authors

Ghayath Janoudi, Mara Uzun, Tim Disher, Mia Jurdana, Ena Fuzul, Josip Ivkovic, Brian Hutton

Back to Volume 28, Issue 11

ISPOR–The Professional Society for
Health Economics and Outcomes Research
