A TALE OF TWO THRESHOLDS: ADAPTIVE THRESHOLDING WITH ABSTENTION FOR AI-ASSISTED SINGLE SCREENING (AISS)
Author(s)
Artur Nowak, MSc, Monika Opalek, PhD, Ewelina Sadowska, MPharm, Ewa Borowiack, MSc;
Evidence Prime, Krakow, Poland
OBJECTIVES: To evaluate an AI-assisted single screening (AISS) strategy designed to maintain high recall while reducing manual workload. Title and abstract screening in systematic reviews is a high-effort, unbalanced process prone to "silent misses" when automation is applied. This study specifically assesses the effectiveness of abstention (postponing decisions) as a safety mechanism to mitigate risks arising from poor model fit or ambiguous inclusion criteria.
METHODS: We simulated screening on 160 Cochrane reviews, tuning strategy parameters on 60 and evaluating on a held-out test set of 100 reviews. The two-threshold strategy turns classifier output into one of three actions (include, exclude, postpone). We tested the method under simulated realistic human errors (random errors and borderline-case uncertainty) plus two stress tests: automation bias (high human-model correlation) and model collapse (a model trained on random labels). Outcomes included recall, conflict rate, and the number of records deferred for manual review.
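As an illustration only, the mapping from a classifier score to the three actions could be sketched as follows. The function name `triage`, the parameters `lower` and `upper`, the threshold values 0.2 and 0.8, and the example scores are all hypothetical. The abstract does not state how thresholds are tuned or how the human decision is combined with the model's action, so neither is modeled here.

```python
from collections import Counter


def triage(score: float, lower: float, upper: float) -> str:
    """Map a classifier score in [0, 1] to one of three actions.

    Hypothetical rule: scores at or above `upper` are included,
    scores at or below `lower` are excluded, and everything in
    between is postponed (deferred for manual review).
    """
    if lower > upper:
        raise ValueError("lower must not exceed upper")
    if score >= upper:
        return "include"
    if score <= lower:
        return "exclude"
    return "postpone"


# Made-up scores for six records; not data from the study.
scores = [0.02, 0.10, 0.45, 0.55, 0.91, 0.97]
actions = [triage(s, lower=0.2, upper=0.8) for s in scores]

# Share of records left for manual review under these assumed thresholds.
counts = Counter(actions)
deferred_fraction = counts["postpone"] / len(scores)

print(actions)
print(f"deferred: {deferred_fraction:.1%}")
```

Under this sketch, widening the gap between the two thresholds defers more records to humans, while narrowing it automates more decisions.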
RESULTS: Under realistic human error scenarios, the strategy maintained high recall (>= 95%), with only a small fraction of records (2-3%) finalized with 'postpone' status and kept for manual review. With a higher rate of errors on borderline cases, recall decreased modestly but conflict increased, providing a clearer warning signal for conflict resolution. In the model collapse stress test, the strategy deferred substantially more records (12.6%) yet preserved high recall (97.5%), demonstrating a safety mechanism that trades automation for manual screening when scores are unreliable. In contrast, under automation bias, recall dropped sharply (<90%) while both conflict and abstention stayed low, indicating a high-risk regime where agreement can mask systematic misses.
CONCLUSIONS: Adaptive thresholding with abstention can preserve recall by shifting uncertain cases back to humans, and conflict and abstention rates provide useful safety signals. We intentionally used a weaker (non-LLM) model to test the strategy’s behavior; the approach is classifier-agnostic, and evaluation with an LLM-based classifier is ongoing.
Conference/Value in Health Info
2026-05, ISPOR 2026, Philadelphia, PA, USA
Value in Health, Volume 29, Issue S6
Code
MSR77
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas