OPTIMIZING THE INTEGRATION OF AI IN A COMPLEX SYSTEMATIC LITERATURE REVIEW: VALIDATION ACROSS TITLE/ABSTRACT AND FULL-TEXT STAGES
Author(s)
Yuting Kuang, PhD1, Lenon Mendes Pereira, PhD2, He Jin, MS1, Julien Heidt, MS3, Jennifer Uyei, PhD1
1IQVIA, San Mateo, CA, USA, 2IQVIA, Boston, MA, USA, 3IQVIA, Carlsbad, CA, USA
OBJECTIVES: Both human-only and AI-only approaches to literature screening are susceptible to errors, albeit of different types. This study evaluated the optimal degree of AI integration in the screening process of a complex systematic literature review (SLR) that maximizes sensitivity (recall), precision, specificity, F1 (balance between precision and recall), and time savings.
METHODS: An SLR was conducted following a pre-specified protocol and PICOS eligibility criteria. Screening was performed on 862 titles/abstracts and 204 full-text records under five conditions: (1) double human screening (two independent human reviewers) as gold standard; (2) single human screening; (3) double human-AI screening (independent human and AI reviewers); (4) single AI screening with human quality check; and (5) single AI screening. AI screening used a GPT-4 LLM pipeline with prompts based on PICOS criteria and refined on 30 pilot references. Inclusion decisions were compared to the gold standard.
RESULTS: For title/abstract screening, single AI screening achieved ~90% recall and ~60% precision (specificity ~79%, F1 ~0.72), indicating the inclusion of a large proportion of false positives. Double human-AI screening (condition 3) achieved the strongest performance: recall 98%, precision 83%, specificity 93%, F1 score 0.90, and 94.1% agreement with the gold standard. Among the disagreements, 2% (18 references) were errors in the gold standard, suggesting that AI can improve screening quality. For full-text screening, single AI screening demonstrated high precision (96%) but lower recall (78%), risking false exclusions. Condition 3 achieved 100% recall, 97% precision, and F1 0.99, matching the gold standard.
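The reported F1 scores follow from the standard definition of F1 as the harmonic mean of precision and recall. A minimal sketch of that consistency check (the helper function is illustrative only, not part of the study's screening pipeline):

```python
# Hypothetical helper: F1 as the harmonic mean of precision and recall.
def f1_score(precision: float, recall: float) -> float:
    """Return the F1 score for the given precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Single AI title/abstract screening: precision ~0.60, recall ~0.90
print(f"{f1_score(0.60, 0.90):.2f}")  # 0.72, matching the reported F1 ~0.72

# Double human-AI screening (condition 3): precision 0.83, recall 0.98
print(f"{f1_score(0.83, 0.98):.2f}")  # 0.90, matching the reported F1 0.90
```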
CONCLUSIONS: Double human-AI screening offers a viable alternative to double human screening in SLRs, delivering efficiency gains with comparable or increased accuracy. While current AI models show strong performance in hybrid configurations, limitations in nuanced eligibility interpretation remain. Agentic solutions are expected to outperform current AI models, and future validation should explore the interplay between human and agentic AI.
Conference/Value in Health Info
2026-05, ISPOR 2026, Philadelphia, PA, USA
Value in Health, Volume 29, Issue S6
Code
SA10
Topic
Study Approaches
Topic Subcategory
Literature Review & Synthesis
Disease
No Additional Disease & Conditions/Specialized Treatment Areas