OPTIMIZING THE INTEGRATION OF AI IN A COMPLEX SYSTEMATIC LITERATURE REVIEW: VALIDATION ACROSS TITLE/ABSTRACT AND FULL-TEXT STAGES
Author(s)
Yuting Kuang, PhD1, Lenon Mendes Pereira, PhD2, He Jin, MS1, Julien Heidt, MS3, Jennifer Uyei, PhD1
1IQVIA, San Mateo, CA, USA, 2IQVIA, Boston, MA, USA, 3IQVIA, Carlsbad, CA, USA
OBJECTIVES: Both human-only and AI-only approaches to literature screening are susceptible to errors, albeit of different types. This study evaluated the optimal degree of AI integration in the screening process of a complex systematic literature review (SLR) that maximizes sensitivity (recall), precision, specificity, F1 (balance between precision and recall), and time savings.
METHODS: An SLR was conducted following a pre-specified protocol and PICOS eligibility criteria. Screening was performed on 862 titles/abstracts and 204 full-text records under five conditions: (1) double human screening (two independent human reviewers) as gold standard; (2) single human screening; (3) double human-AI screening (independent human and AI reviewers); (4) single AI screening with human quality check; and (5) single AI screening. AI screening used a GPT-4 LLM pipeline with prompts based on PICOS criteria and refined on 30 pilot references. Inclusion decisions were compared to the gold standard.
RESULTS: For title/abstract screening, single AI screening achieved ~90% recall and ~60% precision (specificity ~79%, F1 ~0.72), indicating the inclusion of a large proportion of false positives. Double human-AI screening (condition 3) achieved the strongest performance: recall 98%, precision 83%, specificity 93%, F1 score 0.90, and 94.1% agreement with the gold standard. Among the disagreements, 2% (18 references) were errors in the gold standard, suggesting that AI can improve screening quality. For full-text screening, single AI screening demonstrated high precision (96%) but lower recall (78%), risking false exclusions. Condition 3 achieved 100% recall, 97% precision, and F1 0.99, matching the gold standard.
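The reported F1 scores follow from the standard definition of F1 as the harmonic mean of precision and recall. A minimal sketch of that consistency check (the helper function is illustrative only, not part of the study's screening pipeline):

```python
# Hypothetical helper: F1 as the harmonic mean of precision and recall.
def f1_score(precision: float, recall: float) -> float:
    """Return the F1 score for the given precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Single AI title/abstract screening: precision ~0.60, recall ~0.90
print(f"{f1_score(0.60, 0.90):.2f}")  # 0.72, matching the reported F1 ~0.72

# Double human-AI screening (condition 3): precision 0.83, recall 0.98
print(f"{f1_score(0.83, 0.98):.2f}")  # 0.90, matching the reported F1 0.90
```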
CONCLUSIONS: Double human-AI screening offers a viable alternative to double human screening in SLRs, delivering efficiency gains with comparable or increased accuracy. While current AI models show strong performance in hybrid configurations, limitations in nuanced eligibility interpretation remain. Agentic solutions are expected to outperform current AI models, and future validation should explore the interplay between human and agentic AI.
Conference/Value in Health Info
2026-05, ISPOR 2026, Philadelphia, PA, USA
Value in Health, Volume 29, Issue S6
Code
SA10
Topic
Study Approaches
Topic Subcategory
Literature Review & Synthesis
Disease
No Additional Disease & Conditions/Specialized Treatment Areas