A COMPARATIVE ANALYSIS OF LARGE LANGUAGE MODELS IN TITLE AND ABSTRACT SCREENING GUIDED BY HUMAN OVERSIGHT

Author(s)

Ankita Sood, PharmD, Ritesh Dubey, PharmD, Sunil Kumar, M.Pharm, Marjana Bharali, B.E., Gagandeep Kaur, M.Pharm, Rajdeep Kaur, PhD, Barinder Singh, RPh;
Pharmacoevidence, Mohali, India
OBJECTIVES: Systematic literature reviews (SLRs) are considered the gold standard in evidence-based medicine; however, conducting them is resource-intensive, expensive, and time-consuming. This study evaluates the relative efficiency of large language models (LLMs) in automating title and abstract screening for SLRs, in alignment with the NICE and CDA-AMC guidelines.
METHODS: EMBASE®, Medline®, and Cochrane were searched to identify relevant randomized controlled trials (RCTs) related to a psychiatric disorder. A Python-based interface was developed to support automated title and abstract screening using multiple LLMs (Claude Sonnet 3.7, Gemini Flash 2.5, and GPT4-o-mini), guided by predefined inclusion and exclusion criteria. Screening decisions were finalized when all models agreed; records with discordant outputs were escalated for manual review. A subject matter expert (SME) with over a decade of domain knowledge optimized and fine-tuned the final prompt, then conducted quality control on a sample of artificial intelligence (AI)-processed records to ensure accuracy and assess overall model performance.
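The consensus rule described above (finalize when all models agree, escalate otherwise) can be illustrated with a minimal Python sketch. This is not the authors' interface: the function names, the record IDs, the `model_a`/`model_b`/`model_c` keys, and the `manual_review` label are all hypothetical, and the model decisions are assumed to have already been collected as plain "include"/"exclude" strings.

```python
from typing import Dict, List, Tuple

MANUAL_REVIEW = "manual_review"  # hypothetical label for escalated records


def consensus_decision(votes: Dict[str, str]) -> str:
    """Return the shared decision if every model agrees, else escalate."""
    if not votes:
        raise ValueError("at least one model decision is required")
    # Normalize so that "Include" and "include " count as the same decision.
    distinct = {v.strip().lower() for v in votes.values()}
    if len(distinct) == 1:
        return distinct.pop()
    return MANUAL_REVIEW


def triage(records: Dict[str, Dict[str, str]]) -> Tuple[Dict[str, str], List[str]]:
    """Split records into finalized decisions and IDs needing manual review."""
    finalized: Dict[str, str] = {}
    escalated: List[str] = []
    for record_id, votes in records.items():
        decision = consensus_decision(votes)
        if decision == MANUAL_REVIEW:
            escalated.append(record_id)
        else:
            finalized[record_id] = decision
    return finalized, escalated


# Toy input for illustration only; not data from the study.
example = {
    "rec-001": {"model_a": "include", "model_b": "include", "model_c": "include"},
    "rec-002": {"model_a": "exclude", "model_b": "include", "model_c": "exclude"},
    "rec-003": {"model_a": "exclude", "model_b": "exclude", "model_c": "exclude"},
}
finalized, escalated = triage(example)
print(finalized)
print(escalated)
```

Requiring unanimity rather than a majority vote means a single dissenting model is enough to send a record to a human reviewer, which trades some automation for a lower risk of wrongly excluding a relevant record.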
RESULTS: All three LLMs performed well in title and abstract screening. Although accuracy did not differ significantly between models, Claude Sonnet 3.7 had the highest accuracy at 97.34%, followed by Gemini Flash 2.5 at 95.05% and GPT4-o-mini at 93.48%. Claude Sonnet 3.7 also had the highest sensitivity at 98.79%, followed by Gemini Flash 2.5 at 94.86% and GPT4-o-mini at 93.66%.
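For readers less familiar with the two metrics reported above, the following Python sketch shows how accuracy and sensitivity are conventionally computed when a model's screening decisions are compared against a reference standard. It assumes "include" is the positive class and that the reference decisions come from a human reviewer; the abstract does not state how the reference standard was constructed, and the function name and toy data are hypothetical, not results from the study.

```python
from typing import Dict, List


def screening_metrics(model: List[str], reference: List[str]) -> Dict[str, float]:
    """Compute accuracy and sensitivity, treating "include" as positive."""
    if len(model) != len(reference) or not model:
        raise ValueError("inputs must be non-empty and of equal length")
    tp = fp = tn = fn = 0
    for predicted, actual in zip(model, reference):
        if actual == "include":
            if predicted == "include":
                tp += 1  # relevant record correctly included
            else:
                fn += 1  # relevant record wrongly excluded
        else:
            if predicted == "include":
                fp += 1  # irrelevant record wrongly included
            else:
                tn += 1  # irrelevant record correctly excluded
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Sensitivity is undefined when the reference has no relevant records.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    return {"accuracy": accuracy, "sensitivity": sensitivity}


# Toy decisions for illustration only.
model_decisions = ["include", "include", "exclude", "exclude", "include"]
reference_decisions = ["include", "exclude", "exclude", "exclude", "include"]
metrics = screening_metrics(model_decisions, reference_decisions)
print(metrics)
```

In screening, sensitivity is usually the more consequential of the two, because a wrongly excluded relevant record is lost to the review, whereas a wrongly included one is caught at full-text screening.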
CONCLUSIONS: This study demonstrates that AI can be effectively incorporated into the SLR process to facilitate title and abstract screening. All evaluated LLMs achieved accuracy rates exceeding 90%, suggesting that combining AI automation with expert oversight significantly reduces manual effort while maintaining accuracy and the quality of review outcomes.

Conference/Value in Health Info

2026-05, ISPOR 2026, Philadelphia, PA, USA

Value in Health, Volume 29, Issue S6

Code

MSR106

Topic

Methodological & Statistical Research

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics
