Performance of ChatGPT-4o and Claude 3.5 Sonnet in Title and Abstract Screening for a Systematic Review in Obstetrics

Author(s)

Suppachai Insuk, PharmD, MSc, BCPS1, Kansak Boonpattharatthiti, PharmD2, Chimbun Booncharoen, PharmD Student1, Panitnan Chaipitak, PharmD Student1, Muhammed Rashid, PhD3, Sajesh Veettil, PhD4, Nai Ming Lai, PhD5, Nathorn Chaiyakunapruk, PharmD, PhD3, Teerapon Dhippayom, PharmD, PhD2.
1Faculty of Pharmaceutical Sciences, Naresuan University, Phitsanulok, Thailand, 2The Research Unit of Evidence Synthesis (TRUES), Faculty of Pharmaceutical Sciences, Naresuan University, Phitsanulok, Thailand, 3Department of Pharmacotherapy, College of Pharmacy, University of Utah, Salt Lake City, UT, USA, 4Department of Pharmacy Practice, School of Pharmacy, IMU University, Kuala Lumpur, Malaysia, 5School of Medicine, Faculty of Health and Medical Sciences, Taylor's University, Subang Jaya, Malaysia.
OBJECTIVES: The use of generative AI models such as ChatGPT and Claude in systematic review workflows is increasing, particularly for title and abstract screening; however, comparative performance data remain limited. This study aimed to evaluate the performance of ChatGPT-4o and Claude 3.5 Sonnet against junior researchers in the title and abstract screening stage of a systematic review in obstetrics, using an experienced researcher's decisions as the reference standard.
METHODS: A literature search was conducted in PubMed, EMBASE, Cochrane CENTRAL, and EBSCO Open Dissertations from inception to February 2024 on pharmacological interventions for smoking cessation during pregnancy. Retrieved records were screened by ChatGPT-4o, Claude 3.5 Sonnet (using a structured prompt), and two junior researchers. Screening performance was measured against an experienced researcher's decisions using accuracy, sensitivity (recall), precision, F1-score, and negative predictive value (NPV).
RESULTS: The search yielded 1,648 unique titles/abstracts. Junior researchers achieved the highest accuracy (0.9593) and F1-score (0.3853). Claude performed slightly better than ChatGPT, with an accuracy of 0.9448 versus 0.9138 and an F1-score of 0.3724 versus 0.2755, respectively. Both AI models showed identical recall (0.8182), higher than that of the junior researchers (0.6364). All screeners exhibited high NPV: Claude (0.9961), ChatGPT (0.9959), and junior researchers (0.9924).
CONCLUSIONS: While junior human researchers had the highest overall accuracy, generative AI models, particularly Claude, performed comparably in title/abstract screening for this review, showing high recall and NPV. This suggests AI holds potential as a supportive tool for this stage, though human oversight remains necessary.
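The metrics reported above all derive from a standard 2x2 confusion matrix against the reference screener's decisions. The sketch below shows how they are computed; the counts used in the example are hypothetical (the abstract does not report raw TP/FP/FN/TN counts), chosen only to illustrate why recall can be high while the F1-score stays low when true includes are rare.

```python
def screening_metrics(tp, fp, fn, tn):
    """Compute standard screening-performance metrics from a 2x2
    confusion matrix (counts vs. the reference screener's decisions)."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn)           # sensitivity: share of true includes caught
    precision = tp / (tp + fp)        # share of flagged records that were true includes
    f1 = 2 * precision * recall / (precision + recall)
    npv = tn / (tn + fn)              # share of exclusions that were correct
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f1": f1, "npv": npv}

# Hypothetical counts for illustration only: 22 true includes among
# 1,648 records, of which a screener catches 18 and over-includes 80.
m = screening_metrics(tp=18, fp=80, fn=4, tn=1546)
print({k: round(v, 4) for k, v in m.items()})
```

With so few true includes relative to the total, accuracy and NPV are dominated by the large pool of correct exclusions, while precision (and therefore F1) is dragged down by over-inclusion; this is consistent with the pattern of high NPV but modest F1-scores across all screeners in this study.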

Conference/Value in Health Info

2025-09, ISPOR Real-World Evidence Summit 2025, Tokyo, Japan

Value in Health Regional, Volume 49S (September 2025)

Code

RWD158

Topic Subcategory

Reproducibility & Replicability

Disease

STA: Generics
