Zero-Shot RCT Identification Using Large Language Models: A Comparative Study With the Cochrane Classifier

Author(s)

Seye Abogunrin, MPH, MSc, MD1, Roberto Rey Sieiro, BSc, MSc2, Marie Lane, BSc3.
1Global Access Evidence Leader, F. Hoffmann-La Roche Ltd, Basel, Switzerland, 2Roche Farma, S.A., Madrid, Spain, 3F. Hoffmann-La Roche Ltd, Basel, Switzerland.
OBJECTIVES: Automatically identifying randomized controlled trials (RCTs) can accelerate systematic reviews by aiding record filtering, work assignment, and screening. While specialized tools exist, modern large language models (LLMs) offer a promising alternative. We evaluated the performance of a prompt-engineered GPT-4.1 model for RCT identification, directly comparing it to the Cochrane classifier, an established benchmark. The goal was to assess the overall classification effectiveness of a general-purpose LLM versus a specialized tool.
METHODS: A specialized prompt was developed to instruct GPT-4.1 to distinguish RCTs in a title and abstract dataset. For comparison, the same dataset was processed by the Cochrane classifier. Key performance metrics—accuracy, precision, recall, and F1-score—were calculated for both methods to provide a comprehensive evaluation of their RCT identification capabilities.
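The zero-shot workflow described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the prompt wording, function names, and the decoupled `llm` callable are all assumptions made for clarity.

```python
from typing import Callable

def build_rct_prompt(title: str, abstract: str) -> str:
    """Compose a zero-shot instruction asking the model to label one record.

    The wording here is a hypothetical stand-in for the specialized prompt
    the authors developed, which is not given in the abstract.
    """
    return (
        "You are screening records for a systematic review.\n"
        "Decide whether the study described below is a randomized "
        "controlled trial (RCT). Answer with exactly one word: YES or NO.\n\n"
        f"Title: {title}\nAbstract: {abstract}"
    )

def parse_label(response_text: str) -> bool:
    """Map the model's one-word answer to a boolean RCT label."""
    return response_text.strip().upper().startswith("YES")

def classify_record(llm: Callable[[str], str], title: str, abstract: str) -> bool:
    """Classify one title/abstract record with zero task-specific training.

    `llm` is any function that sends a prompt to a chat model (e.g. GPT-4.1)
    and returns the completion text; it is injected so the sketch stays
    self-contained and independent of any particular client library.
    """
    return parse_label(llm(build_rct_prompt(title, abstract)))
```

In a real run, `classify_record` would be mapped over each of the title and abstract records, and the resulting labels compared against a gold standard to compute the metrics below.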
RESULTS: The RCT prompt was applied in a zero-shot workflow to a dataset of 2,000 title and abstract records. The LLM demonstrated superior overall performance, achieving an F1-score of 0.737, substantially outperforming the Cochrane classifier’s score of 0.485. The LLM’s performance was characterized by high accuracy (91.1%) and high precision (78.1%), with a strong recall of 69.8%. While the Cochrane classifier achieved a higher recall (86.3%), this came at a steep cost to its precision (33.7%) and overall accuracy (67.4%), leading to a lower F1-score and a high volume of false positives.
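The reported F1-scores follow directly from the stated precision and recall values, since F1 is their harmonic mean. A minimal check:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Values as reported in the abstract
llm_f1 = f1_score(0.781, 0.698)       # GPT-4.1: precision 78.1%, recall 69.8%
cochrane_f1 = f1_score(0.337, 0.863)  # Cochrane: precision 33.7%, recall 86.3%

print(round(llm_f1, 3), round(cochrane_f1, 3))  # → 0.737 0.485
```

Both rounded values match the abstract's figures, illustrating how the Cochrane classifier's high recall is outweighed by its low precision in the combined score.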
CONCLUSIONS: A prompt-engineered GPT-4.1 provides a potentially more balanced and robust solution for RCT identification than the Cochrane classifier. The Cochrane classifier is optimized for recall, which can lead to unnecessary review of title and abstract records in a double-screening situation. The LLM’s substantially higher F1-score highlights its ability to effectively distinguish RCTs, enabling review teams to assess and manage datasets efficiently. Its superior overall performance, delivering high accuracy and precision and strong recall with no task-specific training, makes it a more practical tool for resource-constrained systematic review workflows.

Conference/Value in Health Info

2025-11, ISPOR Europe 2025, Glasgow, Scotland

Value in Health, Volume 28, Issue S2

Code

SA107

Topic

Health Technology Assessment, Methodological & Statistical Research, Study Approaches

Disease

No Additional Disease & Conditions/Specialized Treatment Areas
