Evaluating the Efficacy of Artificial Intelligence Models as Suitable Screening Tools in Systematic Reviews: The Example of Glucagon-Like Peptide-1 Receptor Agonists in Heart Failure Treatment
Author(s)
Kyriacos Ioannou, Medical student1, Sotiris Christoforou, Medical student1, Maria Kounnafi, Medical student1, Theodoros Christophides, MSc, MD2, Gisella Figlioli, PhD3, Daniele Piovani, MSc, PhD3, Stefanos Bonovas, MD3, Georgios Nikolopoulos, PhD1.
1Medical School, University of Cyprus, Nicosia, Cyprus, 2Department of Cardiology, Nicosia General Hospital, Nicosia, Cyprus, 3Humanitas University, Pieve Emanuele, Italy.
OBJECTIVES: This study aimed to evaluate the efficiency and accuracy of artificial intelligence (AI) models in screening studies for an ongoing systematic review and meta-analysis. Two models were assessed: ChatGPT and DeepSeek.
METHODS: The AI models screened articles for inclusion in a systematic review and meta-analysis on the use of Glucagon-Like Peptide-1 Receptor Agonists (GLP-1 RAs) in heart failure treatment. The screening phase comprised two parts: a) manual screening based on predefined criteria, and b) AI-assisted screening, in which those criteria were used to construct prompts for the AI models. Specifically, to be considered eligible, a study had to be a Randomized Controlled Trial (RCT) investigating GLP-1 RAs, involve patients diagnosed with heart failure, and report at least one relevant predetermined outcome. The time taken for each response from the AI models was also recorded. Of note, the AI models had access only to publicly available information from the web.
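The AI-assisted arm described above can be sketched as a prompt template. This is a minimal illustration only: the authors' exact prompt wording is not reported, so the criteria list, function name, and output labels below are assumptions.

```python
# Hypothetical sketch: turning the review's three eligibility criteria into a
# screening prompt for a chat model. The actual prompts used in the study
# are not reported in the abstract.

CRITERIA = [
    "The study is a Randomized Controlled Trial (RCT) investigating GLP-1 RAs.",
    "The study involves patients diagnosed with heart failure.",
    "The study reports at least one relevant predetermined outcome.",
]

def build_screening_prompt(title: str, abstract: str) -> str:
    """Assemble a prompt asking the model to classify one record as
    eligible, partially eligible, or not eligible."""
    criteria_text = "\n".join(f"{i}. {c}" for i, c in enumerate(CRITERIA, 1))
    return (
        "You are screening studies for a systematic review.\n"
        "Inclusion criteria:\n"
        f"{criteria_text}\n\n"
        f"Title: {title}\n"
        f"Abstract: {abstract}\n\n"
        "Answer with one label: eligible, partially eligible, or not eligible."
    )

print(build_screening_prompt(
    "Semaglutide in patients with heart failure",
    "A randomized controlled trial of ...",
))
```

The same template is reused for every record, so the two models receive identical instructions and differ only in their responses.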
RESULTS: Among the 48 eligible studies, ChatGPT correctly identified 35 (72.92%) (including studies labelled as partially eligible), while DeepSeek correctly identified 31 (64.58%). Among the 230 non-eligible studies, ChatGPT correctly classified 195 (84.78%) as non-eligible, while DeepSeek correctly classified 201 (87.39%). In total, ChatGPT correctly classified 230 of the 278 studies (82.73%), and DeepSeek 232 (83.45%). Regarding time efficiency, the average time per eligibility assessment was 25.87 seconds for ChatGPT and 32.80 seconds for DeepSeek.
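The reported percentages follow directly from the counts in the abstract, treating eligibility as the positive class. A quick check reproduces them (and shows that 35/48 rounds to 72.92%):

```python
# Reproduce the abstract's screening metrics from its raw counts.
eligible_total = 48       # manually determined eligible studies
noneligible_total = 230   # manually determined non-eligible studies
total = eligible_total + noneligible_total  # 278 screened records

def pct(x: int, n: int) -> float:
    """Percentage rounded to two decimals, as in the abstract."""
    return round(100 * x / n, 2)

# (model, eligible correctly identified, non-eligible correctly identified)
for model, tp, tn in [("ChatGPT", 35, 195), ("DeepSeek", 31, 201)]:
    sensitivity = pct(tp, eligible_total)      # ChatGPT 72.92, DeepSeek 64.58
    specificity = pct(tn, noneligible_total)   # ChatGPT 84.78, DeepSeek 87.39
    accuracy = pct(tp + tn, total)             # ChatGPT 82.73, DeepSeek 83.45
    print(model, sensitivity, specificity, accuracy)
```

Framed this way, the trade-off is clear: ChatGPT was more sensitive (fewer eligible studies missed), while DeepSeek was more specific and marginally more accurate overall.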
CONCLUSIONS: This study highlights the potential of AI models to support the screening phase of systematic reviews, with DeepSeek showing slightly better overall accuracy and ChatGPT faster response times.
Conference/Value in Health Info
2025-11, ISPOR Europe 2025, Glasgow, Scotland
Value in Health, Volume 28, Issue S2
Code
MSR93
Topic
Epidemiology & Public Health, Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
Cardiovascular Disorders (including MI, Stroke, Circulatory)