Evaluating the Efficacy of Artificial Intelligence Models as Suitable Screening Tools in Systematic Reviews: The Example of Glucagon-Like Peptide-1 Receptor Agonists in Heart Failure Treatment
Author(s)
Kyriacos Ioannou, Medical student1, Sotiris Christoforou, Medical student1, Maria Kounnafi, Medical student1, Theodoros Christophides, MSc, MD2, Gisella Figlioli, PhD3, Daniele Piovani, MSc, PhD3, Stefanos Bonovas, MD3, Georgios Nikolopoulos, PhD1.
1Medical School, University of Cyprus, Nicosia, Cyprus, 2Department of Cardiology, Nicosia General Hospital, Nicosia, Cyprus, 3Humanitas University, Pieve Emanuele, Italy.
OBJECTIVES: This study aimed to evaluate the efficiency and accuracy of artificial intelligence (AI) models in screening studies for an ongoing systematic review and meta-analysis. Two models were assessed: ChatGPT and DeepSeek.
METHODS: The AI models screened articles for inclusion in a systematic review and meta-analysis on the use of Glucagon-Like Peptide-1 Receptor Agonists (GLP-1 RAs) in heart failure treatment. The screening phase comprised two parts: a) manual screening based on predefined criteria, and b) AI-assisted screening, in which those criteria were used to construct prompts for the AI models. Specifically, to be considered eligible, a study had to be a Randomized Controlled Trial (RCT) investigating GLP-1 RAs, involve patients diagnosed with heart failure, and report at least one relevant predetermined outcome. The time taken for each response from the AI models was also recorded. Of note, the AI models had access only to publicly available information from the web.
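The AI-assisted arm described above can be sketched as a prompt template. This is a minimal illustration only: the authors' exact prompt wording is not reported, so the criteria list, function name, and output labels below are assumptions.

```python
# Hypothetical sketch: turning the review's three eligibility criteria into a
# screening prompt for a chat model. The actual prompts used in the study
# are not reported in the abstract.

CRITERIA = [
    "The study is a Randomized Controlled Trial (RCT) investigating GLP-1 RAs.",
    "The study involves patients diagnosed with heart failure.",
    "The study reports at least one relevant predetermined outcome.",
]

def build_screening_prompt(title: str, abstract: str) -> str:
    """Assemble a prompt asking the model to classify one record as
    eligible, partially eligible, or not eligible."""
    criteria_text = "\n".join(f"{i}. {c}" for i, c in enumerate(CRITERIA, 1))
    return (
        "You are screening studies for a systematic review.\n"
        "Inclusion criteria:\n"
        f"{criteria_text}\n\n"
        f"Title: {title}\n"
        f"Abstract: {abstract}\n\n"
        "Answer with one label: eligible, partially eligible, or not eligible."
    )

print(build_screening_prompt(
    "Semaglutide in patients with heart failure",
    "A randomized controlled trial of ...",
))
```

The same template is reused for every record, so the two models receive identical instructions and differ only in their responses.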
RESULTS: Among the 48 eligible studies, ChatGPT correctly identified 35 (72.92%) (including studies labelled as partially eligible), while DeepSeek correctly identified 31 (64.58%). Among the 230 non-eligible studies, ChatGPT correctly classified 195 (84.78%) as non-eligible, while DeepSeek correctly classified 201 (87.39%). In total, ChatGPT correctly classified 230 of the 278 studies (82.73%), and DeepSeek 232 (83.45%). Regarding time efficiency, the average time per eligibility assessment was 25.87 seconds for ChatGPT and 32.80 seconds for DeepSeek.
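The reported percentages follow directly from the counts in the abstract, treating eligibility as the positive class. A quick check reproduces them (and shows that 35/48 rounds to 72.92%):

```python
# Reproduce the abstract's screening metrics from its raw counts.
eligible_total = 48       # manually determined eligible studies
noneligible_total = 230   # manually determined non-eligible studies
total = eligible_total + noneligible_total  # 278 screened records

def pct(x: int, n: int) -> float:
    """Percentage rounded to two decimals, as in the abstract."""
    return round(100 * x / n, 2)

# (model, eligible correctly identified, non-eligible correctly identified)
for model, tp, tn in [("ChatGPT", 35, 195), ("DeepSeek", 31, 201)]:
    sensitivity = pct(tp, eligible_total)      # ChatGPT 72.92, DeepSeek 64.58
    specificity = pct(tn, noneligible_total)   # ChatGPT 84.78, DeepSeek 87.39
    accuracy = pct(tp + tn, total)             # ChatGPT 82.73, DeepSeek 83.45
    print(model, sensitivity, specificity, accuracy)
```

Framed this way, the trade-off is clear: ChatGPT was more sensitive (fewer eligible studies missed), while DeepSeek was more specific and marginally more accurate overall.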
CONCLUSIONS: This study highlights the potential of AI models to support the screening phase of systematic reviews, with DeepSeek showing slightly better overall accuracy and ChatGPT faster response times.
Conference/Value in Health Info
2025-11, ISPOR Europe 2025, Glasgow, Scotland
Value in Health, Volume 28, Issue S2
Code
MSR93
Topic
Epidemiology & Public Health, Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
Cardiovascular Disorders (including MI, Stroke, Circulatory)