Can Artificial Intelligence (AI) Accurately Screen Abstracts in Systematic Literature Reviews?
Author(s)
Gultyaev D1, Hummel N2, Koller L1, Stern L3, Kadambi A4, de Moor C3
1Certara GmbH, Lörrach, Germany, 2Certara GmbH, Lörrach, BW, Germany, 3Certara Inc., Princeton, NJ, USA, 4Certara Inc., San Mateo, CA, USA
OBJECTIVES: One of the most time-consuming tasks in systematic literature reviews (SLRs) is the screening of abstracts according to PICOS (Population, Intervention, Comparator, Outcome, Study design) criteria. In this study, we aimed to assess the accuracy and potential efficiency gains of ChatGPT4-supported abstract screening in SLRs.
METHODS: We evaluated ChatGPT4’s performance in screening abstracts for previously performed SLRs in two indications: achondroplasia and advanced metastatic renal cell carcinoma (RCC). Possible screening decisions were definite inclusion (‘Yes’), definite exclusion (‘No’), or ‘Unclear’, where ChatGPT4 was unable to decide because the abstract provided insufficient information. For abstracts with a definite ChatGPT4 decision, ChatGPT4 and human decisions (‘Yes’/‘No’) were compared, and precision, recall, F1 score (harmonic mean of precision and recall), accuracy, and specificity were calculated. Assuming that abstracts receiving a definite answer from ChatGPT4 would not require human verification, the maximum time savings of AI-supported abstract screening were estimated as the proportion of all screened abstracts that received a definite answer.
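As an illustration of the comparison described above, the following is a minimal sketch of how the reported metrics could be computed from paired decisions. The function name `screening_metrics` and the list-of-strings input format are assumptions made for this example and are not taken from the study. The human decision is treated as the reference standard, and abstracts that the AI marked ‘Unclear’ are excluded from the confusion matrix.

```python
import math


def screening_metrics(ai_decisions, human_decisions):
    """Compare AI decisions ('Yes'/'No'/'Unclear') with human decisions ('Yes'/'No').

    Hypothetical helper for illustration only; the human decision is
    treated as the reference standard.
    """
    if len(ai_decisions) != len(human_decisions):
        raise ValueError("decision lists must have the same length")

    tp = fp = tn = fn = 0
    for ai, human in zip(ai_decisions, human_decisions):
        if ai == "Unclear":
            continue  # no definite AI decision: left for human screening
        if ai not in ("Yes", "No") or human not in ("Yes", "No"):
            raise ValueError(f"unexpected decision pair: {ai!r}, {human!r}")
        if ai == "Yes" and human == "Yes":
            tp += 1
        elif ai == "Yes" and human == "No":
            fp += 1
        elif ai == "No" and human == "No":
            tn += 1
        else:
            fn += 1

    def ratio(numerator, denominator):
        # Return NaN instead of raising when a metric is undefined.
        return numerator / denominator if denominator else math.nan

    definite = tp + fp + tn + fn
    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        # Harmonic mean of precision and recall, written in terms of counts.
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "accuracy": ratio(tp + tn, definite),
        "specificity": ratio(tn, tn + fp),
        # Share of abstracts with a definite answer (maximum time savings).
        "definite_share": ratio(definite, len(ai_decisions)),
    }
```

Calling `screening_metrics(ai, human)` with two equally long lists returns all six values in one dictionary; undefined ratios (for example, precision when the AI never answers ‘Yes’) come back as NaN rather than raising an error.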
RESULTS: Among the 179 abstracts screened for achondroplasia, 38 were categorized as ‘Unclear’. For the remaining 141 abstracts, precision was 0.81, recall was 0.95, F1 score was 0.88, accuracy was 0.91, and specificity was 0.90. Maximum time savings amounted to 79%. Among the 551 abstracts screened for RCC, 83 were categorized as ‘Unclear’. For the remaining 468 abstracts, precision was 0.72, recall was 0.73, F1 score was 0.72, accuracy was 0.87, and specificity was 0.91. Maximum time savings amounted to 85%.
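The maximum time savings follow directly from the reported counts, as the short sketch below reproduces. The helper name `max_time_savings` is an assumption for this example; the counts are those stated in the results.

```python
def max_time_savings(total_abstracts, unclear_abstracts):
    """Return (definite count, share of abstracts with a definite AI answer)."""
    definite = total_abstracts - unclear_abstracts
    return definite, definite / total_abstracts


# Counts as reported in the abstract: (indication, total screened, 'Unclear').
for indication, total, unclear in [("achondroplasia", 179, 38), ("RCC", 551, 83)]:
    definite, share = max_time_savings(total, unclear)
    print(f"{indication}: {definite}/{total} definite = {share:.0%}")
# achondroplasia: 141/179 definite = 79%
# RCC: 468/551 definite = 85%
```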
CONCLUSIONS: In the two SLRs evaluated, ChatGPT4 was accurate (F1 > 0.7) and highly specific (specificity ≥ 0.9) in abstract screening whenever it returned a definite decision. AI tools such as ChatGPT4 also offer considerable potential time savings and can be used to rapidly assess available evidence for applications outside of health technology assessment (HTA) submissions, e.g., in epidemiology, manuscript preparation, competitive intelligence, and maintenance of living SLRs.
Conference/Value in Health Info
Value in Health, Volume 27, Issue 12, S2 (December 2024)
Code
MSR219
Topic
Methodological & Statistical Research, Study Approaches
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics, Literature Review & Synthesis
Disease
No Additional Disease & Conditions/Specialized Treatment Areas