Can Artificial Intelligence (AI) Accurately Screen Abstracts in Systematic Literature Reviews?

Author(s)

Gultyaev D¹, Hummel N², Koller L¹, Stern L³, Kadambi A⁴, de Moor C³
¹Certara GmbH, Lörrach, Germany, ²Certara GmbH, Lörrach, BW, Germany, ³Certara Inc., Princeton, NJ, USA, ⁴Certara Inc., San Mateo, CA, USA

OBJECTIVES: One of the most time-consuming tasks in systematic literature reviews (SLRs) is the screening of abstracts according to PICOS (Population, Intervention, Comparator, Outcome, Study design) criteria. In this study, we aimed to assess the accuracy and potential efficiency gains of ChatGPT4-supported abstract screening in SLRs.

METHODS: We evaluated ChatGPT4’s performance in screening abstracts for previously performed SLRs in two indications: achondroplasia and advanced metastatic renal cell carcinoma (RCC). Possible screening decisions were definite inclusion (‘Yes’), definite exclusion (‘No’), or ‘Unclear’, where ChatGPT4 was unable to decide due to insufficient information provided in the abstract. Definite ChatGPT4 decisions were compared with human decisions (‘Yes’/‘No’), and precision, recall, F1 score (harmonic mean of precision and recall), accuracy, and specificity were calculated. Assuming that abstracts with a definite answer from ChatGPT4 will not require human verification, the proportion of abstracts with definite answers out of the total number of abstracts screened was calculated to estimate the maximum time savings of AI-supported abstract screening.
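
The metrics named above can be illustrated with a minimal Python sketch. It assumes the human decision is the reference standard and that ‘Unclear’ abstracts are excluded before counting. The class name, function name, and example counts are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Counts over abstracts with a definite AI decision (assumed layout)."""
    tp: int  # AI 'Yes', human 'Yes'
    fp: int  # AI 'Yes', human 'No'
    fn: int  # AI 'No',  human 'Yes'
    tn: int  # AI 'No',  human 'No'


def screening_metrics(c: ConfusionCounts) -> dict[str, float]:
    """Compute the standard metrics; assumes every denominator is non-zero."""
    precision = c.tp / (c.tp + c.fp)
    recall = c.tp / (c.tp + c.fn)
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    specificity = c.tn / (c.tn + c.fp)
    accuracy = (c.tp + c.tn) / (c.tp + c.fp + c.fn + c.tn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "specificity": specificity,
        "accuracy": accuracy,
    }


# Hypothetical counts for illustration only (not the study's data).
example = ConfusionCounts(tp=40, fp=10, fn=5, tn=45)
metrics = screening_metrics(example)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

In a screening context, recall describes how many relevant abstracts the AI retains, while specificity describes how many irrelevant abstracts it correctly excludes.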

RESULTS: Among the 179 abstracts screened for achondroplasia, 38 were categorized as ‘Unclear’. For the remaining 141 abstracts, precision was 0.81, recall was 0.95, F1 score was 0.88, accuracy was 0.91, and specificity was 0.90. Maximum time savings amounted to 79%. Among the 551 abstracts screened for RCC, 83 were categorized as ‘Unclear’. For the remaining 468 abstracts, precision was 0.72, recall was 0.73, F1 score was 0.72, accuracy was 0.87, and specificity was 0.91. Maximum time savings amounted to 85%.
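
The reported maximum time savings follow from the abstract counts given above. The short Python sketch below reproduces that arithmetic. The function name is hypothetical, and the calculation assumes, as stated in the methods, that only ‘Unclear’ abstracts would need human review.

```python
def max_time_savings(total: int, unclear: int) -> float:
    """Share of abstracts with a definite AI answer ('Yes' or 'No')."""
    if total <= 0 or not 0 <= unclear <= total:
        raise ValueError("expected total > 0 and 0 <= unclear <= total")
    return (total - unclear) / total


# Counts as reported in the results: (indication, total screened, 'Unclear').
reported = [("achondroplasia", 179, 38), ("RCC", 551, 83)]

for indication, total, unclear in reported:
    share = max_time_savings(total, unclear)
    print(f"{indication}: {total - unclear}/{total} definite, {share:.0%}")
```

This gives 141/179 for achondroplasia and 468/551 for RCC, which round to the reported 79% and 85%.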

CONCLUSIONS: For abstracts with a definite decision, AI tools such as ChatGPT4 were accurate (F1 > 0.7) and highly specific (specificity ≥ 0.9) in abstract screening. Additionally, they offer considerable potential time savings and can be used to rapidly assess available evidence for multiple applications outside of health technology assessment (HTA) submissions, e.g., in epidemiology, manuscript preparation, competitive intelligence, and maintenance of living SLRs.

Conference/Value in Health Info

2024-11, ISPOR Europe 2024, Barcelona, Spain

Value in Health, Volume 27, Issue 12, S2 (December 2024)

Code

MSR219

Topic

Methodological & Statistical Research, Study Approaches

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics, Literature Review & Synthesis

Disease

No Additional Disease & Conditions/Specialized Treatment Areas
