Assessing the Effectiveness of Large Language Models in Automating Systematic Literature Reviews: Findings from Recent Studies

Author(s)

Sumeyye Samur, PhD1, Bhakti Mody, MS1, Rachael Fleurence, MSc, PhD2, Elif Bayraktar, BS1, Turgay Ayer, PhD3, Jag Chhatwal, PhD4
1Value Analytics Labs, Boston, MA, USA; 2National Institutes of Health, Washington, DC, USA; 3Georgia Institute of Technology, Atlanta, GA, USA; 4Massachusetts General Hospital Institute for Technology Assessment, Harvard Medical School, Boston, MA, USA
OBJECTIVES: Large language models (LLMs) can automate different steps of systematic literature reviews (SLRs); however, their performance across different SLR tasks is not well documented. Our objective was to conduct a review of the performance of various LLMs in key tasks of SLRs.
METHODS: We conducted a targeted literature review of SLRs conducted using LLMs. We identified and reviewed 24 studies conducted between January 2023 and January 2025 across fourteen countries. These studies assessed the performance of LLMs such as GPT-4 Turbo, ChatGPT v4.0, GPT-3.5 Turbo, Claude 2, and GPT-4 across multiple SLR tasks, including title and abstract screening, full-text screening, data extraction, and bias assessment. Accuracy, sensitivity, specificity, and reliability were used to evaluate LLM performance in each study.
RESULTS: The performance of LLMs varied across tasks and models. For title screening, GPT-4 Turbo and GPT-3.5 Turbo demonstrated high sensitivity (up to 99.5%) and specificity (up to 99.6%), effectively identifying relevant titles. For abstract screening, ChatGPT v4.0 achieved 94.5% accuracy and 93% sensitivity, while GPT-3.5 Turbo had 99.5% sensitivity but low specificity (2.2%). In data extraction, Claude 2 achieved 96.2% accuracy and GPT-4 achieved 68.8% accuracy. With a PDF-reading plugin, the accuracy of the corresponding tools increased to 98.7% and 68.8%, respectively. For bias assessment, GPT-4 attained a kappa score of 0.90 for abstract screening but showed greater disagreement with human reviewers during full-text review (kappa score 0.65). While the time to conduct SLRs was reduced, none of the studies reported the magnitude of the reduction.
CONCLUSIONS: LLMs exhibit promising capabilities in automating SLR tasks, particularly abstract screening and data extraction, potentially reducing time and cost. However, human involvement remains necessary at this stage. Future research should establish benchmarks for the acceptance of LLM-generated SLRs.

Conference/Value in Health Info

2025-05, ISPOR 2025, Montréal, Quebec, CA

Value in Health, Volume 28, Issue S1

Code

MSR155

Topic

Methodological & Statistical Research

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics

Disease

No Additional Disease & Conditions/Specialized Treatment Areas
