Assessing the Effectiveness of Large Language Models in Automating Systematic Literature Reviews: Findings from Recent Studies
Author(s)
Sumeyye Samur, PhD1, Bhakti Mody, MS1, Rachael Fleurence, MSc, PhD2, Elif Bayraktar, BS1, Turgay Ayer, PhD3, Jag Chhatwal, PhD4;
1Value Analytics Labs, Boston, MA, USA, 2National Institutes of Health, Washington, DC, USA, 3Georgia Institute of Technology, Atlanta, GA, USA, 4Massachusetts General Hospital Institute for Technology Assessment, Harvard Medical School, Boston, MA, USA
OBJECTIVES: Large language models (LLMs) can automate different steps of systematic literature reviews (SLRs); however, their performance across these tasks is not well documented. Our objective was to review the performance of various LLMs on key SLR tasks.
METHODS: We conducted a targeted literature review of SLRs that used LLMs. We identified and reviewed 24 studies conducted between January 2023 and January 2025 across fourteen countries. These studies assessed the performance of LLMs such as GPT-4 Turbo, ChatGPT v4.0, GPT-3.5 Turbo, Claude 2, and GPT-4 across multiple SLR tasks, including title and abstract screening, full-text screening, data extraction, and bias assessment. Accuracy, sensitivity, specificity, and reliability were used to evaluate performance in each study.
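For context only (not part of the original abstract), the screening metrics named above can be computed from a confusion matrix comparing LLM decisions against human reference judgments. The sketch below uses hypothetical labels and plain Python arithmetic; the reviewed studies do not report a common implementation.

```python
# Minimal sketch (hypothetical data): accuracy, sensitivity, and specificity
# of LLM screening decisions against human reference labels (1 = include, 0 = exclude).

human = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]   # reference (human reviewer) decisions
llm   = [1, 0, 1, 0, 0, 1, 1, 0, 0, 1]   # LLM screening decisions

tp = sum(h == 1 and m == 1 for h, m in zip(human, llm))  # true positives
tn = sum(h == 0 and m == 0 for h, m in zip(human, llm))  # true negatives
fp = sum(h == 0 and m == 1 for h, m in zip(human, llm))  # false positives
fn = sum(h == 1 and m == 0 for h, m in zip(human, llm))  # false negatives

accuracy    = (tp + tn) / len(human)   # share of decisions matching the human reference
sensitivity = tp / (tp + fn)           # share of relevant records the LLM retained
specificity = tn / (tn + fp)           # share of irrelevant records the LLM excluded

print(f"accuracy={accuracy:.3f} sensitivity={sensitivity:.3f} specificity={specificity:.3f}")
```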
RESULTS: The performance of LLMs varied across tasks and models. For title screening, GPT-4 Turbo and GPT-3.5 Turbo demonstrated high sensitivity (up to 99.5%) and specificity (up to 99.6%), effectively identifying relevant titles. For abstract screening, ChatGPT v4.0 achieved 94.5% accuracy and 93% sensitivity, while GPT-3.5 Turbo had 99.5% sensitivity but low specificity (2.2%). In data extraction, Claude 2 achieved 96.2% accuracy and GPT-4 achieved 68.8% accuracy. With a PDF-reading plugin, the accuracy of the corresponding tools increased to 98.7% and 68.8%, respectively. For bias assessment, GPT-4 attained a kappa score of 0.90 for abstract screening but showed greater disagreement with human reviewers during full-text review (kappa score 0.65). While the time required to conduct SLRs was reduced, none of the studies reported the magnitude of the reduction.
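The kappa scores reported for bias assessment measure chance-corrected agreement between the LLM and human reviewers. As a reference point only, a minimal Cohen's kappa sketch with hypothetical risk-of-bias ratings is shown below; the abstract does not specify how kappa was computed in each study.

```python
# Minimal sketch (hypothetical ratings): Cohen's kappa for chance-corrected
# agreement between LLM and human reviewers, as reported for bias assessment.

def cohens_kappa(rater_a, rater_b):
    labels = set(rater_a) | set(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n   # observed agreement
    expected = sum(                                                # agreement expected by chance
        (rater_a.count(c) / n) * (rater_b.count(c) / n) for c in labels
    )
    return (observed - expected) / (1 - expected)

human = ["low", "high", "low", "some", "low", "high", "some", "low"]
llm   = ["low", "high", "some", "some", "low", "high", "low", "low"]
print(f"kappa={cohens_kappa(human, llm):.2f}")
```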
CONCLUSIONS: LLMs exhibit promising capabilities in automating SLR tasks, particularly abstract screening and data extraction, potentially reducing time and cost. However, human involvement remains necessary at this stage. Future research should establish benchmarks for the acceptance of LLM-generated SLRs.
Conference/Value in Health Info
2025-05, ISPOR 2025, Montréal, Quebec, CA
Value in Health, Volume 28, Issue S1
Code
MSR155
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas