The Use of Large Language Models for Systematic Literature Review Automation: An Evaluation of Quality and Time Savings
Author(s)
Ryan Thaliffdeen, BS, MS, PharmD1, Meelis Lootus, PhD2, Iradj Reza, MSc, PhD3, Carrie Nielson, PhD4, Lulu Zhao Beatson, BA2, Harriet A. Dickinson5.
1Gilead Sciences, Foster City, CA, USA, 2Tehistark, London, United Kingdom, 3Gilead Sciences, Uxbridge, United Kingdom, 4Gilead Sciences, Salem, MA, USA, 5Director, Gilead, Stockley Park, United Kingdom.
OBJECTIVES: Large language models (LLMs) have shown promise in assisting with many aspects of systematic literature reviews (SLRs). We evaluated the quality and time savings of automating steps within the SLR process, including generating written summaries.
METHODS: We compared an existing human-conducted SLR to a fully automated SLR (‘FullAutoSLR’) conducted on the AutoSLR platform. LLM performance was evaluated across four SLR activities (search, screening, data extraction, report writing), and time savings were estimated. Report writing was iterated using human-in-the-loop feedback, and the final report was assessed qualitatively.
RESULTS: AutoSLR was able to produce an end-to-end SLR fully automatically and facilitate human input for improvements. The initial search query was evaluated as basic and lacked some relevant indexing terms. In title-abstract screening, the FullAutoSLR had an accuracy of 72% (precision=62.5%, recall=55.6%, F1-score=58.8%) compared to the human-conducted review. In the extraction task, good/perfect agreement was achieved by the FullAutoSLR in 71.6% of extractions. Performance was best for treatment characteristics (86.7% good/perfect) and worst for patient characteristics (54.7% good/perfect). The initial FullAutoSLR written report lacked some common SLR sections, a PRISMA diagram, and overall depth. The final report partially met the needs of the initial research question: while it contained a strong introduction and methods section, the results and discussion sections were too high-level. Compared to human-only SLRs, the combined AI-generation, human-review approach was 20.3% faster for title-abstract screening, 61.8% faster for full-text screening, and 55.6% faster for data extraction.
CONCLUSIONS: LLMs can conduct SLRs end-to-end but require human input to achieve high quality. Human feedback was valuable in terms of tailoring the final report to stakeholder expectations. The time savings associated with using LLMs in SLRs were substantial and shifted human tasks from execution-based to review-based activities. This work indicates the key role that AI can play in reducing human time spent on SLR processes.
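As a consistency check on the screening metrics reported in the results, the F1-score can be recomputed from the stated precision and recall. A minimal sketch (the underlying confusion-matrix counts are not given in the abstract, so only the harmonic-mean relationship is verified):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Values reported for FullAutoSLR title-abstract screening
precision = 0.625
recall = 0.556

f1 = f1_score(precision, recall)
print(f"F1 = {f1:.1%}")  # ~58.8%, matching the reported F1-score
```

This confirms the reported metrics are internally consistent; accuracy (72%) cannot be checked the same way without the raw screening counts.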
Conference/Value in Health Info
2025-05, ISPOR 2025, Montréal, Quebec, Canada
Value in Health, Volume 28, Issue S1
Code
MSR41
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas