The Use of Large Language Models for Systematic Literature Review Automation: An Evaluation of Quality and Time Savings

Author(s)

Ryan Thaliffdeen, BS, MS, PharmD1, Meelis Lootus, PhD2, Iradj Reza, MSc, PhD3, Carrie Nielson, PhD4, Lulu Zhao Beatson, BA2, Harriet A. Dickinson5.
1Gilead Sciences, Foster City, CA, USA, 2Tehistark, London, United Kingdom, 3Gilead Sciences, Uxbridge, United Kingdom, 4Gilead Sciences, Salem, MA, USA, 5Director, Gilead, Stockley Park, United Kingdom.
OBJECTIVES: Large language models (LLMs) have shown promise in assisting with many aspects of systematic literature reviews (SLRs). We evaluated the quality and time savings of automating steps within the SLR process, including generating written summaries.
METHODS: We compared an existing human-conducted SLR with a fully automated SLR (‘FullAutoSLR’) conducted on the AutoSLR platform. LLM performance was evaluated across four SLR activities (search, screening, data extraction, report writing), and time savings were estimated. Report writing was iterated using human-in-the-loop feedback, and the final report was assessed qualitatively.
RESULTS: AutoSLR was able to produce an end-to-end SLR fully automatically and facilitate human input for improvements. The initial search query was evaluated as basic and lacked some relevant indexing terms. In title-abstract screening, the FullAutoSLR had an accuracy of 72% (precision=62.5%, recall=55.6%, F1-score=58.8%) compared with the human-conducted review. In the extraction task, good/perfect agreement was achieved by the FullAutoSLR in 71.6% of extractions. Performance was best for treatment characteristics (86.7% good/perfect) and worst for patient characteristics (54.7% good/perfect). The initial FullAutoSLR written report lacked some common SLR sections, a PRISMA diagram, and overall depth. The final report partially met the needs of the initial research question: while it contained a strong introduction and methods section, the results and discussion sections were too high-level. Compared to human-only SLRs, the synergistic AI-generation, human-review approach was 20.3% faster for title-abstract screening, 61.8% faster for full-text screening, and 55.6% faster for data extraction.
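As a sanity check on the screening metrics reported above, the F1-score is the harmonic mean of precision and recall; the reported values are internally consistent. A minimal sketch (the function name is illustrative, not part of the study):

```python
def f1_score(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall:
    F1 = 2 * P * R / (P + R)."""
    return 2 * precision * recall / (precision + recall)

# Reported title-abstract screening metrics: precision 62.5%, recall 55.6%
f1 = f1_score(0.625, 0.556)
print(round(f1 * 100, 1))  # 58.8, matching the reported F1-score
```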
CONCLUSIONS: LLMs can conduct SLRs end-to-end but require human input to achieve high quality. Human feedback was valuable in terms of tailoring the final report to stakeholder expectations. The time savings associated with using LLMs in SLRs were substantial and shifted human tasks from execution-based to review-based activities. This work indicates the key role that AI can play in reducing human time spent on SLR processes.

Conference/Value in Health Info

2025-05, ISPOR 2025, Montréal, Quebec, CA

Value in Health, Volume 28, Issue S1

Code

MSR41

Topic

Methodological & Statistical Research

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics

Disease

No Additional Disease & Conditions/Specialized Treatment Areas
