AUTOMATION OF SYSTEMATIC REVIEWS WITH LARGE LANGUAGE MODELS
Author(s)
Christian Cao, MD;
University of Toronto, Toronto, ON, Canada
OBJECTIVES: Systematic reviews (SRs) inform evidence-based decision making, yet they take over a year to complete, are labor intensive, prone to human error, and face reproducibility challenges, limiting access to timely and reliable information. We aimed to develop and validate a large language model (LLM)-based workflow (otto-SR) to automate the two most labor-intensive tasks in performing SRs, article screening and data extraction, and to assess its feasibility for rapidly updating existing reviews.
METHODS: We conducted a validation study in three phases, with direct benchmarking against graduate-level human researchers in phases 1 and 2. Phase 1: article screening performance was evaluated across 32,357 citations from 5 systematic reviews. Phase 2: data extraction performance was evaluated across 4,495 data points from 7 reviews. Phase 3: otto-SR was used to reproduce and update a complete issue of Cochrane reviews (n=12 reviews), with analytical comparisons to the original meta-analyzed findings.
RESULTS: In the first two phases, otto-SR outperformed traditional dual-human workflows in article screening (otto-SR: 96.7% sensitivity, 97.9% specificity; human: 81.7% sensitivity, 98.1% specificity) and data extraction (otto-SR: 93.1% accuracy; human: 79.7% accuracy). In phase 3, otto-SR reproduced and updated an entire issue of Cochrane reviews (n=12; 146,276 citations) in two days, representing approximately 12 work-years of traditional systematic review effort. Across Cochrane reviews, otto-SR incorrectly excluded a median of 0 studies (IQR 0 to 0.25) and found nearly twice as many eligible studies as the original authors (n=114 vs. 64). Meta-analyses revealed that otto-SR generated newly statistically significant findings in 2 reviews and negated significance in 1 review.
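For readers unfamiliar with the screening metrics reported above, the following is a minimal sketch of how sensitivity and specificity are conventionally derived from confusion-matrix counts. The counts used here are hypothetical placeholders for illustration, not data from this study.

```python
# Illustrative only: sensitivity/specificity from confusion-matrix counts.
# TP = eligible studies correctly included, FN = eligible studies missed,
# TN = ineligible studies correctly excluded, FP = ineligible studies included.

def screening_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Hypothetical counts chosen for illustration:
m = screening_metrics(tp=290, fn=10, tn=9400, fp=200)
print(f"sensitivity={m['sensitivity']:.3f}, specificity={m['specificity']:.3f}")
```

Note that in screening, sensitivity (not missing eligible studies) is usually weighted more heavily than specificity, since false inclusions are caught at later review stages while false exclusions are permanently lost.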
CONCLUSIONS: These findings demonstrate that LLMs can rapidly conduct and update systematic reviews with superhuman performance, laying the foundation for automated, scalable, and reliable evidence synthesis.
Conference/Value in Health Info
2026-05, ISPOR 2026, Philadelphia, PA, USA
Value in Health, Volume 29, Issue S6
Code
MSR79
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas