How Much Time Does Artificial Intelligence Really Save in Evidence Synthesis? A Systematic Literature Review
Author(s)
Ania Bobrowska, BSc, MSc, PhD1, Molly Murton2, Liz Ashworth, BA2, Kallista Chan, PhD2.
1Principal Consultant, Costello Medical, Cambridge, United Kingdom, 2Costello Medical, Cambridge, United Kingdom.
OBJECTIVES: Traditional literature reviews (LRs) can be time- and resource-intensive. We aimed to understand the extent to which artificial intelligence (AI) can save time and reduce workload in the conduct of LRs.
METHODS: MEDLINE and Embase were searched in June 2025. Records were reviewed at the title/abstract stage by two experienced reviewers and at full text by a single reviewer. We included primary research studies that reported the time or workload saved by using AI for a specific aspect of an LR compared with humans. LRs identified by the searches were hand-searched and then excluded. Data were extracted and synthesised qualitatively owing to heterogeneity in outcome reporting. Where possible, hours saved per study were calculated; where ranges were reported, their midpoints were used. Authors' conclusions were subjectively judged as "positive", "cautiously positive" or "neutral/negative" towards AI-generated efficiencies in LRs.
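As an illustration of the synthesis approach described above, the minimal Python sketch below shows how range-reported values could be collapsed to midpoints before computing a median. The input values and the to_point helper are hypothetical, for illustration only; they are not the authors' data or code.

from statistics import median

# Hypothetical saved-hours values: scalars, or (low, high) ranges whose
# midpoints are used, as described in METHODS.
reported = [0.010, (0.005, 0.030), 0.020, (0.010, 0.025)]

def to_point(value):
    """Collapse a (low, high) range to its midpoint; pass scalars through."""
    if isinstance(value, tuple):
        low, high = value
        return (low + high) / 2
    return value

points = [to_point(v) for v in reported]
print(f"Median hours saved per study: {median(points):.3f}")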
RESULTS: Searches produced 2,091 unique hits; 2,011 records were removed after title/abstract review. Ultimately, 56 studies were included. Studies used proprietary tools (n=29), widely available general AI tools such as ChatGPT (n=16) or trained, bespoke algorithms (n=11). Most time savings were reported for study selection at the title/abstract stage (n=45 studies), with fewer studies reporting time saved on quality assessment (n=6), data extraction (n=2), or deduplication, feasibility assessment and search strategy generation (n=1 each). The median time saved per study was 0.017 hours (n=31 data points); the median workload saved was 65% (n=25 data points) and the median time saved was 60% (n=8 data points). Authors were generally positive (n=27) or cautiously positive (n=17), rather than neutral/negative (n=12), about the potential of AI to help conduct LRs.
CONCLUSIONS: Most benefits of AI are currently seen at the screening stage of an LR rather than at the data extraction or quality assessment stages. Comparisons are hampered by the lack of a unified outcome measure for the performance of AI in LRs, in terms of both precision and efficiencies gained.
Conference/Value in Health Info
2025-11, ISPOR Europe 2025, Glasgow, Scotland
Value in Health, Volume 28, Issue S2
Code
SA50
Topic
Methodological & Statistical Research, Study Approaches
Topic Subcategory
Literature Review & Synthesis
Disease
No Additional Disease & Conditions/Specialized Treatment Areas