ACCELERATING LITERATURE REVIEWS WITH LARGE LANGUAGE MODELS (LLMS): AN EVALUATION OF PERFORMANCE AND EFFICIENCY
Author(s)
Raju Gautam, PhD1, Saeed Anwar, MSc2, Tushar Srivastava, MSc1, Ratna Pandey, MSc2.
1ConnectHEOR, London, United Kingdom, 2ConnectHEOR, Delhi, India.
OBJECTIVES: Systematic and other literature reviews (SLRs/LRs) are crucial for health research and evidence-based decision making but are often time- and labor-intensive. Artificial intelligence (AI) tools such as LLMs have shown promise for automating these processes. This research aimed to evaluate the performance and efficiency of an AI-SLR tool.
METHODS: A retrospective analysis was conducted to evaluate the performance and efficiency of a web-based AI-SLR tool (EasySLR™) across four LRs (2 targeted, 2 SLRs; 2 clinical, 2 economic). AI performance (accuracy, sensitivity and specificity) was assessed for title/abstract screening, full-text screening and data-extraction. An AI-only approach was used for title/abstract screening in targeted reviews and for data-extraction in all reviews, while a hybrid AI-human reviewer approach was applied for all other review stages. AI-only and AI-human hybrid performance were compared with retrospectively completed human-only reviews.
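As an illustration of the three performance measures named above, the following is a minimal sketch rather than the tool's actual scoring procedure. It assumes that the human-only decision serves as the reference standard and that each record carries a binary include/exclude decision; the function and variable names are hypothetical and are not part of EasySLR.

```python
from dataclasses import dataclass


@dataclass
class ScreeningMetrics:
    accuracy: float
    sensitivity: float
    specificity: float


def _ratio(numerator: int, denominator: int) -> float:
    # Return NaN rather than raising when a class is absent from the sample.
    return numerator / denominator if denominator else float("nan")


def screening_metrics(ai_include: list[bool], human_include: list[bool]) -> ScreeningMetrics:
    """Compare AI decisions with human reference decisions (True = include)."""
    if len(ai_include) != len(human_include):
        raise ValueError("Both decision lists must cover the same records.")

    pairs = list(zip(ai_include, human_include))
    tp = sum(1 for ai, ref in pairs if ai and ref)          # correct inclusions
    tn = sum(1 for ai, ref in pairs if not ai and not ref)  # correct exclusions
    fp = sum(1 for ai, ref in pairs if ai and not ref)      # wrongly included
    fn = sum(1 for ai, ref in pairs if not ai and ref)      # wrongly excluded

    return ScreeningMetrics(
        accuracy=_ratio(tp + tn, len(pairs)),
        sensitivity=_ratio(tp, tp + fn),
        specificity=_ratio(tn, tn + fp),
    )


# Hypothetical decisions for five records (not data from the study).
example = screening_metrics(
    ai_include=[True, True, False, False, True],
    human_include=[True, False, False, False, True],
)
print(example)
```

In this reading, sensitivity is the share of human-included records that the AI also included, and specificity is the share of human-excluded records that the AI also excluded.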
RESULTS: Sample sizes ranged from 794-1,594 studies for title/abstract screening, 12-92 for full-text screening, and 5-92 for data-extraction. Across all four LRs, AI-human accuracy ranged from 84%-100% for title/abstract screening, 60%-92% for full-text screening and 9%-60% for data-extraction. Sensitivity (correct inclusion by AI) varied from 70%-97% for title/abstract screening and 90%-100% for full-text screening. Specificity (correct exclusion by AI) ranged from 84%-100% for title/abstract screening and 70%-88% for full-text screening. Performance was considerably poorer for clinical reviews than for economic reviews. Compared to human-only LRs, the AI-only reviewer improved efficiency by 100%-150% for title/abstract screening and 300%-500% for data-extraction, but with low accuracy, whereas the hybrid approach improved efficiency by 40%-60% for title/abstract screening and 12%-20% for full-text screening.
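The abstract does not state how the efficiency percentages were calculated. As one possible reading, the sketch below assumes that an efficiency improvement is a throughput gain relative to the human-only review, that is, (human time / assisted time - 1) x 100. Under that assumption a 100% improvement corresponds to halving the review time. The function names and the hour values are hypothetical and are not taken from the study.

```python
def efficiency_gain_percent(human_hours: float, assisted_hours: float) -> float:
    """Throughput gain of an AI-assisted review over a human-only review.

    Assumed definition (not confirmed by the source):
    gain = (human_hours / assisted_hours - 1) * 100
    """
    if human_hours <= 0 or assisted_hours <= 0:
        raise ValueError("Review times must be positive.")
    return (human_hours / assisted_hours - 1.0) * 100.0


def time_saved_percent(gain_percent: float) -> float:
    """Convert a throughput gain into the share of review time saved."""
    if gain_percent <= -100.0:
        raise ValueError("Gain must be greater than -100%.")
    return 100.0 * gain_percent / (100.0 + gain_percent)


# Hypothetical example: a stage taking 10 hours by hand and 5 hours with AI.
print(efficiency_gain_percent(10.0, 5.0))  # throughput gain in percent
print(time_saved_percent(100.0))           # share of time saved in percent
```

If the authors instead defined efficiency as the percentage of time saved, the reported figures would need a different interpretation, since a time saving cannot exceed 100%.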
CONCLUSIONS: The evaluated AI-SLR tool appears promising for straightforward reviews, and its AI-only reviewer feature saved considerable time in title/abstract screening and data-extraction. Its performance for complex reviews and data-extraction requires further improvement. Nevertheless, ongoing and future model developments may improve its suitability for these tasks.
Conference/Value in Health Info
2026-05, ISPOR 2026, Philadelphia, PA, USA
Value in Health, Volume 29, Issue S6
Code
MSR190
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas