Advancing Systematic Literature Reviews: A Comparative Analysis of Large Language Models (Claude Sonnet 3.5, Gemini Flash 1.5, and GPT-4) in the Automation Era of Generative AI
Author(s)
Rai P1, Pandey S1, Attri S2, Singh B3, Kaur R1
1Pharmacoevidence, Mohali, India, 2Pharmacoevidence, Mohali, PB, India, 3Pharmacoevidence, SAS Nagar Mohali, PB, India
OBJECTIVES: In recent years, the advent of large language models (LLMs), such as Claude Sonnet 3.5, Gemini Flash 1.5, and GPT-4, has revolutionized the traditional approach to conducting systematic literature reviews (SLRs). These models differ in their ability to comprehend and synthesize vast volumes of literature, and they offer potential efficiency gains and novel insights. Understanding their comparative efficiency is essential for identifying the most suitable tool in the evolving landscape of AI-driven literature analysis. This research investigates the relative efficiency of these generative AI models in swiftly reviewing publications for SLRs.
METHODS: Embase®, Medline®, and Cochrane were searched to identify relevant randomized controlled trials (RCTs) in the disease area of interest. A subject matter expert with over a decade of domain knowledge optimized and fine-tuned the final prompt, which was used with the Python FastAPI framework to identify evidence meeting the eligibility criteria. Title- and abstract-based screening was conducted using the three AI tools to evaluate their efficiency in identifying eligible publications.
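The abstract does not give the prompt or the service code, so the following is only a minimal sketch of how prompt-based title and abstract screening could be organized. The names `Record`, `PROMPT_TEMPLATE`, `build_prompt`, `parse_decision`, and `screen` are hypothetical, as is the prompt wording. The `ask_model` argument is a placeholder for whatever client calls the LLM, and the FastAPI layer mentioned above is omitted.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass
class Record:
    """One citation retrieved by the database search."""
    record_id: str
    title: str
    abstract: str


# Hypothetical template: the study's actual prompt is not published in the abstract.
PROMPT_TEMPLATE = (
    "You are screening citations for a systematic literature review.\n"
    "Eligibility criteria:\n{criteria}\n\n"
    "Title: {title}\n"
    "Abstract: {abstract}\n\n"
    "Answer with exactly one word: INCLUDE or EXCLUDE."
)


def build_prompt(record: Record, criteria: str) -> str:
    """Fill the template with the eligibility criteria and one citation."""
    return PROMPT_TEMPLATE.format(
        criteria=criteria, title=record.title, abstract=record.abstract
    )


def parse_decision(response: str) -> bool:
    """Return True for include, False for exclude.

    Assumption: any reply that does not clearly start with EXCLUDE is kept,
    so that ambiguous answers are passed on rather than silently dropped.
    """
    return not response.strip().upper().startswith("EXCLUDE")


def screen(
    records: Iterable[Record],
    criteria: str,
    ask_model: Callable[[str], str],
) -> Dict[str, bool]:
    """Screen each record with the supplied model and collect the decisions."""
    return {
        r.record_id: parse_decision(ask_model(build_prompt(r, criteria)))
        for r in records
    }
```

Because `ask_model` is passed in as a plain callable, the same loop could be run once per model (for example, one wrapper function per vendor client) so that all three tools see an identical prompt.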
RESULTS: All three AI models performed well in title- and abstract-based screening, with accuracy above 94%. Although accuracy did not differ significantly between models, Gemini Flash 1.5 had the highest accuracy at 96.02%, followed by GPT-4 at 95.00% and Claude Sonnet 3.5 at 94.69%. GPT-4 had the highest sensitivity at 95.97%, followed by Gemini Flash 1.5 at 94.63% and Claude Sonnet 3.5 at 88.59%.
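Accuracy and sensitivity measure different things, which is why the model rankings above differ between the two metrics. The sketch below uses the standard definitions: accuracy is the share of all decisions that match the reference, and sensitivity is the share of truly eligible records that the model includes. The abstract does not state the reference standard, so the code assumes a list of reference include/exclude decisions (for example, from human reviewers). The toy data are invented for illustration and are not the study's results.

```python
from typing import Sequence, Tuple


def confusion_counts(
    reference: Sequence[bool], predicted: Sequence[bool]
) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn), where True means 'include'."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    tp = fp = tn = fn = 0
    for ref, pred in zip(reference, predicted):
        if ref and pred:
            tp += 1  # eligible record correctly included
        elif ref and not pred:
            fn += 1  # eligible record wrongly excluded
        elif pred:
            fp += 1  # ineligible record wrongly included
        else:
            tn += 1  # ineligible record correctly excluded
    return tp, fp, tn, fn


def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Share of all screening decisions that agree with the reference."""
    return (tp + tn) / (tp + fp + tn + fn)


def sensitivity(tp: int, fn: int) -> float:
    """Share of eligible records that were included (also called recall)."""
    return tp / (tp + fn)


# Toy example only: 3 eligible and 5 ineligible records.
reference = [True, True, True, False, False, False, False, False]
predicted = [True, True, False, False, False, False, True, False]

tp, fp, tn, fn = confusion_counts(reference, predicted)
print(f"accuracy    = {accuracy(tp, fp, tn, fn):.2%}")
print(f"sensitivity = {sensitivity(tp, fn):.2%}")
```

In screening, eligible records are usually a small minority, so a model can reach high accuracy mainly by excluding correctly while still missing eligible studies. Sensitivity isolates that second kind of error.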
CONCLUSIONS: The study highlights the comparative effectiveness of the three AI models under investigation. In practice, attaining 96.02% accuracy with a two-reviewer human process is challenging, and the numerically higher accuracy of Gemini Flash 1.5 relative to the other LLMs in this study suggests that it offers a viable alternative that could revolutionize the screening approach. Future investigations should further explore these capabilities and their application across diverse research domains.
Conference/Value in Health Info
Value in Health, Volume 27, Issue 12, S2 (December 2024)
Code
MSR78
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas