Advancing Systematic Literature Reviews: A Comparative Analysis of Large Language Models (Claude Sonnet 3.5, Gemini Flash 1.5, and GPT-4) in the Automation Era of Generative AI

Author(s)

Rai P¹, Pandey S¹, Attri S², Singh B³, Kaur R¹
¹Pharmacoevidence, Mohali, India, ²Pharmacoevidence, Mohali, PB, India, ³Pharmacoevidence, SAS Nagar Mohali, PB, India

Presentation Documents

MSR78-Comparative Analysis Claude Sonnet 3.5 Gemini Flash 1.5 and GPT-4147202.pdf

OBJECTIVES: In recent years, the advent of large language models (LLMs), such as Claude Sonnet 3.5, Gemini Flash 1.5, and GPT-4, has revolutionized the traditional approach to conducting systematic literature reviews (SLRs). These models exhibit diverse capabilities in comprehending and synthesizing vast volumes of literature, offering potential efficiency gains and novel insights. Understanding their comparative efficiency is essential for identifying the optimal tool in the evolving landscape of AI-driven literature analysis. This research investigates the relative efficiency of these generative AI models in rapidly screening publications for SLRs.

METHODS: Embase®, Medline®, and Cochrane were searched to identify relevant randomized controlled trials (RCTs) in the disease area of interest. A subject matter expert with over a decade of domain knowledge optimized and fine-tuned the final prompt, which was applied through the Python FastAPI framework to identify evidence meeting the eligibility criteria. Title- and abstract-based screening was conducted using three different AI tools to evaluate their efficiency in identifying eligible publications.
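As a rough illustration of prompt-based title and abstract screening, the sketch below builds a prompt from eligibility criteria and maps a model's reply to an include/exclude decision. The abstract does not publish the prompt, the criteria, or the API calls, so everything here is hypothetical: the names `build_prompt`, `parse_decision`, and `screen`, the criteria text, and the stub model. A real pipeline would replace the stub with a call to the chosen LLM.

```python
from typing import Callable

# Hypothetical eligibility criteria; the study's actual criteria are not given.
CRITERIA = [
    "Study design: randomized controlled trial",
    "Population: patients with the disease of interest",
]


def build_prompt(title: str, abstract: str) -> str:
    """Assemble a screening prompt (illustrative wording, not the study's prompt)."""
    criteria = "\n".join(f"- {c}" for c in CRITERIA)
    return (
        "Decide whether this publication meets ALL eligibility criteria.\n"
        f"Criteria:\n{criteria}\n\n"
        f"Title: {title}\nAbstract: {abstract}\n\n"
        "Answer with exactly one word: INCLUDE or EXCLUDE."
    )


def parse_decision(reply: str) -> bool:
    """Return True for INCLUDE, False for EXCLUDE; raise on anything else."""
    word = reply.strip().upper()
    if word == "INCLUDE":
        return True
    if word == "EXCLUDE":
        return False
    raise ValueError(f"Unexpected model reply: {reply!r}")


def screen(title: str, abstract: str, call_model: Callable[[str], str]) -> bool:
    """Screen one record; call_model is any function mapping prompt -> reply."""
    return parse_decision(call_model(build_prompt(title, abstract)))


def stub_model(prompt: str) -> str:
    """Stand-in for an LLM call: includes records that mention 'randomized'."""
    record = prompt.split("Title:", 1)[1]  # look only at the title and abstract
    return "INCLUDE" if "randomized" in record.lower() else "EXCLUDE"
```

Passing the model as a callable keeps the screening logic identical across tools, so each of the three models would only need its own `call_model` wrapper.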

RESULTS: Overall, all three AI models performed exceptionally well in screening based on titles and abstracts. Although there were no significant differences in accuracy, Gemini Flash 1.5 achieved the highest accuracy at 96.02%, followed by GPT-4 at 95.00% and Claude Sonnet 3.5 at 94.69%. In terms of sensitivity, GPT-4 performed best at 95.97%, followed by Gemini Flash 1.5 at 94.63% and Claude Sonnet 3.5 at 88.59%.
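Accuracy and sensitivity are both derived from a comparison of each model's include/exclude decisions with a reference standard (presumably human reviewer decisions; the abstract does not state the reference). A minimal sketch of the two formulas follows, using made-up counts that are not the study's data:

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Share of all screening decisions that match the reference."""
    return (tp + tn) / (tp + tn + fp + fn)


def sensitivity(tp: int, fn: int) -> float:
    """Share of truly eligible records that the model included."""
    return tp / (tp + fn)


# Hypothetical confusion-matrix counts for one model (NOT from the study):
# tp = eligible and included, fn = eligible but excluded,
# tn = ineligible and excluded, fp = ineligible but included.
tp, tn, fp, fn = 90, 880, 20, 10

print(f"accuracy    = {accuracy(tp, tn, fp, fn):.2%}")  # accuracy    = 97.00%
print(f"sensitivity = {sensitivity(tp, fn):.2%}")       # sensitivity = 90.00%
```

When ineligible records dominate the search results, accuracy is driven mostly by true negatives, which is why two models can have similar accuracy but clearly different sensitivity.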

CONCLUSIONS: The study highlights the comparative effectiveness of the three AI models under investigation. In practice, attaining 96.02% accuracy with a two-reviewer human process is challenging; however, the performance of Gemini Flash 1.5 relative to the other LLMs in this study makes it a viable alternative that could transform the screening approach. Future investigations should further explore these capabilities and their application across diverse research domains.

Conference/Value in Health Info

2024-11, ISPOR Europe 2024, Barcelona, Spain

Value in Health, Volume 27, Issue 12, S2 (December 2024)

Code

MSR78

Topic

Methodological & Statistical Research

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics

Disease

No Additional Disease & Conditions/Specialized Treatment Areas
