Expediting Evidence Synthesis: A Review of Recent Research Evaluating Artificial Intelligence's Performance in Evidence Synthesis and Summarization of Clinical Literature

Author(s)

Fadi Manuel, PharmD, MPH1, Rachel M. Black, PharmD, MS1, Tyler Reinsch, PharmD2, Jiawei Chen, PharmD1, Danny Yeh, PhD1
1AESARA Inc., Chapel Hill, NC, USA, 2Arysana Inc., Chapel Hill, NC, USA
OBJECTIVES: Artificial intelligence (AI) has the potential to expedite the evidence synthesis process for health economics and outcomes researchers. This study aims to provide an overview of published evidence on the use of AI/large language model (LLM) tools for the synthesis of clinical literature.
METHODS: A literature review was conducted in EMBASE for articles published since 2022 that describe the performance of LLM tools in clinical literature synthesis. Additional articles were identified through citation searching and supplemental desktop research on arXiv. Information on the LLM tool utilized, the type of evidence being synthesized, and the method for evaluating the tool’s performance was extracted.
RESULTS: Overall, 8 studies were identified, employing a total of 15 AI tools. Multiple studies used GPT-4 (n=5), GPT-3.5 (n=4), Gemini (n=2), and Claude 2 (n=2). Most studies (n=6) assessed evidence summarization; 2 assessed extraction of pre-specified data. The literature being synthesized included systematic reviews (n=3), randomized controlled trials (n=3), clinical trial reports (n=1), and review articles (n=1). Three studies used automatic evaluation metrics such as ROUGE, BLEU, and METEOR, while the remaining studies developed their own assessments. Assessments focused on accuracy, comprehensiveness, and missing data. Human evaluations of accuracy (n=4) reported the following ranges: Claude 2, 72%-96%; GPT-4, 69%-89%; and Gemini, 45%-76%. One study comparing open-source and proprietary models found the proprietary models superior, with higher comprehensiveness scores.
CONCLUSIONS: This review identified GPT-4 as the most frequently tested LLM. Claude 2 demonstrated the highest accuracy, though this was observed only in studies assessing extraction of pre-specified data. A standardized assessment checklist may be useful to appropriately appraise the performance of LLM tools in evidence synthesis.
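For context, ROUGE (like BLEU and METEOR) scores a machine-generated summary by measuring its n-gram overlap with a human-written reference. Below is a minimal sketch of how such a score might be computed with Google's open-source rouge-score package; this is illustrative only and not drawn from the reviewed studies, and the example texts are hypothetical.

```python
# Minimal sketch: scoring an LLM-generated summary against a human
# reference with ROUGE, using the rouge-score package
# (pip install rouge-score). Example texts are hypothetical.
from rouge_score import rouge_scorer

reference = (
    "The trial randomized 500 patients and found a significant "
    "reduction in HbA1c at 24 weeks."
)
llm_summary = (
    "500 patients were randomized; HbA1c fell significantly "
    "by week 24."
)

# ROUGE-1 counts unigram overlap; ROUGE-L uses the longest common
# subsequence between the candidate and the reference.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference, llm_summary)

for metric, result in scores.items():
    print(f"{metric}: precision={result.precision:.2f}, "
          f"recall={result.recall:.2f}, f1={result.fmeasure:.2f}")
```

Because ROUGE rewards surface overlap rather than factual fidelity, several of the reviewed studies supplemented or replaced it with human ratings of accuracy, comprehensiveness, and missing data.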

Conference/Value in Health Info

2025-05, ISPOR 2025, Montréal, Quebec, CA

Value in Health, Volume 28, Issue S1

Code

SA20

Topic

Study Approaches

Topic Subcategory

Literature Review & Synthesis

Disease

No Additional Disease & Conditions/Specialized Treatment Areas
