Expediting Evidence Synthesis: A Review of Recent Research Evaluating Artificial Intelligence's Performance in Evidence Synthesis and Summarization of Clinical Literature
Author(s)
Fadi Manuel, PharmD, MPH1, Rachel M. Black, PharmD, MS1, Tyler Reinsch, PharmD2, Jiawei Chen, PharmD1, Danny Yeh, PhD1
1AESARA Inc., Chapel Hill, NC, USA, 2Arysana Inc., Chapel Hill, NC, USA
OBJECTIVES: Artificial intelligence (AI) has the potential to expedite the evidence synthesis process for health economics and outcomes researchers. This study aims to provide an overview of published evidence on the use of AI/large language model (LLM) tools for the synthesis of clinical literature.
METHODS: A literature review was conducted in EMBASE for articles published since 2022 that describe the performance of LLM tools in clinical literature synthesis. Additional articles were identified through citation searching and supplemental desktop research on arXiv. Information on the LLM tool utilized, the type of evidence being synthesized, and the method for evaluating the tool’s performance was extracted.
RESULTS: Overall, 8 studies were identified, utilizing a total of 15 AI tools. Multiple studies used GPT-4 (n=5), GPT-3.5 (n=4), Gemini (n=2), and Claude 2 (n=2). Most of the studies (n=6) assessed evidence summarization; 2 studies assessed the extraction of pre-specified data. The types of literature being synthesized included systematic reviews (n=3), randomized controlled trials (n=3), clinical trial reports (n=1), and review articles (n=1). Three studies used automated evaluation metrics such as ROUGE, BLEU, and METEOR, while the remaining studies developed their own assessments. Assessments focused on accuracy, comprehensiveness, and missing data. Human evaluations of accuracy (n=4) reported 72%–96% for Claude 2, 69%–89% for GPT-4, and 45%–76% for Gemini. One study comparing open-source and proprietary models found that proprietary models performed better, achieving higher comprehensiveness scores.
CONCLUSIONS: This review identified GPT-4 as the most frequently tested LLM. Claude 2 demonstrated the highest accuracy, although this was observed only in studies assessing pre-specified data extraction. A standardized assessment checklist may be useful for appropriately appraising the performance of LLM tools in evidence synthesis.
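Note: The abstract above reports that three studies used automated metrics such as ROUGE, BLEU, and METEOR but does not describe how they were applied. As a loose illustration only, the sketch below shows how one such metric (ROUGE) could be computed for an LLM-generated summary against a reference summary using the open-source rouge-score Python package; the example texts, variable names, and metric choices are hypothetical and are not drawn from the reviewed studies.

# Illustrative sketch (not from the reviewed studies): scoring an LLM-generated
# summary against a human-written reference with ROUGE. The texts below are
# hypothetical placeholders.
from rouge_score import rouge_scorer

reference_summary = (
    "The trial found a statistically significant reduction in LDL cholesterol "
    "with the intervention compared with placebo."
)
llm_summary = (
    "The intervention significantly lowered LDL cholesterol versus placebo."
)

# ROUGE-1 measures unigram overlap; ROUGE-L measures the longest common subsequence.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference_summary, llm_summary)

for metric, result in scores.items():
    print(f"{metric}: precision={result.precision:.2f}, "
          f"recall={result.recall:.2f}, f1={result.fmeasure:.2f}")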
Conference/Value in Health Info
2025-05, ISPOR 2025, Montréal, Quebec, CA
Value in Health, Volume 28, Issue S1
Code
SA20
Topic
Study Approaches
Topic Subcategory
Literature Review & Synthesis
Disease
No Additional Disease & Conditions/Specialized Treatment Areas