EVALUATING THE PERFORMANCE OF A GENERATIVE ARTIFICIAL INTELLIGENCE CHATBOT FOR ASSESSING SYSTEMATIC LITERATURE REVIEWS USING THE PREFERRED REPORTING ITEMS FOR SYSTEMATIC REVIEWS AND META-ANALYSES 2020 CHECKLIST

Author(s)

Maria Arregui, PhD1, Evelyn Gomez Espinosa, BSc, PhD2, Erika Wissinger, PhD3, Maria Koufopoulou, MSc2;
1Cencora, Hannover, Germany, 2Cencora, London, United Kingdom, 3Cencora, Conshohocken, PA, USA
OBJECTIVES: Systematic literature reviews (SLRs) are essential for evidence-based decision-making in health economics and outcomes research. The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2020 checklist provides a standard for transparent SLR reporting, but manual compliance assessment is time-consuming. This study evaluates how well a generative artificial intelligence (genAI) chatbot performs in automating these compliance checks.
METHODS: Six published SLRs were evaluated using a genAI chatbot with customized prompts based on the 27-item PRISMA 2020 checklist (42 questions in total). All data entered into the chatbot remain within Cencora and are neither shared externally nor used to train non-Cencora AI models. Human reviewers independently assessed the same SLRs, with a subset of their evaluations validated for accuracy. Both the genAI chatbot and the human reviewers determined whether each checklist item was fulfilled and identified the supporting text. Concordance was categorized as full, partial, or none, depending on agreement on both fulfillment status and supporting text.
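The concordance rule described above could be sketched as follows. This is a hypothetical illustration only: the field names (`fulfilled`, `supporting_text`) and the exact text-matching logic are assumptions, not the authors' actual scoring procedure.

```python
def categorize_concordance(chatbot: dict, human: dict) -> str:
    """Categorize agreement between a chatbot response and a human review
    for one checklist question, per the full/partial/none scheme.

    NOTE: illustrative sketch -- field names and the normalization used for
    comparing supporting text are assumptions, not the study's method.
    """
    same_status = chatbot["fulfilled"] == human["fulfilled"]
    same_text = (
        chatbot["supporting_text"].strip().lower()
        == human["supporting_text"].strip().lower()
    )
    if same_status and same_text:
        return "full"      # agree on fulfillment status and supporting text
    if same_status or same_text:
        return "partial"   # agree on one dimension only
    return "none"          # disagree on both
```

Under this sketch, two assessments that agree on fulfillment but cite different supporting passages would be scored as partial concordance.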
RESULTS: The genAI chatbot achieved 93% full agreement with human reviewers across PRISMA checklist items for the 6 SLRs. For the Title, Abstract, Introduction, and Other Information checklist sections, the genAI chatbot and human reviewers provided consistent responses to every question. For the Methods section (17 questions), there was full agreement on 12, with disagreements primarily related to synthesis methods (specifically data preparation and tabulation) in at least two SLRs. In the Results section (11 questions), there was complete agreement on 6 items, and the genAI chatbot aligned with human reviewers for 5 of the 6 SLRs on the remaining items. The most frequent disagreements in the Discussion section involved review process limitations.
CONCLUSIONS: Overall, the genAI chatbot closely matched human assessments of PRISMA 2020 checklist adherence, indicating strong potential for automating SLR evaluations. However, some challenges persist, underscoring the continued need for human oversight to ensure reliable reporting.

Conference/Value in Health Info

2026-05, ISPOR 2026, Philadelphia, PA, USA

Value in Health, Volume 29, Issue S6

Code

SA60

Topic

Study Approaches

Topic Subcategory

Literature Review & Synthesis

Disease

No Additional Disease & Conditions/Specialized Treatment Areas
