Exploring the Ability of Generative AI to Interpret and Report Comprehensive NMA Results: A Step Towards the Automation of NMA Reports

Author(s)

Emma Benbow, MA, PhD¹, Tim Reason, BSc, MSc¹, Cheryl Jayne Jones, BSc, MSc, PhD¹, Sven L. Klijn, MSc², Bill Malcolm, MSc³.
¹Estima Scientific, London, United Kingdom, ²Bristol Myers Squibb, Princeton, NJ, USA, ³Bristol Myers Squibb, Middlesex, United Kingdom.

OBJECTIVES: A previous study developed a system to automate network meta-analyses (NMAs) using large language models (LLMs). To maximize the benefits of automation, it is essential to automate the production of NMA reports, which include assessment and interpretation of results. This study aims to determine whether LLMs can assess and interpret comprehensive NMA outputs and accurately report these.
METHODS: This study used Claude 3.5 Sonnet (v2) to generate two NMA reports, including introduction, methods, results, and discussion sections. The reports were based on replicated results for overall survival (OS) and progression-free survival (PFS) from a previously published NMA in non-small cell lung cancer (Aggarwal, 2023) and change from baseline in HbA1c, and proportion of patients achieving HbA1c < 7% in diabetes (Witkowski, 2018). The results were provided as csv files and image files containing e.g., forest plots, rank-probability plots. The LLM was prompted to generate text interpreting comparative effectiveness results and associated statistical diagnostics. Prompting strategies included few-shot, pdf-interaction, and leveraging a “hybrid” approach that sent both figures and data to the LLM to maximise understanding of the evidence. The reports were qualitatively assessed by three expert NMA statisticians.
RESULTS: The three experts concurred that the LLM accurately interpreted the results, assessed rank-probability plots and provided an appropriate ordering of efficacy. The evaluation of heterogeneity, inconsistency, and convergence by the LLM was precise, providing a thorough analysis of the statistical evidence across all segments of the network.
CONCLUSIONS: The results show LLMs can correctly assess and interpret NMA results and the standardised associated statistical diagnostics and also automatically generate reports evaluating multiple outcomes. Scaling the approach to report results from alternative PICO selections, as required for health technology assessment (HTA) ready NMA reports and the Joint Clinical Assessment (JCA), could lead to substantial efficiency gains.

Conference/Value in Health Info

2025-11, ISPOR Europe 2025, Glasgow, Scotland

Value in Health, Volume 28, Issue S2

Code

P53

Topic

Health Technology Assessment

Topic Subcategory

Systems & Structure

Disease

Diabetes/Endocrine/Metabolic Disorders (including obesity), Oncology

Presentation (CTI)