Exploring the Ability of Generative AI to Interpret and Report Comprehensive NMA Results: A Step Towards the Automation of NMA Reports
Author(s)
Emma Benbow, MA, PhD1, Tim Reason, BSc, MSc1, Cheryl Jayne Jones, BSc, MSc, PhD1, Sven L. Klijn, MSc2, Bill Malcolm, MSc3.
1Estima Scientific, London, United Kingdom, 2Bristol Myers Squibb, Princeton, NJ, USA, 3Bristol Myers Squibb, Middlesex, United Kingdom.
1Estima Scientific, London, United Kingdom, 2Bristol Myers Squibb, Princeton, NJ, USA, 3Bristol Myers Squibb, Middlesex, United Kingdom.
OBJECTIVES: A previous study developed a system to automate network meta-analyses (NMAs) using large language models (LLMs). To maximize the benefits of automation, it is essential to automate the production of NMA reports, which include assessment and interpretation of results. This study aims to determine whether LLMs can assess and interpret comprehensive NMA outputs and accurately report these.
METHODS: This study used Claude 3.5 Sonnet (v2) to generate two NMA reports, including introduction, methods, results, and discussion sections. The reports were based on replicated results for overall survival (OS) and progression-free survival (PFS) from a previously published NMA in non-small cell lung cancer (Aggarwal, 2023) and change from baseline in HbA1c, and proportion of patients achieving HbA1c < 7% in diabetes (Witkowski, 2018). The results were provided as csv files and image files containing e.g., forest plots, rank-probability plots. The LLM was prompted to generate text interpreting comparative effectiveness results and associated statistical diagnostics. Prompting strategies included few-shot, pdf-interaction, and leveraging a “hybrid” approach that sent both figures and data to the LLM to maximise understanding of the evidence. The reports were qualitatively assessed by three expert NMA statisticians.
RESULTS: The three experts concurred that the LLM accurately interpreted the results, assessed rank-probability plots and provided an appropriate ordering of efficacy. The evaluation of heterogeneity, inconsistency, and convergence by the LLM was precise, providing a thorough analysis of the statistical evidence across all segments of the network.
CONCLUSIONS: The results show LLMs can correctly assess and interpret NMA results and the standardised associated statistical diagnostics and also automatically generate reports evaluating multiple outcomes. Scaling the approach to report results from alternative PICO selections, as required for health technology assessment (HTA) ready NMA reports and the Joint Clinical Assessment (JCA), could lead to substantial efficiency gains.
METHODS: This study used Claude 3.5 Sonnet (v2) to generate two NMA reports, including introduction, methods, results, and discussion sections. The reports were based on replicated results for overall survival (OS) and progression-free survival (PFS) from a previously published NMA in non-small cell lung cancer (Aggarwal, 2023) and change from baseline in HbA1c, and proportion of patients achieving HbA1c < 7% in diabetes (Witkowski, 2018). The results were provided as csv files and image files containing e.g., forest plots, rank-probability plots. The LLM was prompted to generate text interpreting comparative effectiveness results and associated statistical diagnostics. Prompting strategies included few-shot, pdf-interaction, and leveraging a “hybrid” approach that sent both figures and data to the LLM to maximise understanding of the evidence. The reports were qualitatively assessed by three expert NMA statisticians.
RESULTS: The three experts concurred that the LLM accurately interpreted the results, assessed rank-probability plots and provided an appropriate ordering of efficacy. The evaluation of heterogeneity, inconsistency, and convergence by the LLM was precise, providing a thorough analysis of the statistical evidence across all segments of the network.
CONCLUSIONS: The results show LLMs can correctly assess and interpret NMA results and the standardised associated statistical diagnostics and also automatically generate reports evaluating multiple outcomes. Scaling the approach to report results from alternative PICO selections, as required for health technology assessment (HTA) ready NMA reports and the Joint Clinical Assessment (JCA), could lead to substantial efficiency gains.
Conference/Value in Health Info
2025-11, ISPOR Europe 2025, Glasgow, Scotland
Value in Health, Volume 28, Issue S2
Code
P53
Topic
Health Technology Assessment
Topic Subcategory
Systems & Structure
Disease
Diabetes/Endocrine/Metabolic Disorders (including obesity), Oncology