Innovations in Automated Survival Curve Selection and Reporting of Survival Analyses Through Generative AI

Author(s)

Wu Y1, Klijn S2, Teitsson S3, Malcolm B4, Jones C5, Rawlinson W6
1Estima Scientific Ltd, South Ruislip, LON, UK, 2Bristol Myers Squibb, Utrecht, ZH, Netherlands, 3Bristol Myers Squibb, Uxbridge, LON, UK, 4Bristol Myers Squibb, Middlesex, LON, UK, 5Estima Scientific Ltd, London, LON, UK, 6Estima Scientific Ltd, London, UK

OBJECTIVES: Survival analyses are a core part of many HTA submissions where extrapolation of time-to-event clinical endpoints is required. The purpose of this research was to explore automation of survival analysis reporting using Generative Artificial Intelligence (GenAI). Following published best practices for curve selection, GenAI was leveraged to recommend an appropriate extrapolation curve and provide justifications.

METHODS: Data were taken from a previously accepted HTA survival analysis report (NICE TA817) for patients treated for resectable urothelial cancer (PD-L1 ≥1%), with a minimum follow-up of 11-months. GPT-4o was provided with survival analysis outputs, including statistical tests, survival probability estimates, and figures, to assess proportional hazards (PH) and goodness-of-fit. Prompted with relevant content, GPT-4o was asked to; 1) assess PH, 2) select suitable extrapolation models (dependent vs. independent), 3) consider external data, then 4) select an appropriate curve. To validate accuracy, GPT-4o’s results were compared with results in the original report, the report published by NICE, and assessed against the opinion of three expert health economists.

RESULTS: GPT-4o’s interpretation of log-cumulative hazard plots, Schoenfeld residual plots, and Grambsch-Therneau test results aligned with interpretations made by the three health economic experts, the human produced report, and the NICE committee. GPT-4o concluded that the PH assumption might be violated, therefore suggesting consideration of both dependent and independent parametric models. Based on a comprehensive analysis of goodness-of-fit, visual fit, and long-term external survival data, GPT-4o recommended the same survival curves as those selected in the original report and by the NICE Committee. Notably, 13/13 statements or decisions made by GPT-4o were consistent with the original report or expert opinion.

CONCLUSIONS: The results suggest automation of curve selection and reporting of survival analyses is possible. However, more research is required to determine generalizability with differing levels of data maturity and to test the performance of alternative GenAI models.