Variability and Improvements of Answers Generated with Different Versions of Large Language Models


Benbow E1, Reason T2, Malcolm B3, Klijn S4, Hill N5, Teitsson S6
1Estima Scientific Ltd, Ruislip, UK, 2Estima Scientific Ltd, South Ruislip, LON, UK, 3Bristol Myers Squibb, Middlesex, LON, UK, 4Bristol-Myers Squibb, Utrecht, ZH, Netherlands, 5Bristol Myers Squibb Company, Princeton, NJ, USA, 6Bristol Myers Squibb, Uxbridge, UK

OBJECTIVES: Since OpenAI’s release of the GPT-3.5 large language model (LLM) in late 2022, subsequent updates have introduced new and enhanced models. The impact of response variation among these models on the accuracy of automated network meta-analyses (NMAs) remains uncertain. Our objective was to evaluate the variability and improvements in answers generated by different LLMs during data extraction for an NMA of overall survival in patients with non-small cell lung cancer.

METHODS: Using a range of LLMs accessed via a Python API, we extracted survival data from the publications of five studies. We investigated the variability and accuracy of the extraction by running it repeatedly (20 iterations of the Python script per model) and comparing the results with the data extraction conducted, and checked, by systematic literature review and NMA experts.
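The abstract does not include the extraction script itself; a minimal sketch of the repeated-extraction approach might look like the following, where `ask_llm`, the prompt wording, the JSON keys, and the reference values are all illustrative assumptions rather than the authors' actual implementation:

```python
import json
from typing import Callable

# Hypothetical expert-checked reference values for one study (illustrative only).
REFERENCE = {"median_os_months": 17.1, "hazard_ratio": 0.68}

def score_iteration(extracted: dict, reference: dict) -> int:
    """Count how many reference data items were extracted correctly."""
    return sum(1 for key, value in reference.items() if extracted.get(key) == value)

def run_extractions(ask_llm: Callable[[str], str], publication_text: str,
                    n_iterations: int = 20) -> list:
    """Repeat the extraction n_iterations times and score each run against REFERENCE.

    `ask_llm` wraps whichever model API is used (e.g. a call through the OpenAI
    Python client); it is assumed to return the model's answer as a JSON string.
    """
    prompt = ("Extract the median overall survival in months and the hazard ratio "
              "as JSON with keys median_os_months and hazard_ratio.\n\n"
              + publication_text)
    scores = []
    for _ in range(n_iterations):
        extracted = json.loads(ask_llm(prompt))
        scores.append(score_iteration(extracted, REFERENCE))
    return scores
```

Keeping the model call behind a plain callable makes it straightforward to swap models (GPT-3.5 Turbo, GPT-4, GPT-4 Turbo Beta) while reusing the same scoring loop.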

RESULTS: Each iteration required extraction of 36 data items. For the worst performing model (GPT-3.5 Turbo), correct extraction per iteration ranged from 0 to 36 items, with an overall mean accuracy of 57.4%. This improved markedly for GPT-4 Turbo Beta, which correctly extracted between 30 and 36 items per iteration, averaging 98.8%. The best performing model (GPT-4) correctly extracted between 34 and 36 items per iteration, with an overall mean of 99.4%.
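The reported summary statistics (per-iteration range and overall mean percentage correct) can be reproduced from the per-iteration counts of correctly extracted items; a small sketch with made-up counts for a well-performing model:

```python
from statistics import mean

N_ITEMS = 36  # data items to extract per iteration, as in the abstract

def summarise(correct_counts: list) -> tuple:
    """Return (min, max, overall mean % correct) across all iterations."""
    pct = 100 * mean(correct_counts) / N_ITEMS
    return min(correct_counts), max(correct_counts), round(pct, 1)

# Illustrative (made-up) per-iteration counts over 20 iterations:
counts = [36, 36, 35, 36, 34, 36, 36, 35, 36, 36,
          36, 36, 36, 35, 36, 36, 36, 36, 35, 36]
print(summarise(counts))
```

The same summary applied to each model's 20 iterations yields the ranges and means reported above.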

CONCLUSIONS: GPT models have exhibited notable enhancements in accurately extracting required NMA data. Whilst GPT-4 demonstrated superior performance in this limited test, it was not significantly better than GPT-4 Turbo Beta. A production release of GPT-4 Turbo may further boost its performance, potentially surpassing that of GPT-4. GPT-4 Turbo also holds promise for more intricate data extraction tasks, given its significantly larger token limit.

Conference/Value in Health Info

2024-05, ISPOR 2024, Atlanta, GA, USA

Value in Health, Volume 27, Issue 6, S1 (June 2024)




Methodological & Statistical Research, Study Approaches

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics, Meta-Analysis & Indirect Comparisons


No Additional Disease & Conditions/Specialized Treatment Areas, Oncology

