Variability and Improvements of Answers Generated with Different Versions of Large Language Models

Author(s)

Benbow E¹, Reason T², Malcolm B³, Klijn S⁴, Hill N⁵, Teitsson S⁶
¹Estima Scientific Ltd, Ruislip, UK, ²Estima Scientific Ltd, South Ruislip, LON, UK, ³Bristol Myers Squibb, Middlesex, LON, UK, ⁴Bristol-Myers Squibb, Utrecht, ZH, Netherlands, ⁵Bristol Myers Squibb Company, Princeton, NJ, USA, ⁶Bristol Myers Squibb, Uxbridge, UK

Presentation Documents

ISPOR2024_Benbow_MSR68_Poster_Variability138649.pdf

OBJECTIVES: Since OpenAI’s release of the GPT-3.5 large language model (LLM) in March 2022, subsequent updates have introduced new and enhanced models. How variation in responses across these models affects the accuracy of automated network meta-analyses (NMAs) remains uncertain. We aimed to evaluate the variability and improvement in answers generated by different LLMs during data extraction for an NMA of overall survival in patients with non-small cell lung cancer.

METHODS: We used a range of LLMs, accessed through a Python API, to extract survival data from the publications of five studies. To assess variability and accuracy, we repeated the extraction 20 times per model (20 iterations of the Python script) and compared the results with a reference data extraction conducted and checked by systematic literature review and NMA experts.
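The repeat-and-compare procedure described above can be sketched roughly as follows. This is a minimal illustration, not the authors' script: the function names, the item keys, and the placeholder values are all hypothetical, exact string matching is an assumed scoring rule, and the LLM call is replaced by a stub so the example runs without any API access.

```python
from typing import Callable, Dict, List

# One extraction result: item identifier -> extracted value (as text).
DataItems = Dict[str, str]


def count_correct(extracted: DataItems, reference: DataItems) -> int:
    """Count reference items whose extracted value matches exactly.

    Exact string equality is an assumption made for this sketch; the
    abstract does not state how a match was defined.
    """
    return sum(1 for key, value in reference.items() if extracted.get(key) == value)


def run_iterations(
    extract: Callable[[], DataItems],
    reference: DataItems,
    n_iterations: int = 20,
) -> List[int]:
    """Run the extraction repeatedly and score each run against the reference."""
    return [count_correct(extract(), reference) for _ in range(n_iterations)]


# Placeholder reference extraction with three made-up items (the study used 36).
reference = {"item_1": "A", "item_2": "B", "item_3": "C"}


def stub_extract() -> DataItems:
    """Hypothetical stand-in for one LLM extraction call; gets item_3 wrong."""
    return {"item_1": "A", "item_2": "B", "item_3": "X"}


counts = run_iterations(stub_extract, reference, n_iterations=20)
print(counts)
```

In a real run, `stub_extract` would be replaced by a function that sends the publication text to the chosen model and parses its reply into the same key-to-value mapping.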

RESULTS: Each iteration required extraction of 36 data items. For the worst performing model (GPT-3.5 Turbo), the number of correctly extracted items per iteration ranged from 0 to 36, with an overall mean accuracy of 57.4%. This significantly improved for GPT-4 Turbo Beta, which correctly extracted between 30 and 36 items per iteration, averaging 98.8%. The best performing model (GPT-4) correctly extracted between 34 and 36 items per iteration, with an overall mean of 99.4%.
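The range and mean figures above can be derived from per-iteration counts of correct items. The sketch below assumes that the overall mean is the pooled share of correct items across all iterations (with 20 iterations of 36 items, that is a denominator of 720 per model); the abstract does not spell this out. The function name and the example counts are hypothetical and are not the study's data.

```python
from typing import Dict, List


def summarise(counts: List[int], items_per_iteration: int = 36) -> Dict[str, float]:
    """Summarise per-iteration correct counts as a range and a mean accuracy.

    Because every iteration has the same number of items, the pooled
    percentage equals the mean of the per-iteration percentages.
    """
    if not counts:
        raise ValueError("at least one iteration is required")
    total_items = items_per_iteration * len(counts)
    return {
        "min_correct": min(counts),
        "max_correct": max(counts),
        "mean_accuracy_pct": 100.0 * sum(counts) / total_items,
    }


# Made-up counts for four iterations, purely for illustration.
example_counts = [36, 30, 33, 35]
summary = summarise(example_counts)
print(summary)
```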

CONCLUSIONS: Successive GPT models have shown notable improvements in the accuracy with which they extract the data required for an NMA. Whilst GPT-4 demonstrated superior performance in this limited test, it was not significantly better than GPT-4 Turbo Beta. The release of a production version may further improve GPT-4 Turbo's performance, potentially beyond that of GPT-4. GPT-4 Turbo also holds promise for more intricate data extraction tasks, given its significantly larger token limit.

Conference/Value in Health Info

2024-05, ISPOR 2024, Atlanta, GA, USA

Value in Health, Volume 27, Issue 6, S1 (June 2024)

Code

MSR68

Topic

Methodological & Statistical Research, Study Approaches

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics, Meta-Analysis & Indirect Comparisons

Disease

No Additional Disease & Conditions/Specialized Treatment Areas, Oncology
