FROM GPT-4 TO GPT-5.2: A COMPARATIVE EVALUATION OF LARGE LANGUAGE MODELS FOR EXTRACTING CLINICAL REAL-WORLD EVIDENCE (RWE) DATA

Author(s)

Mariana Farraia, PhD1, Anuja Pandey, MD2, Kassandra Schaible3, Caroline von Wilamowitz-Moellendorff, PhD4;
1Thermo Fisher Scientific, Ede, Netherlands, 2Thermo Fisher Scientific, London, United Kingdom, 3Thermo Fisher Scientific, Pittsburgh, PA, USA, 4Thermo Fisher Scientific, London, United Kingdom
OBJECTIVES: Large language models (LLMs) are increasingly being explored as tools to support automated data extraction in evidence synthesis. However, as these artificial intelligence (AI) models become more advanced, it is critical to monitor their performance on such tasks and to confirm that newer versions genuinely improve on older ones. This study aimed to qualitatively compare the older GPT-4 system with the newer GPT-5.2 version for structured extraction of RWE data within a standardized extraction framework.
METHODS: A previously presented structured data extraction framework for RWE in non-small cell lung cancer, originally implemented with a proprietary GPT-4-based model, was replicated using GPT-5.2. Identical prompts, extraction templates, and source publications were applied across both models. Extracted data elements included study characteristics, population descriptors, interventions, comparators, outcomes, and subgroup information. Outputs were reviewed and compared between model versions, focusing on completeness and accuracy. Differences were categorized by data type and reporting complexity.
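The core of such a comparison is holding everything constant except the model version. A minimal sketch of that idea is shown below; the template fields, request shape, and comparison step are illustrative assumptions, not the authors' proprietary pipeline.

```python
# Hypothetical sketch: one extraction template applied identically to two
# model versions, with a field-by-field diff of their outputs.
EXTRACTION_TEMPLATE = [
    "study_characteristics",
    "population",
    "interventions",
    "comparators",
    "outcomes",
    "subgroups",
]

def build_request(model: str, publication_text: str) -> dict:
    """Build an extraction request; only the model name varies between runs."""
    prompt = (
        "Extract the following fields from the publication as JSON: "
        + ", ".join(EXTRACTION_TEMPLATE)
    )
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": prompt},
            {"role": "user", "content": publication_text},
        ],
    }

def diverging_fields(output_a: dict, output_b: dict) -> list:
    """Return the template fields where two models' extractions disagree."""
    return [f for f in EXTRACTION_TEMPLATE if output_a.get(f) != output_b.get(f)]
```

Keeping the prompt and template fixed (only the `model` field changes) means any divergence flagged by `diverging_fields` can be attributed to the model version rather than to the workflow.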
RESULTS: GPT-5.2 generally extracted study data accurately and performed well when compared with GPT-4. Core study characteristics and high-level outcomes were consistently identified across both models, with similar structuring and minimal human correction required. Divergences between models were observed primarily for complex or ambiguously reported data elements. Overall, this comparison suggested GPT-5.2 performed at least as well as GPT-4, with remaining differences highlighting areas where automated extraction remains methodologically challenging.
CONCLUSIONS: This research suggests that GPT-5.2 can produce data extractions broadly comparable to those generated by GPT-4 within a standardized workflow, and that gradual improvements in RWE data extraction are possible as AI models become more advanced. Ongoing methodological evaluation remains essential to understand the improvements and limitations of each new GPT version, ensure reproducibility, and define appropriate roles for human oversight as automated extraction tools continue to evolve.

Conference/Value in Health Info

2026-05, ISPOR 2026, Philadelphia, PA, USA

Value in Health, Volume 29, Issue S6

Code

MSR33

Topic

Methodological & Statistical Research

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics

Disease

No Additional Disease & Conditions/Specialized Treatment Areas
