Stability of a Large Language Model for Data Extraction in Systematic Literature Reviews
Author(s)
Aiswarya Shree, MSc1, Mariana Farraia, MSc2, Carolina Casañas i Comabella, PhD3, Allie Cichewicz, MSc4.
1Thermo Fisher Scientific, Bengaluru, India, 2Thermo Fisher Scientific, Ede, Netherlands, 3Thermo Fisher Scientific, London, United Kingdom, 4Thermo Fisher Scientific, Boston, MA, USA.
OBJECTIVES: We previously established the accuracy of a large language model (LLM) to extract data for a systematic literature review (SLR), but observed response variations when prompted on different days. This study aimed to evaluate the reproducibility and reliability of LLM-extracted data when varying time of day or geographic location.
METHODS: Single-shot prompts were developed to extract 29 variables covering study design, patient characteristics, and outcomes. The prompts were deployed in two ways: (1) by the same user twice within the same day (reproducibility); and (2) by two users in different geographic locations (India and the Netherlands) on the same day (reliability). Both tests used the same prompts and publications, with LLM creativity set to 0. The first response in each set of LLM extractions served as the reference and was compared with the second. Reproducibility and reliability were calculated as the proportion of variables for which the LLM-extracted content was identical between responses; accuracy was calculated as the proportion of correctly extracted variables.
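The agreement metrics described above reduce to simple proportions over the 29 extracted variables. The following minimal Python sketch illustrates one way such proportions could be computed; the function names, the exact-match comparison, and the example variables are illustrative assumptions, not the authors' implementation (the study may have compared text fields in a more nuanced way).

```python
# Illustrative sketch (not the authors' code): reproducibility/reliability
# as the proportion of variables whose extracted content is identical
# between two LLM responses, and accuracy as the proportion of variables
# matching a human-curated gold standard. Exact string matching is an
# assumption made here for simplicity.

def agreement(reference: dict, comparison: dict) -> float:
    """Proportion of variables identical between the reference (first)
    response and a second response."""
    matches = sum(
        1 for var, value in reference.items() if comparison.get(var) == value
    )
    return matches / len(reference)

def accuracy(extracted: dict, gold: dict) -> float:
    """Proportion of variables extracted correctly against a gold standard."""
    correct = sum(
        1 for var, value in gold.items() if extracted.get(var) == value
    )
    return correct / len(gold)

# Hypothetical example with three of the 29 variables (values invented):
run1 = {"study_design": "RCT", "n_patients": "120", "primary_outcome": "OS"}
run2 = {"study_design": "RCT", "n_patients": "120",
        "primary_outcome": "overall survival"}
gold = {"study_design": "RCT", "n_patients": "120", "primary_outcome": "OS"}

print(f"Reproducibility: {agreement(run1, run2):.1%}")  # 66.7%
print(f"Accuracy (run 1): {accuracy(run1, gold):.1%}")  # 100.0%
```

Note that exact matching penalizes purely syntactic variation ("OS" vs. "overall survival"), which is precisely the effect reported for text fields in the results: agreement between responses can drop even when accuracy is unaffected.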
RESULTS: Accuracy was slightly higher for the same user (79.1%-98.5%) than for different users (74.6%-97.7%). Overall, reproducibility ranged from 80% to 100%, and reliability from 65.8% to 95.5%. Reproducibility for patient and outcomes variables (including text and numeric fields) ranged from 88.5%-100% and 72.7%-100%, respectively, and reliability from 62.5%-100% and 63.7%-100%, respectively. Study design variables (text fields) showed lower reliability (50%-90%), but accuracy remained consistent because the LLM captured the same underlying information with slightly different syntax. In contrast, accuracy was lower for numeric fields, where differences between extractions reflected genuine errors.
CONCLUSIONS: When using LLMs for data extraction, reproducibility is high, but reliability can be affected by user interaction, particularly for text fields. AI-driven extractions still require human validation and transparent reporting of AI-assisted methods to ensure that results are contextualized and scientific rigor is maintained.
Conference/Value in Health Info
2025-05, ISPOR 2025, Montréal, Quebec, CA
Value in Health, Volume 28, Issue S1
Code
MSR105
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas