Stability of a Large Language Model for Data Extraction in Systematic Literature Reviews

Author(s)

Aiswarya Shree, MSc1, Mariana Farraia, MSc2, Carolina Casañas i Comabella, PhD3, Allie Cichewicz, MSc4.
1Thermo Fisher Scientific, Bengaluru, India, 2Thermo Fisher Scientific, Ede, Netherlands, 3Thermo Fisher Scientific, London, United Kingdom, 4Thermo Fisher Scientific, Boston, MA, USA.

OBJECTIVES: We previously established the accuracy of a large language model (LLM) to extract data for a systematic literature review (SLR), but observed response variations when prompted on different days. This study aimed to evaluate the reproducibility and reliability of LLM-extracted data when varying time of day or geographic location.
METHODS: Single-shot prompts were developed to extract 29 variables on study design, patient characteristics, and outcomes. The prompts were deployed in two ways: (1) by the same user twice within the same day (reproducibility); and (2) by two different users in different geographic locations (India and the Netherlands) on the same day (reliability). Both tests used the same prompts and publications, with the LLM's creativity (temperature) parameter set to 0. The first response in each set of LLM extractions served as the reference and was compared with the second. Reproducibility and reliability were calculated as the proportion of variables for which the LLM-extracted content was identical between responses. Accuracy was calculated as the proportion of correctly extracted variables.
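For illustration, a minimal Python sketch of how these agreement proportions could be computed, assuming exact string matching between responses; the variable names, example values, and matching rule are hypothetical, as the abstract does not describe the authors' actual comparison logic.

def proportion_identical(reference: dict, comparison: dict) -> float:
    """Reproducibility/reliability: share of variables whose extracted
    content is identical between the reference and second response."""
    matches = sum(
        1 for var, ref_value in reference.items()
        if comparison.get(var) == ref_value
    )
    return matches / len(reference)

def accuracy(extracted: dict, gold: dict) -> float:
    """Accuracy: share of variables whose extracted content matches
    the human-validated (gold-standard) value."""
    correct = sum(
        1 for var, gold_value in gold.items()
        if extracted.get(var) == gold_value
    )
    return correct / len(gold)

# Example: the same prompts run twice (reproducibility test).
# "RCT" vs. "randomized controlled trial" illustrates the syntax
# differences noted in the results for text fields.
run_1 = {"sample_size": "120", "design": "randomized controlled trial"}
run_2 = {"sample_size": "120", "design": "RCT"}
print(proportion_identical(run_1, run_2))  # 0.5, i.e., 50% agreement

Under an exact-match rule like this one, a paraphrased text field counts as disagreement even when the underlying information is correct, which is consistent with the results below, where text-field reliability dropped while accuracy held steady.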
RESULTS: Accuracy was slightly higher for the same user (79.1%-98.5%) than for different users (74.6%-97.7%). Overall, reproducibility ranged from 80% to 100%, and reliability from 65.8% to 95.5%. Reproducibility for patient and outcomes variables (including text and numeric fields) ranged from 88.5% to 100% and 72.7% to 100%, respectively, and reliability from 62.5% to 100% and 63.7% to 100%, respectively. Study design variables (text fields) showed lower reliability (50%-90%), but accuracy remained consistent because the LLM captured the same underlying information with slightly different syntax. In contrast, accuracy was lower for numeric fields, where differences between extractions were genuine errors rather than rephrasing.
CONCLUSIONS: When using LLMs for data extraction, reproducibility is high, but reliability can be affected by user interaction, particularly for text fields. AI-driven extractions still require human validation and transparent reporting of AI-assisted methods to ensure that results are contextualized and scientific rigor is maintained.

Conference/Value in Health Info

2025-05, ISPOR 2025, Montréal, Quebec, Canada

Value in Health, Volume 28, Issue S1

Code

MSR105

Topic

Methodological & Statistical Research

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics

Disease

No Additional Disease & Conditions/Specialized Treatment Areas
