Stability of a Large Language Model for Data Extraction in Systematic Literature Reviews
Author(s)
Aiswarya Shree, MSc1, Mariana Farraia, MSc2, Carolina Casañas i Comabella, PhD3, Allie Cichewicz, MSc4.
1Thermo Fisher Scientific, Bengaluru, India, 2Thermo Fisher Scientific, Ede, Netherlands, 3Thermo Fisher Scientific, London, United Kingdom, 4Thermo Fisher Scientific, Boston, MA, USA.
OBJECTIVES: We previously established the accuracy of a large language model (LLM) to extract data for a systematic literature review (SLR), but observed response variations when prompted on different days. This study aimed to evaluate the reproducibility and reliability of LLM-extracted data when varying time of day or geographic location.
METHODS: Single-shot prompts were developed to extract 29 variables covering study design, patient characteristics, and outcomes. The prompts were deployed in two ways: (1) by the same user twice within the same day (reproducibility); and (2) by two users in different geographic locations (India and the Netherlands) on the same day (reliability). Both tests used the same prompts and publications, with LLM creativity set to 0. The first response in each set of LLM extractions served as the reference and was compared with the second. Reproducibility and reliability were calculated as the proportion of variables for which the LLM-extracted content was identical between responses; accuracy was calculated as the proportion of correctly extracted variables.
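The agreement metrics described above reduce to simple proportions over the 29 extracted variables. The following minimal Python sketch illustrates one way such proportions could be computed; the function names, the exact-match comparison, and the example variables are illustrative assumptions, not the authors' implementation (the study may have compared text fields in a more nuanced way).

```python
# Illustrative sketch (not the authors' code): reproducibility/reliability
# as the proportion of variables whose extracted content is identical
# between two LLM responses, and accuracy as the proportion of variables
# matching a human-curated gold standard. Exact string matching is an
# assumption made here for simplicity.

def agreement(reference: dict, comparison: dict) -> float:
    """Proportion of variables identical between the reference (first)
    response and a second response."""
    matches = sum(
        1 for var, value in reference.items() if comparison.get(var) == value
    )
    return matches / len(reference)

def accuracy(extracted: dict, gold: dict) -> float:
    """Proportion of variables extracted correctly against a gold standard."""
    correct = sum(
        1 for var, value in gold.items() if extracted.get(var) == value
    )
    return correct / len(gold)

# Hypothetical example with three of the 29 variables (values invented):
run1 = {"study_design": "RCT", "n_patients": "120", "primary_outcome": "OS"}
run2 = {"study_design": "RCT", "n_patients": "120",
        "primary_outcome": "overall survival"}
gold = {"study_design": "RCT", "n_patients": "120", "primary_outcome": "OS"}

print(f"Reproducibility: {agreement(run1, run2):.1%}")  # 66.7%
print(f"Accuracy (run 1): {accuracy(run1, gold):.1%}")  # 100.0%
```

Note that exact matching penalizes purely syntactic variation ("OS" vs. "overall survival"), which is precisely the effect reported for text fields in the results: agreement between responses can drop even when accuracy is unaffected.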
RESULTS: Accuracy was slightly higher for the same user (79.1%-98.5%) than for different users (74.6%-97.7%). Overall, reproducibility ranged from 80% to 100%, and reliability from 65.8% to 95.5%. Reproducibility for patient and outcomes variables (including text and numeric fields) ranged from 88.5%-100% and 72.7%-100%, respectively, and reliability from 62.5%-100% and 63.7%-100%, respectively. Study design variables (text fields) showed lower reliability (50%-90%), but accuracy remained consistent because the LLM captured the same underlying information with slightly different syntax. In contrast, accuracy was lower for numeric fields, where differences between extractions reflected genuine errors.
CONCLUSIONS: When using LLMs for data extraction, reproducibility is high, but reliability can be affected by user interaction, particularly for text fields. AI-driven extractions still require human validation and transparent reporting of AI-assisted methods to ensure that results are contextualized and scientific rigor is maintained.
Conference/Value in Health Info
2025-05, ISPOR 2025, Montréal, Quebec, CA
Value in Health, Volume 28, Issue S1
Code
MSR105
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas