Evaluation of the Use of ChatGPT in Data Extraction for Systematic Literature Reviews: Friend, Foe, or Fiction?

Author(s)

Alison Martin, MSc, MD, Kalliopi Roussi, PhD, Hannah Rice, BSc, Andrea Bertuzzi, PhD.
Crystallise Ltd, Colchester, United Kingdom.
OBJECTIVES: Previous research has found inconsistent benefits from large language models (LLMs) for data extraction within systematic literature reviews. We aimed to evaluate the accuracy and reproducibility of the latest ChatGPT for data extraction.
METHODS: Between December 2024 and January 2025, we developed a prompt for ChatGPT 4.0 to extract the same specific data items on study methodology and baseline characteristics (BC) from a conference abstract, a clinical trial, a case report, and a retrospective observational study. We evaluated the LLM's accuracy against a gold standard derived by a human researcher, and its consistency of output for each data item when run on the same papers three times a day for one week, then once a day for four weeks.
RESULTS: Overall accuracy varied substantially, with no clear trend over time. ChatGPT 4.0 extracted data inconsistently and unpredictably: data were incorrect in 17.5% of instances, missing in 9.7%, fabricated in 1.2%, and unclear in 0.9%. Errors were significantly more frequent for full-text clinical trial reports (32.4% of instances) than for retrospective observational studies (22.9%) (p <0.001). Simple data items such as number of participants and location were correct in 61.8% to 100% and 93.5% to 100% of runs, respectively, but more complex items, such as the full citation, were correct in only 0 to 33.3% of runs. No data item could be relied on to be extracted correctly in every run across all publication types.
CONCLUSIONS: ChatGPT 4.0 could not be relied on for accurate data extraction, but it might save time by providing an initial output for human researchers to validate. Accuracy of outputs will depend on the skill of the prompt author, yet it also varied considerably over time despite use of the same prompt. Studies assessing the accuracy of AI for data extraction should take this temporal variation into account.

Conference/Value in Health Info

2025-11, ISPOR Europe 2025, Glasgow, Scotland

Value in Health, Volume 28, Issue S2

Code

MSR100

Topic

Methodological & Statistical Research, Study Approaches

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics
