Performance of Large Language Model Clinical Data Extraction by Data Domain: A Rapid Systematic Review
Author(s)
Judith Peatman, MSc, Shona Lang, PhD, Emily Hardy, MBiol.
Petauri Evidence, Bicester, United Kingdom.
OBJECTIVES: To compare the reported accuracy of large language model (LLM) data extraction for clinical publications according to data domain.
METHODS: A rapid systematic review was conducted to identify records reporting data extraction of clinical publications using an LLM. The LLM used to perform data extraction was mapped for all included records, after which only records reporting quantitative performance metrics by data domain were considered for further analysis. The data domains of interest were based on those commonly extracted from clinical publications: study characteristics/design, patient characteristics, intervention characteristics, and study outcomes.
RESULTS: A total of 31 records were included and mapped to identify the LLM used for data extraction. GPT-4 was the most commonly reported LLM tool (n=25 records). Of the 31 included records, 15 reported the performance of LLM data extraction by data domain. There was a large degree of heterogeneity between records in the reporting metrics for LLM performance, the reference standard, and the definition of extraction failure. Accuracy (or decision match percentage) was the most common LLM performance reporting metric, alongside precision, recall, F1 score, and others. Reported accuracy varied greatly within each data domain, ranging from 17.0% to 100.0% for study characteristics/design; 9.1% to 100.0% for patient characteristics; 36.0% to 100.0% for intervention characteristics; and 30.0% to 100.0% for study outcomes. Factors impacting reported LLM extraction accuracy included data source format, extent of LLM prompt engineering, publication language, and level of human involvement.
CONCLUSIONS: Utilisation of LLMs for data extraction is a rapidly developing area in health economics and outcomes research. The findings from this rapid review indicate that LLM extraction can be suitable across different data domains; however, careful and tailored application is required to ensure sufficient accuracy.
Conference/Value in Health Info
2025-11, ISPOR Europe 2025, Glasgow, Scotland
Value in Health, Volume 28, Issue S2
Code
SA75
Topic
Methodological & Statistical Research, Study Approaches
Topic Subcategory
Literature Review & Synthesis
Disease
No Additional Disease & Conditions/Specialized Treatment Areas