Using a Large Language Model (LLM) for Data Extraction of Studies: Learnings From a Targeted Literature Review (TLR) in Non-Small Cell Lung Cancer (NSCLC)
Author(s)
Mariana Farraia, PhD1, Anuja Pandey, MD2, Eugenia Priedane, MA3, Allie Cichewicz, MSc4, Caroline von Wilamowitz-Moellendorff, PhD2.
1Thermo Fisher Scientific, Ede, Netherlands, 2Thermo Fisher Scientific, London, United Kingdom, 3HEOR EU and New Markets, BeOne Medicines (UK), Ltd., London, United Kingdom, 4Thermo Fisher Scientific, Waltham, MA, USA.
OBJECTIVES: LLMs can accurately extract data from published literature, reducing the human effort required for literature reviews. However, the challenges that evidence synthesis experts face when addressing complex research questions, such as those involving mixed populations and subgroups, are not fully understood. This study aimed to evaluate GPT-4-assisted extraction of clinical outcomes in an NSCLC subpopulation and to highlight the learnings and challenges from its application.
METHODS: A TLR assessed the effectiveness/safety of treatments for NSCLC with programmed death-ligand-1 (PD-L1) expression ≥50%. Data from sixteen publications covering ten observational studies were extracted using a proprietary LLM. Zero-shot prompts were developed, tested, and optimised using one publication, then applied to all publications. The LLM outputs were copied into a pre-defined data extraction table to capture study/patient characteristics, and effectiveness/safety outcomes, including subgroup data (e.g., PD-L1, sex, age). Extractions were validated by an experienced investigator, and the main challenges were noted.
RESULTS: Two main challenges were identified: difficulty isolating data for the subpopulation of interest (PD-L1 ≥50%) in mixed-population studies, and incorrect or missing subgroup data in the LLM extractions. Detailed validation of results, additional extraction and re-validation of subgroup data, and correction of formatting issues resulted in time expenditure equal to or greater than that of validating manual extractions. The lack of standardisation in the reporting of observational studies also contributed to errors in LLM-assisted extraction. In addition, the LLM did not recognise related publications reporting on the same studies.
CONCLUSIONS: Using LLMs for data extraction in nuanced populations may not yield significant time savings because of the increased validation effort. Experienced human supervision and validation remain crucial for accuracy and completeness. Reviews must account for the time spent on prompt optimisation to capture relevant subpopulations across publications. Separate prompts for subpopulations and for related publications are recommended, but the time needed to develop them should be considered. Future work should explore LLM capabilities to better handle complex data.
Conference/Value in Health Info
2025-11, ISPOR Europe 2025, Glasgow, Scotland
Value in Health, Volume 28, Issue S2
Code
SA101
Topic
Methodological & Statistical Research, Study Approaches
Topic Subcategory
Literature Review & Synthesis
Disease
No Additional Disease & Conditions/Specialized Treatment Areas