Large-Language Models to Complement and Augment Literature Review: Hi! How Can I Help You?
Author(s)
Sarah Goring, MSc1; Simon J. Goring, PhD2
1SMG Outcomes Research, Vancouver, BC, Canada, 2University of Wisconsin, American Family Data Science Institute, Madison, WI, USA
OBJECTIVES: Our objective was to evaluate whether large language models (LLMs) can be used to augment and complement literature reviews.
METHODS: We conducted a case study in non-small cell lung cancer (NSCLC), using a PubMed search run on December 11, 2024. We wrote a Python script (LR_helper) to communicate with Generative Pre-trained Transformer 4 omni (GPT-4o) via application programming interface (API) calls. Zero-shot persona pattern prompting was used to generate a structured data set from titles and abstracts. Data elements included: study name; disease stage; mutation or biomarker-defined sub-population; treatment setting; intervention; comparator; study design; and study phase. Prompt engineering was conducted prior to final implementation. Human-based extractions were used to evaluate performance. The script was run 5 times to evaluate the consistency of GPT-4o responses across model runs. ClinicalTrials.gov identifiers were extracted, and open-access articles were retrieved programmatically.
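As a minimal sketch of the workflow described above (not the authors' LR_helper script), zero-shot persona-pattern extraction and ClinicalTrials.gov identifier capture might look like the following. The prompt wording, function names, and field names are illustrative assumptions; only the gpt-4o model name, JSON response format, and the NCT-identifier pattern (NCT plus 8 digits) reflect documented conventions.

```python
# Illustrative sketch only -- not the authors' LR_helper script.
# Assumes the OpenAI Python client (openai>=1.0) and an OPENAI_API_KEY
# environment variable.
import json
import re

from openai import OpenAI

client = OpenAI()

# Persona-pattern system prompt: assign the model a reviewer role and
# request a fixed set of structured fields (wording is hypothetical).
PERSONA_PROMPT = (
    "You are an experienced systematic-review analyst. From the title and "
    "abstract below, return a JSON object with the keys: study_name, "
    "disease_stage, biomarker_subpopulation, treatment_setting, "
    "intervention, comparator, study_design, study_phase. "
    "If a field cannot be determined, use the value \"I don't know\"."
)

def extract_record(title: str, abstract: str) -> dict:
    """Zero-shot extraction of structured fields from one PubMed record."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # reduce run-to-run variability
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": PERSONA_PROMPT},
            {"role": "user", "content": f"Title: {title}\n\nAbstract: {abstract}"},
        ],
    )
    return json.loads(response.choices[0].message.content)

def find_nct_ids(text: str) -> list[str]:
    """ClinicalTrials.gov identifiers follow the fixed pattern NCT + 8 digits."""
    return re.findall(r"NCT\d{8}", text)
```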
RESULTS: One hundred records were analyzed for the case study. Recall (sensitivity) was 100% for key disease categories (resectable, unresectable, advanced/metastatic NSCLC); mutation or biomarker-defined populations (EGFR+, ALK+, ROS1+, KRAS+, PD-L1-positive); treatment setting (neoadjuvant, perioperative, adjuvant, first-line, second-or-later line); study design (randomized controlled trial, non-randomized trial, observational); and phase (1, 1/2, 2, 3). Recall was lower across “other” and “I don’t know” categories (12% to 100%). Study names were 98% accurate, with 2 missed and none fabricated. Across the 5 model runs, consistency ranged from 96% to 100%; differences were limited to “other” and “I don’t know” fields. Automated retrieval of ClinicalTrials.gov identifiers and open-access articles facilitated cross-referencing.
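Per-category recall against the human-extracted reference reduces to true positives over true positives plus false negatives. A minimal sketch of that calculation follows; the label values and the recall helper are illustrative, not the authors' evaluation code.

```python
# Illustrative recall (sensitivity) calculation against a human gold standard.
def recall(llm_labels: list[str], human_labels: list[str], category: str) -> float:
    """Of the records the human assigned to `category`, the fraction the
    LLM also assigned to it: TP / (TP + FN)."""
    true_pos = sum(
        1 for llm, human in zip(llm_labels, human_labels)
        if human == category and llm == category
    )
    false_neg = sum(
        1 for llm, human in zip(llm_labels, human_labels)
        if human == category and llm != category
    )
    denom = true_pos + false_neg
    return true_pos / denom if denom else float("nan")

# Hypothetical disease-stage labels for four records:
human = ["resectable", "unresectable", "advanced/metastatic", "resectable"]
llm   = ["resectable", "unresectable", "other",               "resectable"]
print(recall(llm, human, "resectable"))           # 1.0 -- both captured
print(recall(llm, human, "advanced/metastatic"))  # 0.0 -- one record missed
```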
CONCLUSIONS: Concerns about the transparency and trustworthiness of LLMs currently limit their wider adoption for supporting decision-making around medical interventions. The current research demonstrates an application of LLMs that can be used (with human oversight) to augment and complement targeted reviews; inform study scoping; and enable expedited access to an up-to-date preliminary evidence set for systematic reviews while awaiting gold-standard human-based review.
Conference/Value in Health Info
2025-05, ISPOR 2025, Montréal, Quebec, CA
Value in Health, Volume 28, Issue S1
Code
PT12
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas, SDC: Oncology