Large-Language Models to Complement and Augment Literature Review: Hi! How Can I Help You?

Author(s)

Sarah Goring, MSc1, Simon J. Goring, PhD2
1SMG Outcomes Research, Vancouver, BC, Canada, 2University of Wisconsin, American Family Data Science Institute, Madison, WI, USA
OBJECTIVES: Our objective was to evaluate whether large language models (LLMs) can be used to augment and complement literature reviews.
METHODS: We conducted a case study in non-small cell lung cancer (NSCLC), using a PubMed search run on December 11, 2024. We wrote a Python script (LR_helper) to communicate with Generative Pre-trained Transformer 4 omni (GPT-4o) via application programming interface (API) calls. Zero-shot persona pattern prompting was used to generate a structured data set from titles and abstracts. Data elements included: study name; disease stage; mutation or biomarker-defined sub-population; treatment setting; intervention; comparator; study design; and study phase. Prompt engineering was conducted prior to final implementation. Human-based extractions were used to evaluate performance. The script was repeated 5 times to evaluate GPT-4o response consistency across model runs. ClinicalTrials.gov identifiers were extracted and open-access articles were retrieved programmatically.
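As an illustration of the workflow described above, the sketch below shows how a script like LR_helper might call GPT-4o for structured extraction. It is hypothetical: the authors' actual prompts and code are not published, so the persona wording, the JSON field names, and the helper functions (extract_record, find_nct_ids) are assumptions. The sketch assumes the OpenAI Python SDK; NCT identifiers do follow the fixed pattern NCT plus eight digits.

```python
# Minimal, hypothetical sketch of the LR_helper workflow; not the authors' code.
import json
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative persona-pattern system prompt (zero-shot), per the Methods.
PERSONA_PROMPT = (
    "You are an experienced systematic-review analyst specializing in oncology. "
    "From the title and abstract provided, extract the following fields as JSON: "
    "study_name, disease_stage, biomarker_subpopulation, treatment_setting, "
    "intervention, comparator, study_design, study_phase. "
    "Use 'I don't know' when a field cannot be determined from the text."
)

def extract_record(title: str, abstract: str) -> dict:
    """Zero-shot persona-pattern extraction of one PubMed record via GPT-4o."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # favor reproducibility across repeated model runs
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": PERSONA_PROMPT},
            {"role": "user", "content": f"Title: {title}\n\nAbstract: {abstract}"},
        ],
    )
    return json.loads(response.choices[0].message.content)

def find_nct_ids(text: str) -> list[str]:
    """ClinicalTrials.gov identifiers are 'NCT' followed by 8 digits."""
    return re.findall(r"NCT\d{8}", text)
```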
RESULTS: One hundred records were analyzed for the case study. Recall (sensitivity) was 100% for key disease categories (resectable, unresectable, advanced/metastatic NSCLC); mutation or biomarker-defined populations (EGFR+, ALK+, ROS1+, KRAS+, PD-L1-positive); treatment setting (neoadjuvant, perioperative, adjuvant, first-line, second-or-later line); study design (randomized controlled trial, non-randomized trial, observational); and phase (1, 1/2, 2, 3). Recall was lower for the “other” and “I don’t know” categories, ranging from 12% to 100%. Study names were 98% accurate, with 2 missed (no fabrications). Across the 5 model runs, consistency ranged from 96% to 100%; differences were limited to “other” and “I don’t know” fields. Automated retrieval of ClinicalTrials.gov identifiers and open-access articles facilitated cross-referencing.
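The consistency evaluation across the 5 model runs could be sketched as below. This is again hypothetical: run_consistency and the per-field agreement metric (share of runs matching the modal value) are illustrative choices, and the sketch reuses the assumed extract_record helper from above.

```python
# Hypothetical consistency check: repeat the extraction per record and report
# per-field agreement across runs.
from collections import Counter

def run_consistency(title: str, abstract: str, n_runs: int = 5) -> dict[str, float]:
    """For each field, the fraction of runs agreeing with the most common value."""
    runs = [extract_record(title, abstract) for _ in range(n_runs)]
    agreement = {}
    for field in runs[0]:
        values = [run.get(field) for run in runs]
        agreement[field] = Counter(values).most_common(1)[0][1] / n_runs
    return agreement
```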
CONCLUSIONS: Concerns about the transparency and trustworthiness of LLMs currently limit their wider adoption for supporting decision-making around medical interventions. The current research demonstrates an application of LLMs that can be used (with human oversight) to augment/complement targeted reviews; inform study scoping; and enable expedited access to an up-to-date preliminary evidence set for systematic reviews while awaiting gold-standard human-based review.

Conference/Value in Health Info

2025-05, ISPOR 2025, Montréal, Quebec, CA

Value in Health, Volume 28, Issue S1

Code

PT12

Topic

Methodological & Statistical Research

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics

Disease

No Additional Disease & Conditions/Specialized Treatment Areas, SDC: Oncology
