An AI-Enhanced Targeted Literature Review Workflow to Support Inclusive Research Practices and Population Science Research: A Human-in-the-Loop Approach

Author(s)

Gemma Carter, PhD¹, Obaro Evuarherhe, PhD¹, Kim Wager, BSc (Hons) MSc DPhil¹, Ruma Bhagat, MD, MPH², Nicole Richie, PhD², Bruno Jolain, MD².
¹Oxford PharmaGenesis Ltd, Oxford, United Kingdom, ²F. Hoffmann-La Roche Ltd, Basel, Switzerland.

OBJECTIVES: Inclusive research practices require an understanding of population differences for a given disease area. However, for disease areas with extensive literature, identifying relevant evidence can be time-consuming and resource-intensive. Large language models, such as GPT-4, can streamline targeted literature reviews (TLRs) by accelerating the screening process while maintaining high accuracy. Here, we developed a GPT-4-based approach to automate citation screening for a TLR on health disparities in inflammatory bowel disease.
METHODS: Initial searches in MEDLINE and Embase identified 4338 articles. A pilot set of 251 articles was manually reviewed to establish a ground truth dataset for model testing. We accessed GPT-4 via its application programming interface and implemented custom code to automate the screening process. We developed prompts for article inclusion based on predefined eligibility criteria, which were iteratively refined over five rounds of testing. For each iteration, performance metrics, including sensitivity, specificity, and overall accuracy, were calculated by comparing GPT-4 outputs against the ground truth dataset. Prompt refinement prioritized sensitivity over specificity to reduce the risk of missing relevant studies. After refinement, the prompts were used to screen the remaining articles without further human involvement.
RESULTS: In the first iteration, GPT-4 prompts achieved 89.0% sensitivity, 53.1% specificity, and 66.1% overall accuracy, with eight articles incorrectly excluded. After five refinement phases, the final version demonstrated improved performance, achieving 97.2% sensitivity, 71.0% specificity, and 82.1% overall accuracy, with only three eligible articles incorrectly excluded.
CONCLUSIONS: Our AI-assisted workflow demonstrates that GPT-4, when combined with systematic prompt engineering and human oversight, delivers high sensitivity and accuracy in TLR screening. This scalable, robust approach makes reviews feasible in disease areas where a large screening burden may previously have made them impractical. It enables TLRs in inclusive research practices and population science research to inform clinical trial design and meet regulatory diversity requirements.
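
ILLUSTRATIVE SKETCH: The comparison of model decisions against the ground truth dataset described in METHODS can be sketched as below. This is a minimal example and not the authors' code: the names `ScreeningMetrics` and `screening_metrics` are hypothetical, and the labels in the usage example are toy values, not study data. It assumes each article has a manual include/exclude decision and a model include/exclude decision, both stored as booleans.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ScreeningMetrics:
    """Hypothetical container for one prompt iteration's results."""
    sensitivity: float    # share of truly eligible articles the model included
    specificity: float    # share of truly ineligible articles the model excluded
    accuracy: float       # share of all decisions that matched the ground truth
    false_negatives: int  # eligible articles incorrectly excluded


def screening_metrics(truth: Sequence[bool],
                      predicted: Sequence[bool]) -> ScreeningMetrics:
    """Compare model decisions with manual decisions (True = include)."""
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")

    # Count the four cells of the confusion matrix.
    tp = sum(1 for t, p in zip(truth, predicted) if t and p)
    fn = sum(1 for t, p in zip(truth, predicted) if t and not p)
    fp = sum(1 for t, p in zip(truth, predicted) if not t and p)
    tn = sum(1 for t, p in zip(truth, predicted) if not t and not p)

    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one eligible and one ineligible article")

    return ScreeningMetrics(
        sensitivity=tp / (tp + fn),
        specificity=tn / (tn + fp),
        accuracy=(tp + tn) / len(truth),
        false_negatives=fn,
    )


if __name__ == "__main__":
    # Toy labels for illustration only (not data from the study).
    manual = [True, True, True, True, False, False, False, False, False, False]
    model = [True, True, True, False, True, True, False, False, False, False]
    print(screening_metrics(manual, model))
```

Reporting the count of false negatives alongside the rates reflects the stated priority of sensitivity over specificity: an incorrectly excluded eligible article is lost to the review, whereas an incorrectly included article can still be rejected at a later stage.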

Conference/Value in Health Info

2025-11, ISPOR Europe 2025, Glasgow, Scotland

Value in Health, Volume 28, Issue S2

Code

SA11

Topic

Study Approaches

Topic Subcategory

Literature Review & Synthesis

Disease

No Additional Disease & Conditions/Specialized Treatment Areas

Presentation (CTI)