Testing Automated-Prompt Engineering Strategies for Systematic Literature Review Screening

Author(s)

Kim Wager, DPhil1, Gemma Carter, BSc, PhD1, Christian Eichinger, PhD2, Obaro Evuarherhe, PhD1, Polly Field, DPhil1, Tomas Rees, PhD1.
1Oxford PharmaGenesis, Oxford, United Kingdom, 2Oxford PharmaGenesis, Krakow, Poland.
OBJECTIVES: Artificial intelligence (AI) has emerged as a promising approach to alleviating the burden of citation screening in systematic reviews. Despite their potential, large language models (LLMs) exhibit notable variability in performance depending on subtle nuances in prompt design. We sought to test strategic prompt engineering techniques, such as few-shot learning, chain-of-thought (CoT) reasoning and contextual example provision, that can substantially enhance LLM output.
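As a minimal sketch of how zero-shot and CoT screening prompts of the kind tested here could be issued, the following hypothetical OpenAI API call labels one abstract as INCLUDE or EXCLUDE; the prompt wording, criteria text and model choice are illustrative assumptions, not the study's actual prompts.

```python
from openai import OpenAI  # assumes the official openai Python package is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical inclusion/exclusion criteria; the study's actual criteria are not reproduced here.
CRITERIA = "Include abstracts reporting primary HR-pQCT data in humans; exclude reviews and animal studies."

def screen_abstract(abstract: str, chain_of_thought: bool = False) -> str:
    """Ask the model to label one abstract as INCLUDE or EXCLUDE."""
    instruction = (
        "You are screening citations for a systematic literature review.\n"
        f"Criteria: {CRITERIA}\n"
    )
    if chain_of_thought:
        # CoT variant: ask for step-by-step reasoning before the final label.
        instruction += "Reason step by step against each criterion, then give a final verdict: INCLUDE or EXCLUDE."
    else:
        # Zero-shot variant: ask for the label only.
        instruction += "Answer with a single word: INCLUDE or EXCLUDE."

    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[
            {"role": "system", "content": instruction},
            {"role": "user", "content": abstract},
        ],
    )
    return response.choices[0].message.content
```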
METHODS: Using data from a previously published systematic review of high-resolution peripheral quantitative computed tomography (HR-pQCT; n = 534 abstracts) as ground truth, we tested the impact of five automated prompt engineering strategies on the performance of LLMs in a citation screening task: automated CoT reasoning, two prompt optimizers (MIPROv2 and Anthropic's prompt improver), semantic few-shot prompting (k-nearest neighbour example selection) and the Medprompt framework.
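The semantic few-shot strategy can be sketched as follows: previously labelled abstracts are embedded, and for each new abstract the k most semantically similar labelled examples are retrieved and prepended to the screening prompt. The embedding model and helper names below are assumptions for illustration; the abstract does not specify the embedding approach used.

```python
from sentence_transformers import SentenceTransformer  # assumed embedding library, not named in the abstract
from sklearn.neighbors import NearestNeighbors

def build_knn_selector(labelled_abstracts, k: int = 4):
    """labelled_abstracts: list of (text, "INCLUDE"/"EXCLUDE") pairs from the human-screened set.
    Returns a function that picks the k most semantically similar labelled examples."""
    texts = [text for text, _ in labelled_abstracts]
    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
    embeddings = embedder.encode(texts, normalize_embeddings=True)
    knn = NearestNeighbors(n_neighbors=k, metric="cosine").fit(embeddings)

    def select_examples(new_abstract: str):
        query = embedder.encode([new_abstract], normalize_embeddings=True)
        _, indices = knn.kneighbors(query)
        return [labelled_abstracts[i] for i in indices[0]]

    return select_examples

# The selected (abstract, label) pairs would then be formatted as few-shot examples
# ahead of the unlabelled abstract in the screening prompt (4-shot in this study).
```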
RESULTS: Performance was evaluated against the human-screened ground truth data (n = 534). Using GPT-4o with zero-shot prompting as the baseline (inclusion recall: 71%; exclusion recall: 77%), we found that automated CoT reasoning improved inclusion recall to 76% but reduced exclusion recall to 64%; Claude 3.5 Sonnet with CoT outperformed GPT-4o, achieving 79% inclusion recall and 91% exclusion recall; MIPROv2 prompt optimization improved inclusion recall to 85% but reduced exclusion recall to 52%; Anthropic's prompt improver yielded gains in both measures (inclusion recall: 80%; exclusion recall: 97%); semantic few-shot prompting (4-shot) slightly reduced inclusion recall to 70% but improved exclusion recall to 82%; and the Medprompt framework components (few-shot examples, CoT, temperature = 1) showed mixed results, with an inclusion recall of 77% and an exclusion recall of 53%.
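The inclusion and exclusion recall values reported above can be computed from the model labels and the human ground truth; a minimal sketch, assuming labels are encoded as the strings "INCLUDE" and "EXCLUDE":

```python
def recall_by_class(ground_truth, predictions):
    """Per-class recall: the proportion of human-INCLUDEd (or EXCLUDEd) abstracts
    that the model assigned the same label."""
    recalls = {}
    for label in ("INCLUDE", "EXCLUDE"):
        relevant = [p for g, p in zip(ground_truth, predictions) if g == label]
        recalls[label] = sum(p == label for p in relevant) / len(relevant) if relevant else 0.0
    return recalls

# Example: inclusion recall is the share of human-included abstracts that the model also labelled INCLUDE.
print(recall_by_class(["INCLUDE", "INCLUDE", "EXCLUDE"], ["INCLUDE", "EXCLUDE", "EXCLUDE"]))
# {'INCLUDE': 0.5, 'EXCLUDE': 1.0}
```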
CONCLUSIONS: Overall, this study confirmed that model performance is sensitive to prompt instructions and that AI has the potential to reduce screening burden. However, a fully human-in-the-loop approach remains vital, and reviewers using AI should have expertise in both systematic reviews and prompt engineering.

Conference/Value in Health Info

2025-11, ISPOR Europe 2025, Glasgow, Scotland

Value in Health, Volume 28, Issue S2

Code

SA92

Topic

Methodological & Statistical Research, Study Approaches

Topic Subcategory

Literature Review & Synthesis

Disease

No Additional Disease & Conditions/Specialized Treatment Areas
