Evaluating Systematic Literature Review Screening Performance of Seven Large Language Models in Response to Different Prompting Strategies

Author(s)

Janoudi G1, Rada (Uzun) M1, Richter T2, Walker M3
1Loon, Ottawa, ON, Canada, 2BioCryst, Durham, NC, USA, 3University of Ottawa, Ottawa, ON, Canada

OBJECTIVES: To compare how different prompting techniques affect the ability of several Large Language Models (LLMs) to screen publications for eligibility in systematic literature reviews (SLRs).

METHODS: We used data from three systematic reviews listed in the publicly available Synergy dataset (N = 3,957 records). We measured the recall (sensitivity) and precision of seven leading LLMs of various sizes using three prompting techniques: zero-shot, multi-shot, and chain of thought (CoT). Identical inclusion and exclusion criteria were provided to each model. Performance metrics and 95% confidence intervals (CI) were calculated through bootstrapping.
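The three prompting techniques named above can be illustrated with hypothetical templates. This is a sketch only, not the authors' actual prompts; the placeholder names (`criteria`, `record`, `example_record`, `example_answer`) are assumptions for illustration.

```python
# Illustrative templates for the three prompting styles compared in the
# abstract (zero-shot, multi-shot, chain of thought). These are NOT the
# study's actual prompts; wording and placeholders are hypothetical.

ZERO_SHOT = (
    "Inclusion/exclusion criteria:\n{criteria}\n\n"
    "Record:\n{record}\n\n"
    "Answer INCLUDE or EXCLUDE."
)

MULTI_SHOT = (
    "Inclusion/exclusion criteria:\n{criteria}\n\n"
    # Worked example(s) precede the target record in multi-shot prompting.
    "Example record:\n{example_record}\nAnswer: {example_answer}\n\n"
    "Record:\n{record}\n\n"
    "Answer INCLUDE or EXCLUDE."
)

CHAIN_OF_THOUGHT = (
    "Inclusion/exclusion criteria:\n{criteria}\n\n"
    "Record:\n{record}\n\n"
    # CoT asks the model to reason against each criterion before deciding.
    "Check the record against each criterion step by step, "
    "then answer INCLUDE or EXCLUDE."
)

# Example usage with toy values:
prompt = CHAIN_OF_THOUGHT.format(
    criteria="Adults; randomized controlled trials only.",
    record="Title: ... Abstract: ...",
)
```

Per the abstract, identical inclusion and exclusion criteria were supplied to every model; only the prompt structure varied.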

RESULTS: With CoT, CommandR+ recall and precision were 0.84 (95%CI: 0.76 to 0.91) and 0.11 (95%CI: 0.09 to 0.14), respectively. Llama-3-70B recall and precision were 0.84 (95%CI: 0.76 to 0.91) and 0.19 (95%CI: 0.15 to 0.23) with CoT. Llama-3-8B recall and precision were 0.81 (95%CI: 0.73 to 0.90) and 0.10 (95%CI: 0.08 to 0.13) with CoT. GPT-4o recall and precision were 0.64 (95%CI: 0.54 to 0.75) and 0.22 (95%CI: 0.17 to 0.27) with CoT. Mistral-Large recall and precision were 0.64 (95%CI: 0.54 to 0.75) and 0.18 (95%CI: 0.14 to 0.23) with CoT. GPT-3.5 recall and precision were 0.64 (95%CI: 0.53 to 0.73) and 0.03 (95%CI: 0.03 to 0.04) with multi-shot. CommandR recall and precision were 0.62 (95%CI: 0.51 to 0.72) and 0.11 (95%CI: 0.08 to 0.14) with CoT.
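The recall, precision, and bootstrapped 95% CIs reported above can be sketched as follows. This is a minimal illustration of the general approach (percentile bootstrap over screening decisions), not the authors' code; the toy labels, iteration count, and seed are assumptions.

```python
# Minimal sketch: recall, precision, and percentile-bootstrap 95% CIs
# for binary screening decisions (1 = include, 0 = exclude).
# Toy data and parameters are illustrative, not from the study.
import random

def recall_precision(y_true, y_pred):
    """Return (recall, precision) for paired binary labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision

def bootstrap_ci(y_true, y_pred, metric_idx, n_boot=2000, seed=0):
    """Percentile-bootstrap 95% CI for recall (metric_idx=0) or
    precision (metric_idx=1), resampling records with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(recall_precision([y_true[i] for i in idx],
                                      [y_pred[i] for i in idx])[metric_idx])
    stats.sort()
    return stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot)]

# Toy usage: human screening labels vs. one model's decisions.
truth = [1, 1, 1, 0, 0, 0, 0, 0] * 50
model = [1, 1, 0, 1, 0, 0, 0, 0] * 50
rec, prec = recall_precision(truth, model)
rec_lo, rec_hi = bootstrap_ci(truth, model, metric_idx=0)
```

In SLR screening the class imbalance is typically severe (few eligible records among thousands), which is why recall can be high while precision stays low, as seen for several models above.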

CONCLUSIONS: CoT produced better recall than zero-shot or multi-shot prompts in all models except GPT-3.5. While Llama-3-8B, a relatively smaller model, produced better recall than two larger models, GPT-4o and Mistral-Large, this came at the cost of lower precision. These results suggest that the CoT prompting technique yields better recall than the other techniques. Further research is needed to refine these models and prompting approaches to enhance their recall, precision, and utility in SLR screening.

Conference/Value in Health Info

2024-11, ISPOR Europe 2024, Barcelona, Spain

Value in Health, Volume 27, Issue 12, S2 (December 2024)

Code

MSR59

Topic

Methodological & Statistical Research, Study Approaches

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics, Literature Review & Synthesis

Disease

No Additional Disease & Conditions/Specialized Treatment Areas
