How Accurate Are Large Language Models for Abstract Screening in Systematic Literature Reviews?

Speaker(s)

Sertel A1, Samur S2, Yildirim I1, Ayer T3, Chhatwal J4
1Value Analytics Labs, Boston, MA, USA, 2Value Analytics Labs, Chantilly, VA, USA, 3Georgia Institute of Technology, Atlanta, GA, USA, 4Harvard Medical School, Boston, MA, USA

OBJECTIVES: In systematic literature reviews (SLRs), abstract screening is a time-consuming process. Large language models (LLMs), such as ChatGPT, show potential for automating SLR tasks, especially text-heavy steps like screening. However, the accuracy of LLMs depends heavily on the prompts crafted for the task. Our objective was to evaluate various prompting techniques for SLR abstract screening, offering insights for improving SLR workflows with LLMs.

METHODS: We implemented a Python pipeline with the OpenAI GPT-4 API to screen abstracts and categorize them as included or excluded. We tested five prompting techniques: Zero-shot, Few-shot, Chain-of-thought (CoT), Zero-shot CoT, and dividing into subtasks. We employed the LangChain library to run these techniques in parallel. Our experiments included 2,950 papers retrieved from initial queries of the PubMed and Embase databases. Results were compared on accuracy (the percentage of abstracts whose classification matched human reviewers' decisions), execution time, and cost.
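The screening step described above can be sketched as follows. This is an illustrative sketch, not the authors' pipeline: the criteria text, prompt wording, and decision labels are hypothetical placeholders, since the abstract does not report the actual prompts or few-shot examples used.

```python
# Sketch of prompt construction for three of the five techniques tested.
# All criteria text and prompt wording below are hypothetical.

CRITERIA = "Include studies reporting original human data; exclude reviews and editorials."

def build_prompt(abstract: str, technique: str, examples=None) -> str:
    """Build a screening prompt for a given prompting technique."""
    base = f"Screening criteria: {CRITERIA}\n\nAbstract: {abstract}\n"
    if technique == "zero_shot":
        # Zero-shot: criteria and abstract only, no examples.
        return base + "Answer with INCLUDE or EXCLUDE only."
    if technique == "few_shot":
        # Few-shot: prepend labeled example abstracts before the target one.
        shots = "\n".join(f"Abstract: {a}\nDecision: {d}" for a, d in examples)
        return f"Screening criteria: {CRITERIA}\n\n{shots}\n\nAbstract: {abstract}\nDecision:"
    if technique == "zero_shot_cot":
        # Zero-shot CoT: add a step-by-step cue before asking for the decision.
        return base + "Let's think step by step, then answer INCLUDE or EXCLUDE."
    raise ValueError(f"unknown technique: {technique}")

def parse_decision(reply: str) -> str:
    """Map a free-text model reply to the included/excluded categories."""
    return "excluded" if "EXCLUDE" in reply.upper() else "included"
```

In a full pipeline, each prompt would be sent to the GPT-4 chat completions endpoint and `parse_decision` applied to the reply; LangChain's batching utilities can issue those calls in parallel across abstracts.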

RESULTS: The overall accuracy, encompassing both inclusion and exclusion of abstracts, was highest with Few-shot (82%) and lowest with Chain-of-thought (65%). Because cost is primarily driven by the number of tokens in the input prompt and the model's output, the costliest technique was Few-shot ($280) and the least costly was Zero-shot ($160). Although various factors, such as token count, server load, and network speed, can influence execution time, in our experiments Zero-shot was the fastest method, whereas Zero-shot CoT was the slowest.
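The token-driven cost relationship noted above can be illustrated with a back-of-envelope estimate. The per-1K-token prices below are 2023-era GPT-4 list prices and are assumptions, as are the token counts; the sketch is not intended to reproduce the study's $160-$280 figures.

```python
def estimate_cost(n_abstracts: int, in_tokens: int, out_tokens: int,
                  price_in_per_1k: float = 0.03,
                  price_out_per_1k: float = 0.06) -> float:
    """Rough API cost estimate in USD for screening a batch of abstracts.

    in_tokens / out_tokens are per-abstract averages; prices are illustrative.
    """
    total_in = n_abstracts * in_tokens
    total_out = n_abstracts * out_tokens
    return total_in / 1000 * price_in_per_1k + total_out / 1000 * price_out_per_1k

# Hypothetical comparison for 2,950 abstracts: few-shot prompts carry extra
# example tokens in the input, so they cost more than zero-shot prompts.
zero_shot_cost = estimate_cost(2950, in_tokens=500, out_tokens=30)
few_shot_cost = estimate_cost(2950, in_tokens=1500, out_tokens=30)
```

Because input tokens dominate, adding few-shot examples inflates cost even when the output stays a one-word decision, which is consistent with Few-shot being the costliest technique in the study.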

CONCLUSIONS: This study sheds light on the effectiveness of five well-known prompting techniques in the conventional abstract screening step of an SLR. The integration of LLMs holds promise for transforming SLR workflows. However, further improvements in LLM accuracy are needed before they can be fully adopted for automating SLRs.

Code

MSR44

Topic

Methodological & Statistical Research, Study Approaches

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics, Literature Review & Synthesis

Disease

No Additional Disease & Conditions/Specialized Treatment Areas