Use of Large Language Model (LLM) for Full-Text Screening in Systematic Literature Reviews: A Comparative Analysis

Speaker(s)

Rathi H1, Malik A2, Behera DC2, Kamboj G3
1Skyward Analytics Pvt. Ltd. and EasySLR Pvt. Ltd., Gurugram, Haryana, India, 2EasySLR Pvt. Ltd., Gurugram, Haryana, India, 3Skyward Analytics Pvt. Ltd., Gurugram, Haryana, India

OBJECTIVES: The objective of this study was to compare the performance of three large language models (LLMs), Anthropic Claude, OpenAI GPT, and our Proprietary Model, in the full-text screening stage of a systematic literature review (SLR).

METHODS: We provided identical screening rules and search strategies to all three LLMs for full-text screening of 100 studies. The decisions made by the human reviewer were taken as the reference standard for assessing LLM performance. The assessment criteria included decision match rate (identical inclusion and exclusion decisions between the human reviewer and the LLM), sensitivity score (correct inclusions by the LLM relative to the human reviewer), specificity score (correct exclusions by the LLM relative to the human reviewer), and F1 score (predictive performance measure).
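As an illustrative sketch only (not part of the study's methods), the four assessment criteria can be computed from paired human/LLM include-exclude decisions as follows; the function name and data are hypothetical:

```python
def screening_metrics(human, llm):
    """Compute screening-agreement metrics.

    human, llm: equal-length lists of booleans (True = include),
    where the human reviewer's decisions are the reference standard.
    """
    pairs = list(zip(human, llm))
    tp = sum(h and m for h, m in pairs)          # correct inclusions
    tn = sum(not h and not m for h, m in pairs)  # correct exclusions
    fp = sum(not h and m for h, m in pairs)      # LLM included, human excluded
    fn = sum(h and not m for h, m in pairs)      # LLM excluded, human included

    match_rate = (tp + tn) / len(pairs)          # decision match rate
    sensitivity = tp / (tp + fn)                 # correct inclusions vs. human
    specificity = tn / (tn + fp)                 # correct exclusions vs. human
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return match_rate, sensitivity, specificity, f1
```

The F1 score combines precision and sensitivity, which is why a model with the highest sensitivity (here, the Proprietary Model) need not have a proportionally higher F1 score if its specificity, and hence precision, is lower.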

RESULTS: Anthropic Claude, OpenAI GPT, and our Proprietary Model achieved decision match rates of 77.0%, 73.6%, and 72.4%, respectively. The corresponding sensitivity scores were 0.76, 0.82, and 0.94, and the specificity scores were 0.77, 0.71, and 0.67, respectively. The F1 scores for Anthropic Claude, OpenAI GPT, and our Proprietary Model were 0.53, 0.55, and 0.57, respectively. In scenario analysis, we noted that the performance metrics of all three LLMs varied substantially with changes in the screening rules and the number of studies analyzed.

CONCLUSIONS: All three LLMs were comparable on decision match rate and F1 score. While our Proprietary Model showed a higher sensitivity score than Anthropic Claude and OpenAI GPT in this simulation, these results should be interpreted cautiously, as they may vary with different research questions. The findings highlight the potential of LLMs to assist with the SLR process. Future research should analyze the performance of LLMs on larger datasets and calibrate the framing of screening rules for better interpretation by LLMs.

Code

MSR28

Topic

Methodological & Statistical Research, Study Approaches

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics, Literature Review & Synthesis

Disease

No Additional Disease & Conditions/Specialized Treatment Areas