Use of Large Language Model (LLM) for Full-Text Screening in Systematic Literature Reviews: A Comparative Analysis
Speaker(s)
Rathi H1, Malik A2, Behera DC2, Kamboj G3
1Skyward Analytics Pvt. Ltd. and EasySLR Pvt. Ltd., Gurugram, Haryana, India, 2EasySLR Pvt. Ltd., Gurugram, Haryana, India, 3Skyward Analytics Pvt. Ltd., Gurugram, Haryana, India
OBJECTIVES: The objective of this study was to compare the performance of three large language models (LLMs), Anthropic Claude, OpenAI GPT, and our Proprietary Model, in the full-text screening stage of a systematic literature review (SLR).
METHODS: We provided identical screening rules and search strategies to all three LLMs for full-text screening of 100 studies. The decisions made by the human reviewer were taken as the reference standard against which LLM performance was assessed. The assessment criteria included decision match rate (proportion of studies with identical inclusion/exclusion decisions between the human reviewer and the LLM), sensitivity score (proportion of studies included by the human reviewer that the LLM also included), specificity score (proportion of studies excluded by the human reviewer that the LLM also excluded), and F1 score (harmonic mean of precision and sensitivity).
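For illustration, the four metrics can be computed from paired human/LLM decisions as in the minimal Python sketch below. The function name and toy data are our own and not part of the study; decisions are assumed to be coded as booleans (True = include).

```python
# Minimal sketch: computing decision match rate, sensitivity, specificity,
# and F1 from paired human/LLM full-text screening decisions.
# Toy data below are illustrative, not the study's dataset.

def screening_metrics(human, llm):
    """Compare LLM include/exclude decisions against a human reviewer."""
    tp = sum(h and m for h, m in zip(human, llm))          # both include
    tn = sum(not h and not m for h, m in zip(human, llm))  # both exclude
    fp = sum(not h and m for h, m in zip(human, llm))      # LLM over-includes
    fn = sum(h and not m for h, m in zip(human, llm))      # LLM misses an include
    n = tp + tn + fp + fn
    decision_match = (tp + tn) / n                         # identical decisions
    sensitivity = tp / (tp + fn) if tp + fn else 0.0       # correct inclusions
    specificity = tn / (tn + fp) if tn + fp else 0.0       # correct exclusions
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if precision + sensitivity else 0.0)             # harmonic mean
    return decision_match, sensitivity, specificity, f1

# Illustrative usage with decisions for five hypothetical studies:
human = [True, False, True, False, False]
llm   = [True, True,  True, False, False]
print(screening_metrics(human, llm))  # (0.8, 1.0, 0.666..., 0.8)
```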
RESULTS: Anthropic Claude, OpenAI GPT, and our Proprietary Model achieved decision match rates of 77.0%, 73.6%, and 72.4%, respectively. The corresponding sensitivity scores were 0.76, 0.82, and 0.94, and the specificity scores were 0.77, 0.71, and 0.67. F1 scores were 0.53, 0.55, and 0.57 for Anthropic Claude, OpenAI GPT, and our Proprietary Model, respectively. In scenario analyses, we noted that the performance metrics of all three LLMs varied substantially with changes in the screening rules and the number of studies analyzed.
CONCLUSIONS: All three LLMs were comparable in decision match rate and F1 score. While our Proprietary Model showed a better sensitivity score than Anthropic Claude and OpenAI GPT in this simulation, these results should be interpreted cautiously, as they may vary with different research questions. The findings highlight the potential of LLMs to assist with the SLR process. Future research should evaluate LLM performance on larger datasets and refine the framing of screening rules so that LLMs interpret them more reliably.
Code
MSR28
Topic
Methodological & Statistical Research, Study Approaches
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics, Literature Review & Synthesis
Disease
No Additional Disease & Conditions/Specialized Treatment Areas