A Comparative Analysis of Large Language Models (LLM) Utilised in Systematic Literature Review
Author(s)
Rathi H1, Malik A2, Behera DC2, Kamboj G3
1Skyward Analytics Pvt. Ltd., Gurugram, Haryana, India, 2EasySLR Pvt. Ltd., Gurugram, Haryana, India, 3Skyward Analytics Pvt. Ltd., New Delhi, DL, India
OBJECTIVES: This study compared three large language models (LLMs), AI21 Ultra, OpenAI GPT-4, and Google Vertex Artificial Intelligence (AI) Model Bison, in their application to primary screening during a systematic literature review (SLR).
METHODS: We supplied all LLMs with three to five sample responses and identical screening rules for primary screening (title and abstract screening) of 100 studies. The human reviewer's decisions were taken as the reference standard against which LLM performance was gauged. Models were assessed on decision match rate (the proportion of studies for which the inclusion or exclusion decision was identical between the human reviewer and the LLM) and sensitivity score (the ratio of correct inclusions by the LLM to all inclusions by the human reviewer).
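As an illustration of the two metrics defined in the Methods, the following is a minimal sketch in Python, not the authors' implementation. It assumes each screening decision is encoded as a boolean (True for include, False for exclude) and that both lists cover the same studies in the same order. The function names and the toy decisions are hypothetical and do not come from the study data.

```python
from typing import Sequence


def decision_match_rate(human: Sequence[bool], llm: Sequence[bool]) -> float:
    """Proportion of studies where the LLM decision equals the human decision."""
    if len(human) != len(llm):
        raise ValueError("decision lists must have the same length")
    if not human:
        raise ValueError("at least one study is required")
    matches = sum(h == m for h, m in zip(human, llm))
    return matches / len(human)


def sensitivity_score(human: Sequence[bool], llm: Sequence[bool]) -> float:
    """Correct LLM inclusions divided by all inclusions made by the human reviewer."""
    if len(human) != len(llm):
        raise ValueError("decision lists must have the same length")
    human_inclusions = sum(human)
    if human_inclusions == 0:
        raise ValueError("sensitivity is undefined without human inclusions")
    correct_inclusions = sum(h and m for h, m in zip(human, llm))
    return correct_inclusions / human_inclusions


# Hypothetical toy decisions for five studies (True = include, False = exclude).
human_decisions = [True, True, False, False, True]
llm_decisions = [True, False, False, True, True]

print(decision_match_rate(human_decisions, llm_decisions))  # 3 of 5 decisions match
print(sensitivity_score(human_decisions, llm_decisions))    # 2 of 3 human inclusions found
```

In this toy case the two metrics diverge: an LLM can agree with the reviewer on most decisions while still missing relevant studies, which is why the abstract reports sensitivity alongside the match rate.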
RESULTS: Model Bison, GPT-4, and AI21 Ultra achieved decision match rates of 67.01%, 65.56%, and 64.0%, respectively. Model Bison had the highest sensitivity score (0.90), followed by GPT-4 (0.74) and AI21 Ultra (0.71). In scenario analyses, the performance metrics of all LLMs varied substantially with amendments to the screening rules, the number of sample responses supplied, and the number of studies analysed. The decision match rate dropped to 51% in a few scenarios, whereas the sensitivity score rose to 0.97 in others.
CONCLUSIONS: The results highlight the potential of LLMs to assist with the SLR process. All three LLMs were comparable on decision match rate, whereas in our simulation Model Bison showed a better sensitivity score than GPT-4 and AI21 Ultra. However, these results should be interpreted cautiously because they may vary with different research questions. Future research should analyse the performance of LLMs on larger datasets, vary the number of sample responses supplied, and calibrate the framing of screening rules so that the AI understands them better.
Conference/Value in Health Info
Value in Health, Volume 26, Issue 11, S2 (December 2023)
Acceptance Code
P21
Topic
Methodological & Statistical Research, Study Approaches
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics, Literature Review & Synthesis
Disease
No Additional Disease & Conditions/Specialized Treatment Areas