A Comparative Analysis of Large Language Models (LLMs) Utilised in Systematic Literature Review

Author(s)

Rathi H¹, Malik A², Behera DC², Kamboj G³
¹Skyward Analytics Pvt. Ltd., Gurugram, Haryana, India, ²EasySLR Pvt. Ltd., Gurugram, Haryana, India, ³Skyward Analytics Pvt. Ltd., New Delhi, DL, India

Presentation Documents

EU ISPOR 2023_Podium presentation_v1.0132550.pdf

OBJECTIVES: The objective of this study was to compare three large language models (LLMs), namely AI21 Ultra, OpenAI GPT-4, and Google Vertex Artificial Intelligence (AI) Model Bison, in their application to primary screening during a systematic literature review (SLR).

METHODS: We provided each LLM with three to five sample responses and identical screening rules for primary screening (title and abstract screening) of 100 studies. The decisions made by the human reviewer were taken as the reference response against which the performance of the LLMs was gauged. Models were assessed on decision match rate (defined as the proportion of cases where inclusion and exclusion decisions were identical between the human reviewer and the LLM) and sensitivity score (defined as the ratio of correct inclusions by the LLM to overall inclusions by the human reviewer).
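The two metrics defined above can be illustrated with a minimal Python sketch. The function names and the five-study toy data are hypothetical and chosen only for illustration; they are not the study's data or the authors' implementation. Decisions are encoded as booleans, with True meaning "include".

```python
from typing import Sequence


def decision_match_rate(human: Sequence[bool], llm: Sequence[bool]) -> float:
    """Proportion of studies where the LLM decision equals the human decision."""
    if len(human) != len(llm):
        raise ValueError("human and llm must contain the same number of decisions")
    if len(human) == 0:
        raise ValueError("at least one decision is required")
    matches = sum(h == m for h, m in zip(human, llm))
    return matches / len(human)


def sensitivity_score(human: Sequence[bool], llm: Sequence[bool]) -> float:
    """Correct inclusions by the LLM divided by all inclusions by the human reviewer."""
    if len(human) != len(llm):
        raise ValueError("human and llm must contain the same number of decisions")
    human_inclusions = sum(human)
    if human_inclusions == 0:
        raise ValueError("the human reviewer included no studies")
    correct_inclusions = sum(h and m for h, m in zip(human, llm))
    return correct_inclusions / human_inclusions


# Hypothetical toy data (not from the study): True = include, False = exclude.
human_decisions = [True, True, False, False, True]
llm_decisions = [True, False, False, True, True]

print(f"Decision match rate: {decision_match_rate(human_decisions, llm_decisions):.2%}")
print(f"Sensitivity score:   {sensitivity_score(human_decisions, llm_decisions):.2f}")
```

In this toy case the LLM agrees with the human reviewer on three of five studies and recovers two of the three human inclusions. The two metrics can diverge because sensitivity ignores studies that the human reviewer excluded.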

RESULTS: Model Bison, GPT-4, and AI21 Ultra scored a decision match rate of 67.01%, 65.56%, and 64.0%, respectively. Model Bison had the highest sensitivity score of 0.90, followed by 0.74 and 0.71 for GPT-4 and AI21 Ultra, respectively. In scenario analysis, we noted that the performance metrics of all LLMs varied substantially with amendments to the screening rules, the number of sample responses fed, and the number of studies analysed. While the decision match rate dropped to 51% in a few scenarios, the sensitivity score increased to 0.97 in others.

CONCLUSIONS: The results highlight LLMs' potential to assist with the SLR process. All three LLMs were comparable on the decision match rate metric, whereas in our simulation, Model Bison showed a better sensitivity score than GPT-4 and AI21 Ultra. However, the results should be interpreted cautiously, as they may vary with different research questions. Future research should consider analysing the performance of LLMs on larger datasets, varying the number of sample responses fed, and calibrating the framing of screening rules for better understanding by AI.

Conference/Value in Health Info

2023-11, ISPOR Europe 2023, Copenhagen, Denmark

Value in Health, Volume 26, Issue 11, S2 (December 2023)

Acceptance Code

P21

Topic

Methodological & Statistical Research, Study Approaches

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics, Literature Review & Synthesis

Disease

No Additional Disease & Conditions/Specialized Treatment Areas
