Integrating Large Language Models Into an Existing Review Process: Promises and Pitfalls
Author(s)
Edwards M1, Ferrante di Ruffano L2
1York Health Economics Consortium, York, YOR, UK, 2York Health Economics Consortium, York, NYK, UK
OBJECTIVES: The recent development and rise in accessibility of large language models (LLMs) have generated excitement around their potential to reduce the resource burden of conducting reviews. Following internal testing, we assessed the cost, accuracy, and accessibility of LLMs to reviewers, and considered which types of reviews LLMs are currently best suited to assist with.
METHODS: We conducted internal testing of an LLM, Claude 3 Opus, via its chat interface. We used the tool to conduct high-level data extraction for a targeted review, highly granular data extraction for a systematic review, and risk of bias assessment of randomized controlled trials (RCTs).
RESULTS: The LLM chat interface was highly accessible and inexpensive, and it saved significant time in conducting high-level qualitative data extraction for a pragmatic review. Outputs were standardized and easy to manipulate and integrate into our existing work process. Extracting accurate granular data for a systematic review proved more difficult: the model failed to interpret the complexities of patient flow, struggled to respond accurately to lengthy, detailed prompts, and the subsequent checking, correcting, and formatting outweighed any time saved. The model identified some relevant content for conducting risk of bias assessment with the Cochrane RoB 1 tool, although it lacked context, and human judgement was needed for final decision-making.
CONCLUSIONS: LLM chat interfaces offer significant time savings for pragmatic reviews, although copyright issues arise when uploading published papers for synthesis. Optimal performance for systematic reviews is unlikely to be achieved without fine-tuning a version of the model on archive data; this process is currently costly, commercial confidentiality must be considered, and the skill set required is beyond the scope of many review teams. Developers should ensure that any LLM-based reviewing tools can be integrated into clients’ existing processes through standardized import and export formats such as CSV or RIS.
Conference/Value in Health Info
Value in Health, Volume 27, Issue 12, S2 (December 2024)
Code
MSR118
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas