Development of a Retrieval-Augmented Generation Pipeline Leveraging Large Language Models to Support Evidence Synthesis

Speaker(s)

Perera C, Heron L, Hirst A
Adelphi Values PROVE, Bollington, Cheshire, UK

OBJECTIVES: With the advancement of large language models (LLMs) and their capability to achieve human-level performance, there is an opportunity to revolutionize resource-intensive tasks such as evidence synthesis. This research sought to develop a retrieval-augmented generation (RAG) pipeline to support the extraction of unstructured data from Portable Document Format (PDF) files using both proprietary and open-source LLMs.

METHODS: Academic journal publications are widely distributed as PDFs. For this case study, we considered a key trial publication in advanced breast cancer reporting time-to-event outcomes. A Python script was developed to interact with proprietary pre-trained LLMs (OpenAI GPT-4o and GPT-3.5-turbo) via an application programming interface, and with open-source pre-trained LLMs (Meta AI LLaMa-2 and LLaMa-3). The LLMs were prompted to extract data from the publication through pre-specified queries, and their outputs were validated against a data extraction conducted in parallel by evidence synthesis experts.
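The retrieval-augmented extraction step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the paragraph-level chunking, keyword-overlap retriever, and prompt template are illustrative assumptions (a production pipeline would typically use embedding-based retrieval), and the generation call to a proprietary or open-source model is indicated only in a comment.

```python
"""Minimal sketch of one retrieval-augmented extraction query.

Assumptions (not from the abstract): paragraph chunking, a toy
keyword-overlap retriever, and the prompt template are stand-ins.
"""


def chunk_text(document: str) -> list[str]:
    """Split text extracted from a PDF into paragraph-level chunks."""
    return [p.strip() for p in document.split("\n\n") if p.strip()]


def retrieve(chunks: list[str], query: str, k: int = 2) -> list[str]:
    """Rank chunks by overlap with the query terms (toy retriever)."""
    terms = set(query.lower().split())
    ranked = sorted(chunks, key=lambda c: -len(terms & set(c.lower().split())))
    return ranked[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Assemble a grounded prompt for the pre-specified query."""
    joined = "\n---\n".join(context)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{joined}\n\n"
        f"Question: {query}\nAnswer:"
    )


# The assembled prompt would then be sent to a proprietary model
# (e.g. GPT-4o via the OpenAI API) or an open-source model
# (e.g. a locally hosted LLaMa checkpoint) to generate the extraction.
```

In this sketch the retrieval step narrows the publication to the passages most relevant to each pre-specified query before generation, which keeps the prompt within the model's context window and grounds the answer in the source document.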

RESULTS: The analysis found that the proprietary models outperformed the open-source models. Both GPT-4o and GPT-3.5-turbo scored 100% against the human reviewer. The open-source models were less accurate: LLaMa-2 extracted 60% of the items correctly, whilst LLaMa-3 extracted 80%. All models took around 30 seconds to extract the data.

CONCLUSIONS: Whilst only OpenAI’s GPT models extracted all data correctly, open-source LLMs still represent an efficient alternative given their low cost and largely accurate responses to prompts. This study provides a promising indication of the feasibility of using LLMs to semi-automate data extraction in support of evidence synthesis. With the continuous advancement of state-of-the-art open-source LLMs, the scope for future research on evidence synthesis automation is considerable, including, for example, the automation of risk-of-bias assessment.

Code

MSR1

Topic

Methodological & Statistical Research

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics

Disease

No Additional Disease & Conditions/Specialized Treatment Areas