An AI-Powered RAG Based Framework for Data Extractions in Systematic Literature Reviews
Author(s)
Rajdeep Kaur, PhD1, Barinder Singh, RPh2, Pankaj Rai, MS1, Vedant Soni, BE1, Sunil Kumar, M.Pharm1, Mrinal Mayank, BE1;
1Pharmacoevidence, Mohali, India, 2Pharmacoevidence, London, United Kingdom
OBJECTIVES: Data extraction is an important step in systematic literature reviews (SLRs) for collecting detailed information from the included studies. The aim of this study was to develop an AI-automated, RAG-driven platform to streamline the extraction of relevant information from the studies included in SLRs, thereby reducing the time required for data extraction.
METHODS: The Embase and MEDLINE databases were searched to identify cost-burden studies conducted in patients with retinitis pigmentosa (RP) and published over a 15-year period (2009 to 2024). A dynamic Retrieval-Augmented Generation (RAG) pipeline was developed that used Optical Character Recognition to standardize the content of the articles. The standardized content was then divided into small chunks, and the chunk embeddings were stored in a vector database. A multi-agent approach was used within the framework to extract the relevant information. Domain experts with at least 10 years of experience evaluated the extraction results and cross-verified them against the data extraction grid to ensure accuracy and consistency.
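The chunk, embed, store, and retrieve steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the chunk size, the embedding model, or the vector database, so the word-window chunker, the term-count "embedding", and the `InMemoryVectorStore` class below are all hypothetical stand-ins chosen to keep the example self-contained. The OCR and multi-agent steps are omitted.

```python
import math
import re
from collections import Counter


def chunk_text(text, chunk_size=40, overlap=10):
    """Split text into overlapping word windows (sizes are illustrative assumptions)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def embed(text):
    """Toy embedding: lowercase term counts. A real pipeline would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryVectorStore:
    """Hypothetical stand-in for the vector database mentioned in the abstract."""

    def __init__(self):
        self.items = []  # list of (chunk_text, embedding) pairs

    def add(self, chunk):
        self.items.append((chunk, embed(chunk)))

    def search(self, query, k=2):
        """Return the k stored chunks most similar to the query."""
        query_vec = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(query_vec, item[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:k]]


if __name__ == "__main__":
    # Made-up placeholder text, not taken from any included study.
    article = (
        "Study design: retrospective analysis. Population: adults with retinitis pigmentosa. "
        "Direct costs: outpatient visits and low-vision aids. "
        "Indirect costs: productivity loss and caregiver time."
    )
    store = InMemoryVectorStore()
    for chunk in chunk_text(article, chunk_size=8, overlap=2):
        store.add(chunk)
    # The retrieved chunks would be passed to the extraction model as context.
    print(store.search("indirect costs productivity loss", k=1))
```

In a setup like this, each field of the data extraction grid would be turned into a query, and only the top-ranked chunks would be supplied to the model. The overlap between neighbouring chunks is a common way to avoid splitting a relevant sentence across a chunk boundary.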
RESULTS: The SLR included a total of six studies, conducted in the United States (US) (n=2), Japan (n=2), Spain (n=1), and across multiple countries (US and Canada, n=1). The AI platform was used to extract study characteristics, population characteristics, direct and indirect cost outcomes, and key findings from the included studies. Domain experts rated 92% of the AI-extracted responses as “strongly agree”. However, in two instances (approximately 8% of the tested prompts), the AI introduced extra content, noise, or hallucinations, including one notable inaccuracy involving cross-referenced data from a linked study.
CONCLUSIONS: The AI-powered, RAG-based framework represents a significant advancement in automating the data extraction phase of SLRs. Future work will focus on expanding the system's capabilities to handle more complex extraction scenarios involving linked studies and data presented in graphs, tables, and figures.
Conference/Value in Health Info
2025-05, ISPOR 2025, Montréal, Quebec, CA
Value in Health, Volume 28, Issue S1
Code
MSR149
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas