VALIDATION OF AN AGENTIC LARGE LANGUAGE MODEL (LLM) SYSTEM IN THE EXTRACTION STAGE OF A REAL-TIME AI-ASSISTED LIVING SYSTEMATIC LITERATURE REVIEW (REAL-SLR): A SOLUTION TO INSTANT AND EASY ACCESS TO CLINICAL TRIAL DATA (CTD)
Author(s)
Rozee Liu, MSc1, Rhiannon Campden, PhD1, Eddie Xiaole Liu, BSc2, Triston Grayston, BSc3, Oscar Correa, BSc3, Anna Forsythe, MBA, MSc, PharmD1
1Oncoscope-AI, Miami, FL, USA, 2Independent, Toronto, ON, Canada, 3Eviviz Inc., Vancouver, BC, Canada
OBJECTIVES: To address challenges faced by health economics and outcomes research (HEOR) professionals in staying current with clinical trial data (CTD) and the time-intensive nature of de novo systematic literature reviews (SLRs), we evaluated the feasibility of using an agentic large language model (LLM) to develop a REAL-SLR for CTD, assessing extraction accuracy and potential time savings.
METHODS: An agentic LLM system was developed to generate clinical trial data extraction annotations autonomously, without human input. The system combined multiple LLMs (OpenAI GPT-5 and GPT-4.1, Gemini 2.5 Pro, and Claude Sonnet 4.5) in a matrix of processes designed to emulate trained human reviewers: following a standardized annotation manual, decomposing tasks into subtasks, and recording reasoning for traceability. The system implemented a retrieval-augmented generation (RAG) architecture with semantic embeddings, parallel extraction chains aligned to population, intervention/comparator, outcome, and study design (PICOS), and study-type-adaptive prompting. Annotations were generated for 32 extraction variables, and accuracy was evaluated against human annotations of prostate cancer (PC) and breast cancer (BC) publications.
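The parallel PICOS-aligned chains with study-type-adaptive prompting described above might be organized roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the chain names, `build_prompt`, `extract_picos`, and the stand-in `fake_llm` are all hypothetical, and retrieval is assumed to have already produced a list of passages per chain.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical chain names; the abstract does not enumerate the actual chains.
PICOS_CHAINS = ["population", "intervention_comparator", "outcome", "study_design"]


def build_prompt(chain, study_type, passages):
    """Assemble a prompt that adapts to the study type (assumed wording)."""
    context = "\n".join(passages)
    return f"[study type: {study_type}] Extract {chain} fields.\n{context}"


def run_chain(chain, study_type, passages, llm):
    """Run one extraction chain and return (chain name, model output)."""
    return chain, llm(build_prompt(chain, study_type, passages))


def extract_picos(study_type, passages_by_chain, llm):
    """Run all PICOS chains concurrently and collect their outputs by chain."""
    with ThreadPoolExecutor(max_workers=len(PICOS_CHAINS)) as pool:
        futures = [
            pool.submit(run_chain, c, study_type, passages_by_chain.get(c, []), llm)
            for c in PICOS_CHAINS
        ]
        return dict(f.result() for f in futures)


def fake_llm(prompt):
    """Stand-in for a real model call: echoes the first line of the prompt."""
    return prompt.splitlines()[0]
```

In this sketch each chain sees only the passages retrieved for it, so the chains stay independent and can run in parallel; a real system would replace `fake_llm` with actual model calls.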
RESULTS: Our agentic LLM system generated annotations for 32 extraction variables across 3,098 publications (1,200 PC, 1,898 BC). Twelve of the 32 variables achieved accuracy above 90%, and half of those exceeded 95%. As an example, overall survival (OS) extraction comprises 3 variables: OS measure, median, and landmark. Our agentic system strictly adhered to the specified format: OS measure (“OS”), median (months; hazard ratio; confidence interval; p-value), and landmark (percentage; p-value). Accuracy for these 3 variables was 93.62%, 83.6%, and 90.9%, respectively.
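Per-variable accuracy against human annotations could be computed along the following lines. This is an illustrative sketch only: the abstract does not state the matching criterion, so exact match after whitespace and case normalization is an assumption, and the function names and variable keys are hypothetical.

```python
def normalize(value):
    """Collapse whitespace and lowercase (assumed matching criterion)."""
    return " ".join(value.split()).lower()


def per_variable_accuracy(llm_rows, human_rows, variables):
    """Percentage of publications where the LLM value matches the human value."""
    accuracy = {}
    for var in variables:
        matches = sum(
            normalize(llm[var]) == normalize(human[var])
            for llm, human in zip(llm_rows, human_rows)
        )
        accuracy[var] = 100.0 * matches / len(human_rows)
    return accuracy


def count_above(accuracy, threshold):
    """Number of variables whose accuracy is strictly above the threshold."""
    return sum(1 for value in accuracy.values() if value > threshold)
```

Applied to all 32 variables, `count_above(accuracy, 90)` and `count_above(accuracy, 95)` would give the kind of summary counts reported above.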
CONCLUSIONS: The agentic LLM system with a RAG architecture demonstrated high accuracy in extracting clinical trial data from publications. These findings suggest the system can enable real-time clinical data generation, supporting faster evidence development for HEOR decision-making and potentially improving patient access.
Conference/Value in Health Info
2026-05, ISPOR 2026, Philadelphia, PA, USA
Value in Health, Volume 29, Issue S6
Code
MSR221
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
SDC: Oncology