Training and Validation of CARAai™: A Multi-LLM Platform and Data Model to Address Oncology-Specific Challenges in Clinical Data Extraction
Author(s)
Jennifer Rider, ScD1, Vivek P. Vaidya, BS2, Kuldeep Jiwani, BS2, Jeffrey Elton, PhD3, Louis Culot, MLA3, Pyeush Gurha, BS2;
1ConcertAI, Vice President, Real World Evidence Services, Cambridge, MA, USA, 2ConcertAI, Bengaluru, India, 3ConcertAI, Cambridge, MA, USA
OBJECTIVES: In oncology, performance status, tumor characteristics, biomarkers, treatments, and tumor progression or response enable analysis of outcomes and effectiveness. These concepts are derived from the unstructured portion of patient electronic health records (EHRs). Historically, extracting this information relied on time- and resource-intensive human abstraction, which limited study sample sizes and delayed insights until months after the clinical activity. Large language models (LLMs) offer an alternative approach. However, oncology presents a unique challenge because of ambiguous terminology (e.g., "stage 3" may refer to chronic kidney disease or to cancer stage). To enable use of LLMs with performance comparable to human curation, we used the ConcertAI Oncology Real-world data set to train and validate the CARAai™ platform of multiple oncology-tuned LLMs.
METHODS: We validated the CARAai™ models on precision, recall, and F1 score (the harmonic mean of precision and recall) using records from 50,000 patients across 13 solid tumor types, split into an 80% training set and a 20% testing set. The same records, abstracted by oncology-trained human clinical abstractors, served as the gold standard.
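The three metrics named above can be illustrated with a minimal Python sketch. It assumes that each model extraction is scored against the human-abstracted gold standard as a true positive, false positive, or false negative; the abstract does not describe the matching rules, so the `Counts` class, the function names, and the example counts are hypothetical and serve only to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Hypothetical tallies of model output versus the gold standard."""
    tp: int  # model value matches the human-abstracted value
    fp: int  # model produced a value the abstractors did not record
    fn: int  # abstractors recorded a value the model missed


def precision(c: Counts) -> float:
    # Share of model extractions that agree with the gold standard.
    return c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0


def recall(c: Counts) -> float:
    # Share of gold-standard values that the model recovered.
    return c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0


def f1_score(c: Counts) -> float:
    # Harmonic mean of precision and recall.
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Illustrative counts only; these are not figures from the study.
example = Counts(tp=90, fp=10, fn=5)
print(f"precision={precision(example):.3f}")
print(f"recall={recall(example):.3f}")
print(f"F1={f1_score(example):.3f}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a model cannot reach a high F1 score by trading away recall for precision or the reverse.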
RESULTS: For performance status, tumor stage, histology, tumor grade, procedure type, metastatic diagnosis and medication, precision was >0.90 (±0.05), recall ranged from 0.91-0.99, and F1 scores were >0.95. Precision, recall and F1 scores were 0.95, 0.98, and 0.96 for biomarker names, 0.87, 0.84, and 0.85 for biomarker categorical results, and 0.86, 0.94, and 0.90 for biomarker numeric test results.
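As a quick arithmetic check, the reported biomarker F1 scores can be recomputed from the reported precision and recall using the harmonic-mean definition. The sketch below is illustrative only: the dictionary layout and the function name `f1_from` are assumptions, and the inputs are the two-decimal values from the abstract, so small rounding differences are expected.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as stated in the RESULTS paragraph.
reported = {
    "biomarker names": (0.95, 0.98, 0.96),
    "biomarker categorical results": (0.87, 0.84, 0.85),
    "biomarker numeric test results": (0.86, 0.94, 0.90),
}

for element, (p, r, f1_reported) in reported.items():
    f1_computed = f1_from(p, r)
    # Agreement to two decimals means the difference is at most 0.005.
    consistent = abs(f1_computed - f1_reported) <= 0.005
    print(f"{element}: computed F1 = {f1_computed:.3f}, "
          f"reported = {f1_reported:.2f}, consistent = {consistent}")
```

All three recomputed values agree with the reported F1 scores to two decimal places.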
CONCLUSIONS: The CARAai™ LLM suite achieved high precision relative to human curation for key oncology data elements, enabling larger data sets with lower latency. The CARAai™ LLMs will facilitate improved statistical power and timeliness for HEOR and epidemiological studies on outcomes and safety.
Conference/Value in Health Info
2025-05, ISPOR 2025, Montréal, Quebec, CA
Value in Health, Volume 28, Issue S1
Code
RWD116
Topic
Real World Data & Information Systems
Topic Subcategory
Reproducibility & Replicability
Disease
SDC: Oncology