Performance Assessment and Validation of Real-World Response Data Generated Using a Deep Learning-Based Natural Language Processing Model Across Multiple Solid Tumors

Author(s)

Kelly Magee, MS, RN, Qianyu Yuan, PhD, Auriane Blarre, MEng, Aaron B. Cohen, MD, MSCE, Aaron Dolor, PhD, Konstantin Krismer, PhD, Tori Williams, BA, Qianyi Zhang, MS;
Flatiron Health, New York, NY, USA

OBJECTIVES: This study describes the reliability, completeness, and internal validity of a novel machine learning (ML)-generated real-world response (rwR) approach.
METHODS: This study used the nationwide Flatiron Health electronic health record (EHR)-derived, de-identified database. A deep learning-based natural language processing model extracted clinicians’ documentation of changes in disease burden (complete response [CR], partial response [PR], stable disease, progressive disease, or unknown) at imaging timepoints. Data from 18 treatment- and/or biomarker-defined cohorts across 7 solid tumors were used to train the model and to test the correlation between human-abstracted and ML-extracted real-world response rate (rwRR). In 15 cohorts of common solid tumors, the proportion of treated patients with at least 1 assessment, the times to the first, second, and third assessments, and the median number of assessments were examined for the first (1L), second (2L), and third (3L) lines of therapy. Additionally, real-world overall survival (rwOS) was compared between responders (ever achieved CR or PR) and non-responders (never achieved CR or PR) for the most frequent 1L to 3L regimens in each disease (those with ≥30 patients).
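As an illustration of the rwRR comparison described above, the following is a minimal sketch and not the study's implementation. It assumes that rwRR is the share of patients in a cohort who ever had a CR or PR assessment, that the correlation is a cohort-level Pearson coefficient, and that assessments are coded with the hypothetical labels "CR", "PR", "SD", "PD", and "unknown"; none of these details is specified in the abstract.

```python
from math import sqrt

# Assumption: labels that count as a response, per the responder
# definition "ever achieved CR or PR".
RESPONSE_LABELS = {"CR", "PR"}


def is_responder(assessments):
    """Return True if any assessment for the patient is CR or PR."""
    return any(label in RESPONSE_LABELS for label in assessments)


def response_rate(patients):
    """Fraction of patients who are responders.

    `patients` is a list of per-patient assessment label lists.
    Assumption: the denominator is every patient passed in.
    """
    if not patients:
        raise ValueError("cohort is empty")
    return sum(is_responder(p) for p in patients) / len(patients)


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / sqrt(var_x * var_y)
```

In use, `response_rate` would be computed once per cohort from the human-abstracted labels and once from the ML-extracted labels, and `pearson_r` would then be applied to the two lists of cohort-level rates.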
RESULTS: Within the test cohort (n = 4047), the correlation between human-abstracted and ML-extracted rwRR was r = 0.86. The solid tumor cohorts included 3406 to 129 807 treated patients. Across cohorts, 57.8% to 80.6% of patients had at least 1 assessment, with a median of 1 to 3 assessments within 1L, 2L, and 3L. Median times to the first, second, and third assessments for 1L to 3L were 1.9-4.4, 3.8-8.7, and 5.7-12.9 months, respectively. Across all of the most frequent 1L to 3L regimens for each disease, responders in each cohort had significantly longer survival than non-responders (P < .05).
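The completeness metrics reported above (the proportion of patients with at least 1 assessment and the median time to the k-th assessment) can be sketched as follows. This is a hedged illustration only: the abstract does not state how the medians were estimated, so the sketch assumes a simple median among patients who reached the k-th assessment, assessment times recorded in days from the start of a line of therapy, and a conversion of 30.4375 days per month. The function name `assessment_summary` is hypothetical.

```python
from statistics import median

# Assumption: average month length used to convert days to months.
DAYS_PER_MONTH = 30.4375


def assessment_summary(patients, k=1):
    """Summarize assessment completeness for one line of therapy.

    `patients` is a list of per-patient lists of assessment times,
    in days since the start of the line (an assumed representation).
    Returns a tuple:
      - proportion of patients with at least 1 assessment
      - median months to the k-th assessment among patients who have
        at least k assessments, or None if no patient has k
    """
    if not patients:
        raise ValueError("cohort is empty")
    if k < 1:
        raise ValueError("k must be >= 1")
    with_any = sum(1 for times in patients if times)
    kth_months = [
        sorted(times)[k - 1] / DAYS_PER_MONTH
        for times in patients
        if len(times) >= k
    ]
    med = median(kth_months) if kth_months else None
    return with_any / len(patients), med
```

Calling the function with `k=1`, `k=2`, and `k=3` on each line of therapy would give the three median times per line; an analysis that accounts for censoring (for example, a Kaplan-Meier estimate) would need a different approach from this simple median.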
CONCLUSIONS: This study establishes the performance and validity of a novel ML approach for capturing rwR data from EHRs, supporting the efficient and reliable generation of valuable outcome data across large cohorts.

Conference/Value in Health Info

2025-05, ISPOR 2025, Montréal, Quebec, CA

Value in Health, Volume 28, Issue S1

Code

MSR142

Topic

Methodological & Statistical Research

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics

Disease

No Additional Disease & Conditions/Specialized Treatment Areas, SDC: Oncology

Presentation (CTI)