ASSESSING QUALITY OF A LARGE LANGUAGE MODEL (LLM)-DERIVED PROSTATE CANCER (PC) REAL-WORLD DATASET: AN APPLICATION OF THE VALIDATION OF ACCURACY FOR LLM/ML-EXTRACTED INFORMATION AND DATA (VALID) FRAMEWORK

Author(s)

Patrick J. Ward, PhD, MPH, Yunzhi Qian, PhD, MPH, Eunice A. Hankinson, MSN, FNP-C, Aaron Dolor, PhD, Melissa Estevez, MS.
Flatiron Health, New York, NY, USA.

OBJECTIVES: The VALID framework assesses LLM-derived real-world data (RWD) quality across three dimensions: variable-level metrics (VLM), verification checks, and replication analyses. This study applied VALID to a novel, LLM-derived PC dataset to determine suitability for generating real-world evidence (RWE).
METHODS: LLMs selected patients with PC from the US-based, electronic health record-derived, deidentified Flatiron Health Research Database and extracted clinically meaningful characteristics, including initial/metastatic diagnosis, castration-resistant PC (CRPC) or hormone-sensitive PC (HSPC) status, and treatment information. LLM-derived data were compared with an abstracted metastatic PC dataset. For VLM, test sets of 349-500 patients were doubly abstracted. Verification checks assessed the proportion of patients who received >1 line of systemic therapy for metastatic-HSPC (mHSPC); replication assessed real-world overall survival (rwOS) performance in treatment-selected cohorts.
RESULTS: The LLM-derived dataset included 373,524 patients with PC. For VLM, the F1 score for initial diagnosis and date was 2.10% lower for the LLM than for abstractors; for metastatic diagnosis and date it was 2.11% lower; and for CRPC/HSPC status it was 0.52% lower. The percentage of patients having >1 mHSPC line of therapy was 3.3% higher in the LLM-derived dataset than in the abstracted comparator. Replication showed similar rwOS patterns between the abstracted and LLM-derived datasets in treatment-selected cohorts: patients treated with androgen receptor pathway inhibitors (ARPI) during first-line therapy in the metastatic-CRPC (mCRPC) setting had similar median rwOS (months, 95% CI) in the LLM-derived (25.3, 24.9-25.8) and abstracted (24.4, 23.7-25.1) datasets; patients treated with poly(ADP-ribose) polymerase inhibitors (PARPi) during second-line therapy in the mCRPC setting also had similar median rwOS in the two datasets (LLM: 15.8, 13.9-17.2 vs abstracted: 15.9, 14.5-17.8).
CONCLUSIONS: The VALID framework provided a multifaceted approach to assessing LLM-derived RWD quality. Applying the VALID framework to a large PC dataset indicated that LLMs can be used to extract data suitable for generating accurate and reliable RWE.
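
As an illustration of the variable-level comparison described in METHODS, the sketch below computes an F1 score for an LLM and for a second abstractor against a reference abstraction, then reports the gap between them. It is a minimal sketch under stated assumptions, not the study's implementation: the abstract does not give the scoring procedure, so treating one abstraction as the reference, reducing the variable to a binary label, and the names `f1_score`, `reference`, `abstractor`, and `llm` are all hypothetical. The labels are toy values, not study data. Because the abstract does not say whether the reported differences are absolute or relative, both are computed.

```python
from typing import Sequence


def f1_score(reference: Sequence[bool], predicted: Sequence[bool]) -> float:
    """Standard binary F1: harmonic mean of precision and recall."""
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r and p)        # true positives
    fp = sum(1 for r, p in pairs if not r and p)    # false positives
    fn = sum(1 for r, p in pairs if r and not p)    # false negatives
    if tp == 0:
        return 0.0  # convention: F1 is 0 when nothing is correctly identified
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy labels for ten patients (e.g. "has metastatic diagnosis").
# Assumption: one abstraction serves as the reference standard.
reference = [True, True, True, True, False, False, True, False, True, True]
abstractor = [True, True, True, True, False, False, True, False, True, False]
llm = [True, True, True, False, False, True, True, False, True, True]

f1_abstractor = f1_score(reference, abstractor)
f1_llm = f1_score(reference, llm)

# Absolute gap in percentage points, and gap relative to the abstractor F1.
absolute_gap_points = (f1_abstractor - f1_llm) * 100
relative_gap_percent = (f1_abstractor - f1_llm) / f1_abstractor * 100

print(f"Abstractor F1: {f1_abstractor:.3f}")
print(f"LLM F1:        {f1_llm:.3f}")
print(f"Absolute gap:  {absolute_gap_points:.2f} percentage points")
print(f"Relative gap:  {relative_gap_percent:.2f}%")
```

Scoring the abstractor and the LLM against the same reference puts the LLM's error in the context of human inter-abstractor variability, which is the comparison the reported F1 differences express.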

Conference/Value in Health Info

2026-05, ISPOR 2026, Philadelphia, PA, USA

Value in Health, Volume 29, Issue S6

Code

MSR65

Topic

Methodological & Statistical Research

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics

Disease

SDC: Oncology, STA: Multiple/Other Specialized Treatments

Presentation (CTI)