Developing an EHR-Based Multifeature Machine Learning Model to Identify Lung Cancer Subtype

Author(s)

Sheenu Chandwani, MPH, PhD1, Vandana Priya, MTech2, Vivek Prabhakar Vaidya, BSc2.
1RWD/E Solutions, ConcertAI, LLC, Cambridge, MA, USA, 2ConcertAI, LLC, Bengaluru, India.
OBJECTIVES: Distinguishing the lung cancer subtypes, non-small cell lung cancer (NSCLC) and small cell lung cancer (SCLC), is a key research need and a well-documented limitation of claims data. Relying on expert-determined pathologic evidence of subtype from electronic health records (EHRs) is limiting and resource intensive. We established a scalable machine learning (ML)-based model that leverages multiple EHR features to identify lung cancer subtype.
METHODS: EHR notes from the US-representative ConcertAI network were accessed for patients with a C34 diagnosis code. A training set of 7,914 patients was used. For each EHR document, snippets were labeled as NSCLC or SCLC based on the exact tumor name or its synonyms, stage (extensive or limited for SCLC), or histology (e.g., adenocarcinoma for NSCLC). Evidence of subtype was first asserted, then associated temporally and semantically with the primary tumor. A hybrid rules+ML model was then applied at the patient level to integrate evidence and resolve contradictions; if contradictions remained unresolved, no prediction was made. A sample of 50 patients each predicted as NSCLC and SCLC (validation set) was compared with the expert-determined subtype from the EHR. Finally, the model was applied to a larger test cohort, and clinical relevance was assessed via the distribution of systemic treatments.
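
As an illustration of the snippet labeling and patient-level integration described in the methods, the following Python sketch labels snippets by keyword and combines them with a simple vote. It is not the authors' implementation: the keyword patterns, the function names `label_snippet` and `predict_patient`, and the `min_share` threshold are all hypothetical; a plain agreement rule stands in for the hybrid rules+ML model; and the temporal and semantic association with the primary tumor is omitted.

```python
import re
from collections import Counter
from typing import Iterable, Optional

# Hypothetical patterns; the abstract names only tumor names/synonyms,
# stage (extensive, limited) and histology (e.g., adenocarcinoma) as cues.
NSCLC_PATTERNS = [r"non[- ]small cell", r"\bnsclc\b", r"\badenocarcinoma\b"]
SCLC_PATTERNS = [
    r"(?<!non[- ])small cell",          # "small cell" not preceded by "non-"
    r"\bsclc\b",
    r"\b(extensive|limited)[- ]stage\b",
]


def _matches(patterns: Iterable[str], text: str) -> bool:
    return any(re.search(p, text) for p in patterns)


def label_snippet(snippet: str) -> Optional[str]:
    """Return 'NSCLC', 'SCLC', or None when evidence is absent or mixed."""
    text = snippet.lower()
    is_nsclc = _matches(NSCLC_PATTERNS, text)
    is_sclc = _matches(SCLC_PATTERNS, text)
    if is_nsclc == is_sclc:  # neither cue, or both cues in the same snippet
        return None
    return "NSCLC" if is_nsclc else "SCLC"


def predict_patient(snippets: Iterable[str], min_share: float = 0.8) -> Optional[str]:
    """Combine snippet labels; return None if contradictions stay unresolved."""
    votes = Counter(label for label in map(label_snippet, snippets) if label)
    if not votes:
        return None
    top_label, top_count = votes.most_common(1)[0]
    if top_count / sum(votes.values()) < min_share:
        return None  # mirrors "if unresolved, no prediction is made"
    return top_label
```

Returning `None` rather than forcing a label keeps contradictory patients out of both subtype cohorts, which is the behavior the methods describe for unresolved evidence.
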
RESULTS: For the training set, the model predicted 67.5% (5,346) NSCLC, 17.7% (1,398) SCLC, and 14.8% (1,170) other patients. Expert validation showed precision, recall, and specificity of 0.96, 0.87, and 0.93, respectively, for NSCLC and 0.92, 0.92, and 0.96, respectively, for SCLC. The test set comprised 432,453 patients, of whom the model predicted 86.8% (375,241) as NSCLC, 9.3% (40,324) as SCLC, and 3.9% (16,888) as other. The top three regimens in the first-line advanced setting were platinum doublet, pembrolizumab+/-chemotherapy, and EGFR TKIs for NSCLC, and etoposide+platinum, atezolizumab/durvalumab+platinum+etoposide, and topoisomerase inhibitors for SCLC.
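
The validation metrics can be computed one subtype at a time, treating that subtype as the positive class and all other patients as negative. The Python sketch below assumes this one-vs-rest definition, which the abstract does not state explicitly; the function name `one_vs_rest_metrics` and the toy labels are hypothetical and are not study data.

```python
from typing import Dict, Sequence


def _ratio(numerator: int, denominator: int) -> float:
    # Undefined ratios (empty denominator) are reported as NaN.
    return numerator / denominator if denominator else float("nan")


def one_vs_rest_metrics(
    truth: Sequence[str], pred: Sequence[str], positive: str
) -> Dict[str, float]:
    """Precision, recall, and specificity for one subtype versus all others."""
    pairs = list(zip(truth, pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return {
        "precision": _ratio(tp, tp + fp),    # correct among predicted positives
        "recall": _ratio(tp, tp + fn),       # found among true positives
        "specificity": _ratio(tn, tn + fp),  # correct among true negatives
    }


# Toy, made-up labels purely to exercise the function (not study data).
expert = ["NSCLC", "NSCLC", "NSCLC", "SCLC", "SCLC", "other"]
model = ["NSCLC", "NSCLC", "SCLC", "SCLC", "SCLC", "NSCLC"]
nsclc_metrics = one_vs_rest_metrics(expert, model, positive="NSCLC")
sclc_metrics = one_vs_rest_metrics(expert, model, positive="SCLC")
```
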
CONCLUSIONS: An ML-based model that leverages multiple features from structured and unstructured EHR data can reliably classify NSCLC and SCLC subtypes. Its predictions align with real-world treatment patterns, supporting its utility for studying disease phenotype and associated treatment patterns and outcomes.

Conference/Value in Health Info

2025-11, ISPOR Europe 2025, Glasgow, Scotland

Value in Health, Volume 28, Issue S2

Code

MSR69

Topic

Methodological & Statistical Research, Real World Data & Information Systems

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics

Disease

Oncology
