Developing an EHR-Based Multifeature Machine Learning Model to Identify Lung Cancer Subtype
Author(s)
Sheenu Chandwani, MPH, PhD1, Vandana Priya, MTech2, Vivek Prabhakar Vaidya, BSc2.
1RWD/E Solutions, ConcertAI, LLC, Cambridge, MA, USA, 2ConcertAI, LLC, Bengaluru, India.
OBJECTIVES: Delineating the lung cancer subtypes, non-small cell lung cancer (NSCLC) and small cell lung cancer (SCLC), is a key research need and a well-documented limitation of claims data. Relying on expert-determined pathologic evidence of subtype from electronic health records (EHRs) is limiting and resource intensive. We established a scalable machine learning (ML)-based model that leverages multiple EHR features to identify lung cancer subtype.
METHODS: EHR notes from the US-representative ConcertAI network were accessed for patients with an ICD-10 C34 code. A training set of 7,914 patients was used. For each EHR document, snippets were labeled as NSCLC or SCLC based on the exact tumor name or its synonyms, stage (extensive or limited for SCLC), or histology (eg, adenocarcinoma for NSCLC). Evidence of subtype was first asserted, then associated temporally and semantically with the primary tumor. A hybrid rules+ML model was then applied at the patient level to integrate evidence and resolve contradictions; if contradictions remained unresolved, no prediction was made. A sample of 50 patients predicted as NSCLC and 50 predicted as SCLC (validation set) was compared with the expert-determined subtype from the EHR. Finally, the model was applied to a larger test cohort, and clinical relevance was assessed via the distribution of systemic treatments.
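The snippet labeling and patient-level integration steps can be illustrated with a minimal Python sketch. Everything in it is an assumption for illustration: the keyword patterns are not the authors' lexicon, the names `label_snippet` and `predict_patient` and the `min_share` threshold are hypothetical, and a simple vote-share rule stands in for the hybrid rules+ML model, whose details the abstract does not give. Assertion checking and temporal or semantic linkage to the primary tumor are not modeled.

```python
import re
from collections import Counter

# Illustrative patterns only (assumed, not the authors' lexicon).
# The lookbehind keeps "small cell" from matching inside "non-small cell".
SUBTYPE_PATTERNS = {
    "NSCLC": re.compile(
        r"\b(nsclc|non[- ]small[- ]cell|adenocarcinoma|squamous cell carcinoma)\b",
        re.I,
    ),
    "SCLC": re.compile(
        r"\b(sclc|(?<!non[- ])small[- ]cell|extensive[- ]stage|limited[- ]stage)\b",
        re.I,
    ),
}


def label_snippet(text):
    """Return 'NSCLC' or 'SCLC' if exactly one subtype pattern matches, else None."""
    hits = [name for name, pattern in SUBTYPE_PATTERNS.items() if pattern.search(text)]
    return hits[0] if len(hits) == 1 else None


def predict_patient(snippets, min_share=0.8):
    """Integrate snippet labels for one patient.

    Stand-in for the hybrid rules+ML step: predict a subtype only when it
    holds at least `min_share` of the labeled evidence; otherwise abstain
    (return None), mirroring "if unresolved, no prediction is made".
    """
    votes = Counter(label for label in map(label_snippet, snippets) if label)
    total = sum(votes.values())
    if total == 0:
        return None  # no usable evidence
    label, count = votes.most_common(1)[0]
    return label if count / total >= min_share else None


if __name__ == "__main__":
    notes = [
        "Biopsy shows adenocarcinoma of the right upper lobe.",
        "Assessment: stage IV non-small cell lung cancer.",
    ]
    print(predict_patient(notes))
```

Abstaining on contradictory evidence trades coverage for precision, which fits the reported "other" category of patients without a subtype prediction.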
RESULTS: For the training set, the model predicted 67.5% (5,346) NSCLC, 17.7% (1,398) SCLC, and 14.8% (1,170) other patients. Expert validation showed precision, recall, and specificity of 0.96, 0.87, and 0.93, respectively, for NSCLC and 0.92, 0.92, and 0.96, respectively, for SCLC. The test set comprised 432,453 patients, and the model predicted 86.8% (375,241) NSCLC, 9.3% (40,324) SCLC, and 3.9% (16,888) other. The top three regimens in the first-line advanced setting were platinum doublet, pembrolizumab+/-chemotherapy, and EGFR TKIs for NSCLC, and etoposide+platinum, atezolizumab/durvalumab+platinum+etoposide, and topoisomerase inhibitors for SCLC.
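For readers who want to check the arithmetic, the following Python sketch gives the standard definitions of precision, recall, and specificity and recomputes each predicted share from the counts reported above. The function names are hypothetical, and the full confusion matrix is not given in the abstract, so only the share calculation uses reported data.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are true positives."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of actual positives that the model identified."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of actual negatives that the model correctly excluded."""
    return tn / (tn + fp)


def shares(counts):
    """Percentage of the cohort in each predicted class, to one decimal place."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


# Predicted-class counts as reported in the abstract.
training_counts = {"NSCLC": 5346, "SCLC": 1398, "other": 1170}
test_counts = {"NSCLC": 375241, "SCLC": 40324, "other": 16888}

if __name__ == "__main__":
    print(sum(training_counts.values()), shares(training_counts))
    print(sum(test_counts.values()), shares(test_counts))
    # Assumption: if all 50 predicted-NSCLC patients were reviewed, a
    # precision of 0.96 corresponds to 48 confirmed and 2 not confirmed.
    print(precision(tp=48, fp=2))
```

Precision depends only on the reviewed predictions, whereas recall and specificity also depend on how patients of the other class were classified, so they cannot be recomputed from the abstract alone.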
CONCLUSIONS: An ML-based model that leverages multiple features from structured and unstructured EHR data can reliably classify NSCLC and SCLC subtypes. Its predictions were validated through alignment with real-world treatment patterns, supporting its utility for studying disease phenotype and associated treatment patterns and outcomes.
Conference/Value in Health Info
2025-11, ISPOR Europe 2025, Glasgow, Scotland
Value in Health, Volume 28, Issue S2
Code
MSR69
Topic
Methodological & Statistical Research, Real World Data & Information Systems
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
Oncology