IDENTIFYING NON-SMALL CELL LUNG CANCER PATIENTS FROM A COHORT OF HETEROGENEOUS LUNG CANCER PATIENTS USING BOOSTED TREES ON ELECTRONIC HEALTH RECORDS DATA

Author(s)

Chandrashekaraiah P¹, Rudeen K¹, Agrawal S², Thiruvenkadam S¹, Vaidya VP¹, Narayanan B³
¹Concerto Health AI, Boston, MA, USA, ²Concerto Health AI, Bangalore, KA, India, ³Concerto Health AI, Bangalore, KA, India

OBJECTIVES: The ability to distinguish between subtypes of lung cancer (LC) is important for clinical outcomes and cost analysis, but this information is seldom captured in structured electronic health record (EHR) data. The objective of this study was to develop and validate an artificial intelligence model to identify non-small cell lung cancer (NSCLC) patients from a cohort of heterogeneous LC patients using de-identified retrospective EHR data.

METHODS: Data from patients diagnosed with primary LC in the CancerLinQ database were used to build and test the model. Features drawn from the structured EHR data included medication information, surgery, cancer stage at diagnosis and metastatic status, age, and gender. The model was built using gradient boosting, an algorithm that iteratively combines a set of decision trees into a single model. Of the approximately 105,000 LC patients in the database, 56,748 were labelled as either NSCLC (85%) or not NSCLC (15%). Most labels were derived from curation of histology from unstructured notes by expert nurse curators. The labelled data were divided into training (60%), validation (20%), and test (20%) datasets for creating and testing the model.
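
The gradient boosting and 60/20/20 split described above can be illustrated with a minimal sketch. This is not the study's implementation: it uses one-split trees (stumps) fitted to log-loss residuals, and the two numeric features, the labelling rule, the sample size, and every function name are hypothetical stand-ins with no clinical meaning. The rule is chosen only so that roughly four in five synthetic records are positive, loosely mirroring the class imbalance.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stump(X, targets):
    """Least-squares regression stump: one feature, one threshold, two leaf values."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, targets) if row[j] <= t]
            right = [r for row, r in zip(X, targets) if row[j] > t]
            if not left or not right:
                continue  # skip thresholds that leave one side empty
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    return best[1:]

def stump_value(stump, row):
    j, t, lm, rm = stump
    return lm if row[j] <= t else rm

def fit_boosted(X, y, n_rounds=30, learning_rate=0.3):
    """Each round fits a stump to the residuals (y - p), the negative gradient
    of the log loss, and adds a shrunken copy of it to the running score."""
    p = sum(y) / len(y)
    base = math.log(p / (1 - p))  # start from the log-odds of the positive class
    scores = [base] * len(X)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        scores = [s + learning_rate * stump_value(stump, row)
                  for s, row in zip(scores, X)]
    return base, learning_rate, stumps

def predict_proba(model, row):
    base, lr, stumps = model
    return sigmoid(base + lr * sum(stump_value(s, row) for s in stumps))

def accuracy(model, X, y):
    hits = sum((predict_proba(model, row) >= 0.5) == bool(yi) for row, yi in zip(X, y))
    return hits / len(y)

# Synthetic, hypothetical data: two numeric features and an arbitrary labelling rule.
rng = random.Random(0)
X = [[rng.random(), rng.random()] for _ in range(300)]
y = [1 if a + b > 0.6 else 0 for a, b in X]

# Random 60/20/20 split into training, validation, and test sets.
idx = list(range(len(X)))
rng.shuffle(idx)
n_train, n_val = len(idx) * 6 // 10, len(idx) * 2 // 10
train, val, test = idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

model = fit_boosted([X[i] for i in train], [y[i] for i in train])
val_acc = accuracy(model, [X[i] for i in val], [y[i] for i in val])
test_acc = accuracy(model, [X[i] for i in test], [y[i] for i in test])
print(f"validation accuracy: {val_acc:.2f}, test accuracy: {test_acc:.2f}")
```

In practice the validation set would typically guide choices such as the number of boosting rounds or the learning rate, with the test set held back for a single final evaluation.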

RESULTS: On the test set, the model had an AUC-ROC of 0.93 and an overall accuracy of 93%. For identifying NSCLC patients, the precision (PPV) was 0.93 with a recall of 0.99. The NPV was 0.92. This model compares favourably with a previously developed medication- and test-based NSCLC case-finding algorithm using claims data, which had an AUC of 0.88 (Ralph et al, Front Pharmacol, 2017).
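
For readers less familiar with the reported metrics, the sketch below shows how accuracy, precision (PPV), recall, NPV, and AUC-ROC are conventionally computed. The confusion-matrix counts and scores are hypothetical illustrations, not the study's data, and the function names are assumptions introduced here.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics from confusion-matrix counts, with NSCLC as the positive class."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision_ppv": tp / (tp + fp),  # of predicted positives, fraction truly positive
        "recall": tp / (tp + fn),         # of true positives, fraction found
        "npv": tn / (tn + fn),            # of predicted negatives, fraction truly negative
    }

def auc_roc(scores_pos, scores_neg):
    """AUC-ROC as the probability that a random positive outscores a random
    negative, counting ties as one half."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical counts and scores, chosen only to exercise the formulas.
metrics = classification_metrics(tp=90, fp=10, tn=40, fn=10)
auc = auc_roc(scores_pos=[0.9, 0.8, 0.4], scores_neg=[0.5, 0.3])
print(metrics, auc)
```

With a positive class as common as 85% of the cohort, precision and recall for that class can be high even when negatives are harder to identify, which is why NPV and AUC-ROC are worth reading alongside them.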

CONCLUSIONS: Machine learning methods can be used with structured EHR features and a curated gold standard to develop and validate reliable indicators of clinical status in NSCLC. Compared with expert manual curation, this could save substantial time and effort by quickly identifying patients for retrospective outcomes and cost studies.

Conference/Value in Health Info

2020-05, ISPOR 2020, Orlando, FL, USA

Value in Health, Volume 23, Issue 5, S1 (May 2020)

Code

PCN294

Topic

Methodological & Statistical Research

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics, Missing Data

Disease

Oncology
