IDENTIFYING NON-SMALL CELL LUNG CANCER PATIENTS FROM A COHORT OF HETEROGENEOUS LUNG CANCER PATIENTS USING BOOSTED TREES ON ELECTRONIC HEALTH RECORDS DATA
Chandrashekaraiah P1, Rudeen K1, Agrawal S2, Thiruvenkadam S1, Vaidya VP1, Narayanan B3
1Concerto Health AI, Boston, MA, USA, 2Concerto Health AI, Bangalore, KA, India, 3Concerto Health AI, Bangalore, KA, India
OBJECTIVES: The ability to distinguish between subtypes of lung cancer (LC) is important for clinical outcomes and cost analysis, but this information is seldom captured in structured electronic health record (EHR) data. The objective of this study was to develop and validate an artificial intelligence model to identify non-small cell lung cancer (NSCLC) patients from a cohort of heterogeneous LC patients using de-identified retrospective EHR data. METHODS: Data from patients diagnosed with primary LC in the CancerLinQ database were used to build and test the model. Features drawn from the structured EHR data included medication information, surgery, cancer stage at diagnosis, metastatic status, age, and gender. The model was built using gradient boosting, an algorithm that iteratively combines a set of decision trees into a single model. Of approximately 105,000 LC patients in the database, 56,748 were labelled as either NSCLC (85%) or not NSCLC (15%). Most labels were derived from curation of histology from unstructured notes by expert nurse curators. These data were divided into train (60%), validate (20%), and test (20%) datasets for creating and testing the model. RESULTS: On the test set, the model had an AUC-ROC of 0.93 and an overall accuracy of 93%. For identifying NSCLC patients, the precision (PPV) was 0.93 with a recall of 0.99. The NPV was 0.92. This model compares favourably with a previously developed medication- and test-based NSCLC case-finding algorithm using claims data, which had an AUC of 0.88 (Ralph et al, Front Pharmacol, 2017). CONCLUSIONS: Machine learning methods can be used with structured EHR features and a curated gold standard to develop and validate reliable indicators of clinical status in NSCLC. Compared with expert manual curation, this could save substantial time and effort by quickly identifying patients for retrospective outcomes and cost studies.
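The workflow described in METHODS and RESULTS (a 60/20/20 split, gradient-boosted trees, then AUC, PPV, recall and NPV on the held-out test set) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the patients are synthetic, every feature distribution in `make_patient` is invented, the trees are depth-1 stumps fitted to log-loss residuals, and using the validation set to choose the number of trees is one plausible use of that split rather than something the abstract specifies. All function names are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def make_patient(rng):
    """Synthetic patient; every distribution here is invented for illustration."""
    nsclc = rng.random() < 0.85                                # ~85% NSCLC, as in the cohort
    drug_flag = int(rng.random() < (0.1 if nsclc else 0.8))    # hypothetical medication flag
    surgery = int(rng.random() < (0.3 if nsclc else 0.05))
    stage = rng.choice([1, 2, 3, 4] if nsclc else [3, 4, 4, 4])
    age_decade = rng.randint(4, 8)
    return [drug_flag, surgery, stage, age_decade], int(nsclc)

def fit_stump(X, residuals):
    """Depth-1 regression tree: best (feature, threshold) split by squared error."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X})[:-1]:  # skip the max so both sides are non-empty
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    return best[1:]

def fit_gbm(X, y, n_rounds=30, lr=0.3):
    """Gradient boosting for log loss: each stump is fitted to the residuals y - p."""
    p0 = sum(y) / len(y)
    f0 = math.log(p0 / (1 - p0))          # initial score: log-odds of the base rate
    scores, stumps = [f0] * len(X), []
    for _ in range(n_rounds):
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        j, t, lm, rm = fit_stump(X, residuals)
        stumps.append((j, t, lm, rm))
        scores = [s + lr * (lm if row[j] <= t else rm) for row, s in zip(X, scores)]
    return f0, lr, stumps

def predict_proba(model, row, n_trees=None):
    f0, lr, stumps = model
    score = f0 + sum(lr * (lm if row[j] <= t else rm) for j, t, lm, rm in stumps[:n_trees])
    return sigmoid(score)

def auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    return sum((p > n) + 0.5 * (p == n) for p in pos for n in neg) / (len(pos) * len(neg))

def confusion_metrics(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    ratio = lambda a, b: a / b if b else float("nan")  # undefined when the denominator is 0
    return {"ppv": ratio(tp, tp + fp), "recall": ratio(tp, tp + fn),
            "npv": ratio(tn, tn + fn), "accuracy": ratio(tp + tn, len(y_true))}

rng = random.Random(0)
data = [make_patient(rng) for _ in range(600)]
rng.shuffle(data)
n_train, n_val = int(0.6 * len(data)), int(0.2 * len(data))
train, val, test = data[:n_train], data[n_train:n_train + n_val], data[n_train + n_val:]
X_train, y_train = [x for x, _ in train], [y for _, y in train]
X_val, y_val = [x for x, _ in val], [y for _, y in val]
X_test, y_test = [x for x, _ in test], [y for _, y in test]

model = fit_gbm(X_train, y_train)
# Assumed use of the validation split: pick the tree count with the best validation AUC.
best_n = max(range(1, len(model[2]) + 1),
             key=lambda n: auc(y_val, [predict_proba(model, r, n) for r in X_val]))
test_scores = [predict_proba(model, r, best_n) for r in X_test]
test_auc = auc(y_test, test_scores)
test_metrics = confusion_metrics(y_test, [int(s >= 0.5) for s in test_scores])
print(best_n, round(test_auc, 3), test_metrics)
```

Because the data are synthetic, the printed numbers have no bearing on the results reported above; the sketch only shows how the reported quantities relate to one another. In practice a production library for gradient boosting would replace the hand-rolled stumps.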
Conference/Value in Health Info
2020-05, ISPOR 2020, Orlando, FL, USA
Methodological & Statistical Research
Artificial Intelligence, Machine Learning, Predictive Analytics, Missing Data