A MACHINE LEARNING MODEL FOR CANCER BIOMARKER IDENTIFICATION IN ELECTRONIC HEALTH RECORDS

Author(s)

Ambwani G, Cohen A, Estévez M, Singh N, Adamson B, Nussbaum NC, Birnbaum B
Flatiron Health, New York, NY, USA

Presentation Documents

ml_ambwani_poster.pdf

OBJECTIVES

Identifying biomarker-defined patient cohorts using electronic health record (EHR) data is important for facilitating real-world outcomes research in precision oncology. Human abstraction is needed to find such cohorts because biomarker results are captured in unstructured fields and require interpretation. We aimed to develop a machine learning (ML) classification algorithm that predicts a patient’s biomarker status, in order to reduce the volume of manual abstraction.

METHODS

Patient records and standard-of-care biomarkers from four diseases in the Flatiron Health EHR-derived database were used for model training and testing: metastatic colorectal cancer (mCRC: KRAS, NRAS, BRAF, MSI), metastatic breast cancer (mBreast: ER, PR, HER2), advanced melanoma (aMel: BRAF, KIT, NRAS, PDL1), and advanced non-small cell lung cancer (aNSCLC: EGFR, ALK, ROS1, KRAS, PDL1). Using abstracted biomarker status as labeled data, we trained a regularized logistic regression model on a normalized term frequency vector derived from patient records. The model identifies patients likely to have a positive biomarker; these patients are then sent for confirmatory chart abstraction. We computed sensitivity and abstraction savings (defined as the percent of patient charts not requiring review).
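The modeling step described above can be illustrated with a minimal sketch. The abstract does not specify the tokenization, the normalization scheme, the type or strength of regularization, or the optimizer, so everything below is an assumption made for illustration: whitespace tokenization, counts normalized to sum to 1, an L2 penalty, and plain gradient descent. The vocabulary, the note snippets, and the function names are hypothetical and are not drawn from the Flatiron Health data or code.

```python
import math
from collections import Counter

def term_frequency_vector(text, vocabulary):
    """Count vocabulary terms in `text` and normalize the counts to sum to 1."""
    counts = Counter(text.lower().split())
    total = sum(counts[term] for term in vocabulary)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[term] / total for term in vocabulary]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, w, b):
    """Predicted probability that the biomarker is positive."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def train_logistic_regression(X, y, l2=0.01, lr=0.5, epochs=2000):
    """Batch gradient descent on log-loss with an L2 penalty on the weights.

    The bias term is left unpenalized (an assumption, not from the abstract).
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(xi, w, b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * (gw / n + l2 * wj) for wj, gw in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Hypothetical vocabulary and invented note snippets (1 = biomarker positive).
vocabulary = ["kras", "mutation", "detected", "wild", "type", "negative"]
notes = [
    ("KRAS mutation detected", 1),
    ("KRAS mutation detected in tumor", 1),
    ("KRAS wild type", 0),
    ("KRAS negative wild type", 0),
]
X = [term_frequency_vector(text, vocabulary) for text, _ in notes]
y = [label for _, label in notes]
w, b = train_logistic_regression(X, y)

# Patients scoring at or above a threshold would go to confirmatory abstraction.
# The threshold value here is arbitrary, chosen only for the toy example.
threshold = 0.5
flagged = [predict_proba(x, w, b) >= threshold for x in X]
```

In practice the threshold would presumably be tuned to trade sensitivity against the number of charts sent for review; the abstract does not state how it was chosen.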

RESULTS

We randomly selected 18,100 patients (3,291 mCRC, 2,409 mBreast, 1,329 aMel, 11,071 aNSCLC). The median (IQR) recorded biomarker-positive patient proportion across all disease-biomarker pairs was 4.5% (2.0%-23.8%). There were 4,525 patients in the training set and 13,575 in the test set. Across disease-biomarker pairs, the median (IQR) sensitivity was 97.3% (91.9%-99.6%), and the median (IQR) abstraction savings was 64.2% (23.6%-78.8%).
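For readers who want to see how the two reported metrics relate to model output, the sketch below computes them for a single disease-biomarker pair. It assumes a binary abstracted label per patient and a boolean flag indicating whether the model sent that patient's chart for review; the function names and the ten-patient example are invented for illustration and do not reproduce the study data.

```python
def sensitivity(y_true, flagged):
    """Share of truly positive patients that the model flagged for review."""
    true_pos = sum(1 for t, f in zip(y_true, flagged) if t == 1 and f)
    false_neg = sum(1 for t, f in zip(y_true, flagged) if t == 1 and not f)
    if true_pos + false_neg == 0:
        return float("nan")  # undefined when there are no positive patients
    return true_pos / (true_pos + false_neg)

def abstraction_savings(flagged):
    """Share of patient charts that were not flagged, so need no review."""
    return sum(1 for f in flagged if not f) / len(flagged)

# Invented example: 10 patients, 3 of them biomarker positive.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]
flagged = [True, True, False, False, True, False, False, False, False, False]

pair_sensitivity = sensitivity(y_true, flagged)   # 2 of 3 positives flagged
pair_savings = abstraction_savings(flagged)       # 7 of 10 charts skipped
```

The two quantities pull against each other: flagging more charts raises sensitivity but lowers savings, which is consistent with the wide IQR reported for abstraction savings across pairs with different positive rates.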

CONCLUSIONS

This ML classification model is highly sensitive, permitting increased efficiency in identification of patients’ treatment-relevant biomarkers in EHR data. This enables a scalable method for the creation of biomarker-defined cohorts, reduces the need for costly human chart abstraction, and improves our ability to study real-world outcomes in precision oncology.

Conference/Value in Health Info

2019-05, ISPOR 2019, New Orleans, LA, USA

Value in Health, Volume 22, Issue S1 (2019 May)

Code

PPM8

Topic

Methodological & Statistical Research

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics

Disease

Personalized and Precision Medicine
