Enriching the Value of Real-World Oncology Data With Important Clinical Information From Unstructured Data Sources for Better Clinical Insight Generation

Speaker(s)

Verma V1, Rastogi M2, Paul A2, Gaur A2, Daral S2, Kukreja I3, Nayyar A2, Roy A2, Khan S1
1Optum, Gurgaon, HR, India, 2Optum, Gurugram, HR, India, 3Optum, New Delhi, DL, India

OBJECTIVES: Structured electronic health record (EHR) data often lack critical clinical information beyond the primary diagnosis. In oncology, information on cancer staging, metastatic status, and biomarkers is essential for designing precision medicine and targeted treatment pathways. The objective was to develop a natural language processing (NLP) model that extracts relevant information from physician notes to enrich existing structured databases.

METHODS: Optum's de-identified Market Clarity Database was used to identify patients with primary colorectal cancer from January 2015 to December 2022. Patients with other concomitant cancer types were excluded. Clinical data elements associated with colorectal cancer (staging, metastatic status, microsatellite instability [MSI], and deficient mismatch repair [dMMR]) were considered for extraction. Physician notes (unstructured EHR data) from Optum's Physician Notes database were used after de-identification by removal of all protected health information (PHI). A small sample of selected notes was manually annotated for clinical concepts and divided into training, test, and validation sets for NLP model development. We employed machine learning-based classification models (named entity recognition [NER] classification) and healthcare embeddings to identify relevant clinical text in patients' notes. Model performance was assessed in terms of precision, recall, and F1 score at both the keyword and instance level for each concept.
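The precision, recall, and F1 metrics named above can be illustrated with a minimal Python sketch of instance-level scoring. This is not the authors' evaluation code: the `Counts` class, the function names, the (note ID, concept) pair representation, and the toy data are all hypothetical, chosen only to show how the three scores relate.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Hypothetical container for per-concept match counts."""
    tp: int  # predicted by the model and present in the manual annotation
    fp: int  # predicted by the model but absent from the annotation
    fn: int  # present in the annotation but missed by the model


def score_instances(gold: set[tuple[str, str]],
                    predicted: set[tuple[str, str]]) -> Counts:
    """Compare (note_id, concept) pairs; the pair format is an assumption."""
    return Counts(
        tp=len(gold & predicted),
        fp=len(predicted - gold),
        fn=len(gold - predicted),
    )


def precision_recall_f1(c: Counts) -> tuple[float, float, float]:
    """Return (precision, recall, F1); each is 0.0 when undefined."""
    precision = c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0
    recall = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data only; these are not values from the study.
gold = {("note1", "stage"), ("note2", "stage"), ("note3", "stage")}
predicted = {("note1", "stage"), ("note2", "stage"), ("note4", "stage")}

counts = score_instances(gold, predicted)
precision, recall, f1 = precision_recall_f1(counts)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Keyword-level scoring could follow the same pattern with (note ID, character span, concept) tuples in place of the pairs, again as an assumption about how the two levels differ.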

RESULTS: Of 2,541 colorectal cancer patients, only 659 (25.9%) had colorectal cancer as their primary cancer type and at least one physician note. A total of 104,360 physician notes were associated with these patients. Of the NLP models evaluated, the model using healthcare embeddings was the most accurate for all concepts. For clinical stage, MSI, and dMMR, the model achieved F1 scores of 0.95, 0.88, and 0.86, respectively.

CONCLUSIONS: The healthcare embeddings model was the better predictor of cancer stage and biomarkers in colorectal cancer. Model performance may differ for other cancer types or other biomarkers; further training on the specific cancer type and biomarkers is required to achieve the desired accuracy.

Code

RWD62

Topic

Clinical Outcomes, Real World Data & Information Systems

Topic Subcategory

Clinical Outcomes Assessment, Data Protection, Integrity, & Quality Assurance

Disease

Oncology