Accelerating EHR Insights: NLP-Driven Data Abstraction in Gallbladder Cancer
Author(s)
Alisha Monnette, PhD, MPH, Robert Reid, MD, Avi Raju, MS, MPH, Gayathri Namasivayam, PhD, Junxin Shi, PhD, Wendy Haydon, MSN, RN, Wanmei Ou, PhD;
Ontada, Boston, MA, USA
OBJECTIVES: Gallbladder cancer (GBC) is a rare and aggressive malignancy with limited treatment options, making accurate staging and histological classification critical for patient care and research. However, variables such as TNM staging and histologic subtype may have low completion rates in the structured fields of electronic health records (EHRs). To address this, we used natural language processing (NLP) to extract TNM staging and histology data from unstructured EHR documents, with the aim of improving data completeness and accuracy.
METHODS: NLP was applied to unstructured EHR documents, including clinical notes and scanned pathology reports, within iKnowMed (iKM), an EHR system used by US Oncology-affiliated clinics. TNM staging was extracted from progress notes and histology data from pathology reports using optical character recognition and large language models (Azure Document Intelligence and OpenAI GPT-4o). Between 2013 and 2023, 2,019 patients with GBC were identified via structured data; 51.3% (N=1,035) had missing TNM staging and 56.7% (N=1,144) had missing histologic subtyping.
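As a rough illustration of the extraction step, the sketch below pulls a TNM string from note text and keeps only notes dated within 90 days of diagnosis. It is not the authors' pipeline: the study used optical character recognition and large language models, whereas this sketch swaps in a plain regular expression. The names `TNM_PATTERN` and `extract_tnm`, the note format (date, text), and the reading of "within 90 days" as 90 days before or after diagnosis are all assumptions made for the example.

```python
import re
from datetime import date

# Hypothetical pattern for compact TNM strings such as "pT2a N1 M0" or "cT3N0M0".
# A regex stand-in for illustration only; the study used OCR plus an LLM.
TNM_PATTERN = re.compile(
    r"\b(?P<prefix>[cpy]{0,2})"
    r"T(?P<t>is|[0-4][a-d]?|x)\s*"
    r"N(?P<n>[0-3][a-c]?|x)\s*"
    r"M(?P<m>[01][a-c]?|x)\b",
    re.IGNORECASE,
)


def extract_tnm(notes, diagnosis_date, window_days=90):
    """Return the first TNM match from notes dated within the window, else None.

    `notes` is an iterable of (note_date, text) pairs. The window is assumed
    to be symmetric around the diagnosis date.
    """
    for note_date, text in sorted(notes, key=lambda note: note[0]):
        if abs((note_date - diagnosis_date).days) > window_days:
            continue  # note falls outside the assumed 90-day window
        match = TNM_PATTERN.search(text)
        if match:
            return {
                "prefix": match["prefix"].lower(),
                "T": f"T{match['t']}",
                "N": f"N{match['n']}",
                "M": f"M{match['m']}",
            }
    return None


example_notes = [
    (date(2020, 1, 10), "Impression: pT2a N1 M0 adenocarcinoma."),
    (date(2021, 1, 1), "Restaging: cT3N0M0."),
]
print(extract_tnm(example_notes, date(2020, 1, 1)))
```

A pattern like this only catches compact, conventionally written stages; free-text descriptions of stage are the kind of case where a language model would be expected to do better.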
RESULTS: NLP identified at least one TNM staging value for 208 of 1,035 patients (20%) from clinical notes within 90 days of diagnosis, increasing the completion rate for the clinical stage variable from 48.7% to 59%. For histology, NLP identified data for 771 of 1,144 patients (66.8%) within 90 days, raising the completion rate from 43.3% to 81.5%.
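The completion rates above follow from the cohort counts given in the abstract. The sketch below reproduces that arithmetic; the function name `updated_completion_rate` is hypothetical, and it assumes each NLP-identified patient counts as one newly completed record.

```python
def updated_completion_rate(total, missing, recovered):
    """Share of patients with a value once NLP-recovered cases are added."""
    if total == 0 or not 0 <= recovered <= missing <= total:
        raise ValueError("expected 0 <= recovered <= missing <= total and total > 0")
    complete_before = total - missing
    return (complete_before + recovered) / total


TOTAL_PATIENTS = 2019  # GBC patients identified via structured data

# Counts taken from the abstract: missing in structured fields, found by NLP.
staging_rate = updated_completion_rate(TOTAL_PATIENTS, missing=1035, recovered=208)
histology_rate = updated_completion_rate(TOTAL_PATIENTS, missing=1144, recovered=771)

print(f"TNM staging completion after NLP: {staging_rate:.1%}")
print(f"Histology completion after NLP:   {histology_rate:.1%}")
```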
CONCLUSIONS: NLP enhanced the capture of TNM staging and histology data for GBC, addressing gaps in structured EHR-sourced data and supporting the creation of patient cohorts. These results demonstrate the potential for NLP to improve real-world data capture and advance regulatory and clinical research in rare cancers.
Conference/Value in Health Info
2025-05, ISPOR 2025, Montréal, Quebec, CA
Value in Health, Volume 28, Issue S1
Code
RWD36
Topic
Real World Data & Information Systems
Topic Subcategory
Data Protection, Integrity, & Quality Assurance
Disease
SDC: Oncology, SDC: Rare & Orphan Diseases