Accelerating EHR Insights: NLP-Driven Data Abstraction in Gallbladder Cancer

Author(s)

Alisha Monnette, PhD, MPH, Robert Reid, MD, Avi Raju, MS, MPH, Gayathri Namasivayam, PhD, Junxin Shi, PhD, Wendy Haydon, MSN, RN, Wanmei Ou, PhD;
Ontada, Boston, MA, USA
OBJECTIVES: Gallbladder cancer (GBC) is a rare and aggressive malignancy with limited treatment options, making accurate staging and histological classification critical for patient care and research. However, variables such as TNM staging and histologic subtype may have low completion rates in the structured fields of electronic health records (EHRs). To address this, we used natural language processing (NLP) to extract TNM staging and histology data from unstructured EHR documents, with the aim of improving data completeness and accuracy.
METHODS: NLP was applied to unstructured EHR documents, including clinical notes and scanned pathology reports, within iKnowMed (iKM), an EHR system used by US Oncology-affiliated clinics. TNM staging was extracted from progress notes and histology data from pathology reports using optical character recognition and large language models (Azure Document Intelligence and OpenAI GPT-4o). Between 2013 and 2023, 2,019 patients with GBC were identified via structured data; 51.3% (N=1,035) had missing TNM staging and 56.7% (N=1,144) had missing histologic subtyping.
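As a rough illustration of the extraction step, the sketch below pulls a TNM triplet out of free text. It is not the authors' pipeline, which used optical character recognition and large language models; a regular expression is swapped in here as a simple stand-in, and the function name `extract_tnm` and the pattern are assumptions made for this example. It only matches a complete T/N/M triplet, so it would miss notes that record a single component.

```python
import re
from typing import Dict, Optional

# Hypothetical pattern for a complete TNM triplet such as "pT2bN1M0" or
# "cT3 N0 M0". This is an illustrative stand-in, not the LLM-based method
# described in the abstract.
TNM_PATTERN = re.compile(
    r"\b(?P<prefix>[cpy]{0,2})"            # optional clinical/pathologic prefix
    r"T(?P<t>is|x|[0-4][a-d]?)\s*,?\s*"    # tumor category
    r"N(?P<n>x|[0-3][a-c]?)\s*,?\s*"       # node category
    r"M(?P<m>x|[01])\b",                   # metastasis category
    re.IGNORECASE,
)


def extract_tnm(note_text: str) -> Optional[Dict[str, str]]:
    """Return the first TNM triplet found in note_text, or None."""
    match = TNM_PATTERN.search(note_text)
    if match is None:
        return None
    return {
        "prefix": match.group("prefix").lower(),
        "T": "T" + match.group("t").lower(),
        "N": "N" + match.group("n").lower(),
        "M": "M" + match.group("m").lower(),
    }


if __name__ == "__main__":
    print(extract_tnm("Pathology: pT2bN1M0 adenocarcinoma of the gallbladder"))
    print(extract_tnm("Staging: cT3 N0 M0."))
    print(extract_tnm("No staging documented."))
```

A language model can handle partial or narrative staging statements that a fixed pattern like this one cannot.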
RESULTS: NLP identified at least one TNM staging value for 208 of 1,035 patients (20%) from clinical notes within 90 days of diagnosis, increasing the completion rate for the clinical stage variable from 48.7% to 59.0%. For histology, NLP identified data for 771 of 1,144 patients (66.8%) within 90 days of diagnosis, raising the completion rate from 43.3% to 81.5%.
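The completion rates can be reproduced from the reported counts. The sketch below assumes completion rate means the share of the 2,019-patient cohort with a non-missing value, counting NLP-recovered patients as complete; the function name `completion_rate` is introduced only for this example.

```python
def completion_rate(total: int, missing: int, recovered: int = 0) -> float:
    """Percentage of patients with a value, after `recovered` of the
    `missing` patients have been filled in (assumed definition)."""
    return 100.0 * (total - missing + recovered) / total


TOTAL_PATIENTS = 2019  # GBC patients identified via structured data

# TNM staging: 1,035 missing in structured fields, 208 recovered by NLP
tnm_before = completion_rate(TOTAL_PATIENTS, missing=1035)
tnm_after = completion_rate(TOTAL_PATIENTS, missing=1035, recovered=208)

# Histology: 1,144 missing in structured fields, 771 recovered by NLP
hist_before = completion_rate(TOTAL_PATIENTS, missing=1144)
hist_after = completion_rate(TOTAL_PATIENTS, missing=1144, recovered=771)

if __name__ == "__main__":
    print(f"TNM staging: {tnm_before:.1f}% -> {tnm_after:.1f}%")
    print(f"Histology:   {hist_before:.1f}% -> {hist_after:.1f}%")
```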
CONCLUSIONS: NLP enhanced the capture of TNM staging and histology data for GBC, addressing gaps in structured EHR-sourced data and supporting the creation of patient cohorts. These results demonstrate the potential for NLP to improve real-world data capture and advance regulatory and clinical research in rare cancers.

Conference/Value in Health Info

2025-05, ISPOR 2025, Montréal, Quebec, CA

Value in Health, Volume 28, Issue S1

Code

RWD36

Topic

Real World Data & Information Systems

Topic Subcategory

Data Protection, Integrity, & Quality Assurance

Disease

SDC: Oncology, SDC: Rare & Orphan Diseases
