Utilizing NLP to Enhance EHR Data Abstraction Accuracy in Waldenström Macroglobulinemia
Author(s)
Alisha Monnette, PhD, MPH, Robert Reid, MD, Debra Rembert, MSN, RN, Avi Raju, MS, MPH, Gayathri Namasivayam, PhD, Junxin Shi, PhD, Wanmei Ou, PhD
Ontada, Boston, MA, USA
OBJECTIVES: Accurate cancer diagnosis data are critical for research but can be compromised by historical diagnoses that predate care at a healthcare site or by unconfirmed diagnostic terms in EHRs. To improve data completeness and accuracy, natural language processing (NLP) was applied to extract diagnoses and dates from unstructured pathology reports for Waldenström macroglobulinemia (WM), a rare B-cell neoplasm.
METHODS: We developed an NLP pipeline using optical character recognition and a large language model to extract diagnosis names and dates from pathology reports in iKnowMed, an EHR used by US Oncology clinics. The pipeline used Azure Document Intelligence for text extraction and OpenAI’s GPT-4 for information retrieval. From 2013 to 2023, 3,350 patients with a WM diagnosis were identified using structured data. Of these 3,350 patients, 514 had related terms (e.g., lymphoma, myeloma) or a potential second primary, and 880 lacked structured diagnosis dates. Pathology reports for these two subgroups were processed through NLP, with output validated by clinicians against structured data.
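The triage logic described above (extract a diagnosis and date from report text, then exclude reports with no WM term before clinician review) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the abstract does not publish the prompt, term list, or output schema, so `WM_PATTERN`, `first_iso_date`, and `triage` are hypothetical names, and a simple regular expression stands in for the OCR and GPT-4 steps. The input is assumed to be text already extracted from a pathology report.

```python
import re
from datetime import date
from typing import Optional, Tuple

# Hypothetical term pattern; the real pipeline relied on an LLM, not a regex.
WM_PATTERN = re.compile(r"waldenstr[oö]m", re.IGNORECASE)
# Assumes dates appear in ISO form (YYYY-MM-DD); real reports vary widely.
DATE_PATTERN = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def first_iso_date(text: str) -> Optional[date]:
    """Return the first valid ISO date found in the text, if any."""
    for m in DATE_PATTERN.finditer(text):
        try:
            return date(int(m[1]), int(m[2]), int(m[3]))
        except ValueError:
            continue  # skip impossible dates such as 2020-13-40
    return None


def triage(report_text: str) -> Tuple[str, Optional[date]]:
    """Route a report: exclude it if no WM term is found, else send to review."""
    if WM_PATTERN.search(report_text) is None:
        return ("excluded_no_wm_term", None)
    return ("clinician_review", first_iso_date(report_text))


if __name__ == "__main__":
    sample = "Findings consistent with Waldenstrom macroglobulinemia. Collected 2019-04-02."
    print(triage(sample))
```

Only reports routed to `clinician_review` would need manual validation, which is where the reduction in abstraction effort comes from.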
RESULTS: Of the 880 patients missing diagnosis dates, NLP excluded 485 (55%) because their reports contained no WM terms, eliminating the need for clinician validation of those cases. Of the remaining 395, clinician review confirmed 284 with correct diagnoses and dates and excluded 111 (89 for dates outside the study period, 22 for vague B-cell lymphoma diagnoses). Of the 514 patients with related terms, NLP excluded 114 (22%) for missing WM terms. Among the remaining 400, clinician review confirmed 257 WM diagnoses (235 without a second primary, 22 with a second primary) and excluded 143 without a confirmed WM diagnosis. This automation saved approximately 440 hours, reducing data abstraction from six months (manual-only) to one month, achieving a fivefold acceleration.
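The cohort flow reported above can be checked with simple arithmetic. The short script below is only a consistency check on the published counts; the variable names and the `pct` helper are introduced here for illustration and do not come from the study.

```python
def pct(part: int, whole: int) -> int:
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


# Subgroup 1: patients lacking a structured diagnosis date.
missing_dates = 880
excluded_no_wm_dates = 485
reviewed_dates = missing_dates - excluded_no_wm_dates
confirmed_dates = 284
rejected_dates = 89 + 22  # outside study period + vague B-cell lymphoma

# Subgroup 2: patients with related terms or a potential second primary.
related_terms = 514
excluded_no_wm_related = 114
reviewed_related = related_terms - excluded_no_wm_related
confirmed_related = 235 + 22  # without + with a second primary
rejected_related = 143

# Each reviewed group should split exactly into confirmed and rejected cases.
assert reviewed_dates == confirmed_dates + rejected_dates
assert reviewed_related == confirmed_related + rejected_related

print(reviewed_dates, pct(excluded_no_wm_dates, missing_dates))
print(reviewed_related, pct(excluded_no_wm_related, related_terms))
```

The counts reconcile: 395 and 400 cases went to clinician review, and the NLP exclusion rates round to the reported 55% and 22%.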
CONCLUSIONS: NLP improved data accuracy by extracting confirmed WM diagnoses and dates, addressing EHR gaps, and reducing manual abstraction. This streamlined process accelerated timelines and enhanced research quality, demonstrating the potential of NLP for real-world data improvement.
Conference/Value in Health Info
2025-05, ISPOR 2025, Montréal, Quebec, CA
Value in Health, Volume 28, Issue S1
Code
RWD72
Topic
Real World Data & Information Systems
Topic Subcategory
Data Protection, Integrity, & Quality Assurance
Disease
SDC: Oncology, SDC: Rare & Orphan Diseases