Comparing the Addition of Natural Language Processing (NLP) to US Electronic Health Records (EHRs): Insights Into Extracting Percent Predicted Forced Vital Capacity (ppFVC) From Unstructured Clinical Notes
Author(s)
Monica Silver, PhD, MPH, Kevin Lavelle, BS, Matthew Chang, BS, Randall Thompson, DC, MS, Jennifer Falk, DPM, MS, Maryam Ajose, MPH, Mac Bonafede, PhD, MPH, Anusorn Thanataveerat, DrPH.
Veradigm, Raleigh, NC, USA.
OBJECTIVES: To compare patient counts and percent predicted forced vital capacity (ppFVC) results in EHR data with and without NLP-based extraction from clinical notes among patients with fibrosing interstitial lung disease (F-ILD).
METHODS: Adults (aged ≥18 years) in the Veradigm Network EHR (VNEHR) linked to claims with ≥2 ICD-10-CM diagnosis codes for fibrosis or ≥1 diagnosis code for fibrosis and ILD were identified in 2017-2023. Inclusion criteria were EHR activity for 12 months pre- and post-index, no baseline fibrosis or ILD, and either a ppFVC value in clinical notes or the FVC, age, sex, and height values needed to calculate ppFVC. A rule-based NLP pipeline was used to extract and standardize FVC from patient notes using keywords, parts-of-speech tags, and a hierarchical cascade of regular expressions that were validated by clinical human review and iteratively fine-tuned for accuracy. We analyzed ppFVC numerically and categorically (Excellent: >120%; Normal: 80-120%; Restriction, severe to mild: <50% to 79%) in baseline and follow-up, with and without NLP enhancement.
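As a rough illustration of what a cascade of regular expressions followed by categorization can look like, the Python sketch below tries a few patterns in order, from most to least specific, and maps the first plausible value to the categories named above. It is not the pipeline used in this study: the patterns, the function names `extract_ppfvc` and `categorize`, the 10-200% plausibility bounds, and the choice to treat any value below 80% as restriction are all assumptions made for the example, and the keyword and part-of-speech steps are omitted.

```python
import re
from typing import Optional

# Hypothetical patterns, ordered from most to least specific.
# The (?<!/) lookbehind skips the FEV1/FVC ratio, which is a different measure.
PATTERNS = [
    # e.g. "FVC 2.31 L (68% predicted)"
    re.compile(
        r"(?<!/)\bFVC\b[^\n%]{0,40}?(\d{1,3}(?:\.\d+)?)\s*%\s*(?:of\s+)?pred(?:icted)?",
        re.IGNORECASE,
    ),
    # e.g. "FVC % pred: 91"
    re.compile(
        r"(?<!/)\bFVC\s*%\s*pred(?:icted)?\s*[:=]?\s*(\d{1,3}(?:\.\d+)?)",
        re.IGNORECASE,
    ),
    # e.g. "ppFVC was 125%"
    re.compile(
        r"\bppFVC\b\s*(?:of|is|was|[:=])?\s*(\d{1,3}(?:\.\d+)?)\s*%?",
        re.IGNORECASE,
    ),
]


def extract_ppfvc(note: str) -> Optional[float]:
    """Return the first plausible ppFVC value found in a note, else None."""
    for pattern in PATTERNS:
        match = pattern.search(note)
        if match:
            value = float(match.group(1))
            # Assumed plausibility bounds; not taken from the abstract.
            if 10 <= value <= 200:
                return value
    return None


def categorize(ppfvc: float) -> str:
    """Map a ppFVC percentage to the categories named in the abstract."""
    if ppfvc > 120:
        return "Excellent"
    if ppfvc >= 80:
        return "Normal"
    # Assumption: everything below 80% counts as restriction (severe to mild).
    return "Restriction"


if __name__ == "__main__":
    for text in ["FVC 2.31 L (68% predicted)", "FEV1/FVC 75% predicted"]:
        value = extract_ppfvc(text)
        print(text, "->", value, categorize(value) if value is not None else None)
```

Ordering the patterns from most to least specific means a looser pattern is tried only when the stricter ones find nothing, which is one common way to limit false matches in a rule-based extractor.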
RESULTS: Of the 12,489 and 18,594 patients identified in the VNEHR without and with NLP-enhanced ppFVC extraction, respectively, 43.1% and 42.4% had ≥1 FVC test in baseline, with an average of 1.5 ppFVC tests each. Baseline ppFVC categories showed that 51.4% and 56.6% of patients had evidence of lung restriction. In follow-up, 81.3% and 82.4% of patients had ≥1 FVC test. Patients were followed for a mean of 1,234 and 1,203 days with an average of 3.2 and 3.3 follow-up tests, respectively. Follow-up ppFVC categories were similar to baseline (50.0% and 54.6% of patients with lung restriction).
CONCLUSIONS: The incorporation of NLP nearly doubled the sample size and number of ppFVC tests with consistent ppFVC results across cohorts. This indicates that NLP-based extraction of ppFVC from unstructured EHR notes may be a viable approach to increase sample size and clinical data capture, thereby improving statistical power for downstream analyses.
Conference/Value in Health Info
2025-11, ISPOR Europe 2025, Glasgow, Scotland
Value in Health, Volume 28, Issue S2
Code
RWD41
Topic
Methodological & Statistical Research, Real World Data & Information Systems
Disease
Respiratory-Related Disorders (Allergy, Asthma, Smoking, Other Respiratory)