Enhancing Data Quality in Health Research: Performance Insights of a Clinical NLP Algorithm for Diverse Medical Domains
Author(s)
Oeste C, Van Canneyt J, Kramchaninova A, Sterckx L, Farokhshad N, Bassez I, Pervaiz S, Van Gorp G, Hens D
LynxCare, Leuven, Flemish Brabant, Belgium
OBJECTIVES: This study presents performance insights into our clinical Natural Language Processing (NLP) algorithm, comparing out-of-the-box (OOTB) precision and recall to post-validation metrics. We aim to show how well the algorithm performs OOTB and to demonstrate the crucial role of validation by annotators and physicians in boosting performance across diverse medical domains.
METHODS: The NLP algorithm processes unstructured data from electronic health records (EHRs) of hospitals within our network, focusing on broad-scope data points and specific therapeutic areas (TAs), to generate OMOP-CDM databases that also include structured data sources. Initial OOTB precision and recall were calculated for 312 data points in over 23,000 records. Subsequently, physicians reviewed data point hierarchy and relevance, and the output was validated against a human-generated gold standard. Precision (the share of detected data points that are true positives) and recall (the share of actual data points that are detected) were then measured post-validation. Validation by annotators and physicians ensures comprehensive quality assurance and enhances algorithm performance.
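The precision and recall definitions above can be illustrated with a minimal sketch. It assumes each extraction is represented as a (record, data point) pair and that comparison against the gold standard is an exact match; the function name and the example pairs are hypothetical and are not taken from the study's pipeline.

```python
from typing import Hashable, Iterable, Tuple


def precision_recall(
    detected: Iterable[Hashable], gold: Iterable[Hashable]
) -> Tuple[float, float]:
    """Return (precision, recall) for detected items versus a gold standard."""
    detected_set = set(detected)
    gold_set = set(gold)
    # True positives: items the algorithm detected that are also in the gold standard.
    true_positives = len(detected_set & gold_set)
    # Precision: true positives among all detected items.
    precision = true_positives / len(detected_set) if detected_set else 0.0
    # Recall: true positives among all gold-standard items.
    recall = true_positives / len(gold_set) if gold_set else 0.0
    return precision, recall


# Hypothetical (record_id, data_point) pairs, invented for illustration only.
gold = {
    ("rec1", "diabetes"),
    ("rec1", "hypertension"),
    ("rec2", "atrial fibrillation"),
    ("rec3", "NSCLC"),
}
detected = {
    ("rec1", "diabetes"),
    ("rec2", "atrial fibrillation"),
    ("rec3", "NSCLC"),
    ("rec3", "asthma"),  # false positive: not in the gold standard
}

precision, recall = precision_recall(detected, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy example three of four detected pairs are correct and three of four gold-standard pairs are found, so both values are 0.75. A real evaluation would also need rules for partial matches and for the data point hierarchy, which the abstract does not specify.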
RESULTS: For broad-scope terms (58 data points, 7,207 records), initial OOTB precision is 86.3% and recall 82.1%; post-validation, precision increases to 96.6% and recall to 96.3%. In oncology (193 data points, 13,215 records), initial precision is 82.3% and recall 79.1%, improving to 96.8% precision and 94.3% recall post-validation. In cardiology (62 data points, 2,686 records), initial precision is 88.6% and recall 79.4%, improving to 96.6% precision and 95.6% recall post-validation.
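As a reading aid, the reported figures can be turned into percentage-point gains per domain. The sketch below only restates numbers from the RESULTS paragraph; the dictionary layout and the function name are illustrative assumptions.

```python
# Reported metrics in percent, copied from the RESULTS paragraph.
# Each tuple is (precision, recall).
reported = {
    "broad-scope": {"ootb": (86.3, 82.1), "validated": (96.6, 96.3)},
    "oncology": {"ootb": (82.3, 79.1), "validated": (96.8, 94.3)},
    "cardiology": {"ootb": (88.6, 79.4), "validated": (96.6, 95.6)},
}


def percentage_point_gains(metrics):
    """Return {domain: (precision_gain, recall_gain)} in percentage points."""
    gains = {}
    for domain, values in metrics.items():
        p_before, r_before = values["ootb"]
        p_after, r_after = values["validated"]
        # Round to one decimal to match the precision of the reported figures.
        gains[domain] = (round(p_after - p_before, 1), round(r_after - r_before, 1))
    return gains


for domain, (p_gain, r_gain) in percentage_point_gains(reported).items():
    print(f"{domain}: precision +{p_gain} pp, recall +{r_gain} pp")
```

With these inputs, recall gains (14.2, 15.2, and 16.2 points) exceed precision gains (10.3, 14.5, and 8.0 points) in every domain, which suggests that validation mainly recovers data points the OOTB algorithm missed.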
CONCLUSIONS: Our clinical NLP algorithm significantly enhances real-world data (RWD) quality and integrity by enriching OMOP-CDM datasets with unstructured data. The initial OOTB metrics are promising, and subsequent validation across diverse medical domains confirms the algorithm's effectiveness for continuous data enrichment. Its successful implementation in studies leading to peer-reviewed scientific manuscripts underscores its role in supporting large-scale, cross-institutional research initiatives and contributing to evidence-based medical insights.
Conference/Value in Health Info
Value in Health, Volume 27, Issue 12, S2 (December 2024)
Code
MSR10
Topic
Methodological & Statistical Research, Real World Data & Information Systems, Study Approaches
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics; Data Protection, Integrity, & Quality Assurance; Electronic Medical & Health Records; Reproducibility & Replicability
Disease
Cardiovascular Disorders (including MI, Stroke, Circulatory), Oncology