Beyond Accuracy: Automated De-Identification of Large Real-World Clinical Text Datasets

Author(s)

Kocaman V (1), Talby D (2), Ul Hak H (2)
(1) John Snow Labs, Echt, Netherlands; (2) John Snow Labs, Lewes, DE, USA

OBJECTIVES: This study focuses on bridging the gaps identified in automated de-identification of real-world clinical text, with the aim of enabling secondary use of medical data for various health, research, and public safety purposes. The objective is to introduce a pre-trained Natural Language Processing (NLP) pipeline that achieves superior accuracy on academic benchmarks while delivering practical, large-scale and multi-language solutions. The goal also extends to the development of a method for data obfuscation, replacing Protected Health Information (PHI) with random yet medically relevant surrogates, and defining the key requirements for a real-world de-identification system beyond just accuracy.

METHODS: A hybrid context-based model architecture was developed, combining state-of-the-art NLP with a contextual rule-based engine. The system was trained and tested on over one billion real-world clinical notes, with the assistance of several independent organizations for certification. An innovative method for data obfuscation was devised, focusing on replacing PHI with medically consistent random surrogates while maintaining data integrity. Multi-language support was also implemented, currently supporting seven European languages without any requirement for fine-tuning.
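To illustrate the general idea of a hybrid design, the following is a minimal sketch of how spans from a statistical model and spans from a rule engine might be merged before masking. It is not the authors' implementation: the `Span` type, the two regular expressions in `RULES`, the overlap policy in `merge_spans`, and the hard-coded model output are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int   # inclusive character offset
    end: int     # exclusive character offset
    label: str   # PHI category, e.g. "NAME" or "DATE"
    source: str  # "model" or "rule"


# Hypothetical rule patterns; a real rule engine would be far richer
# and would also look at the surrounding context.
RULES = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def rule_spans(text):
    """Return every span matched by the illustrative regex rules."""
    return [
        Span(m.start(), m.end(), label, "rule")
        for label, pattern in RULES.items()
        for m in pattern.finditer(text)
    ]


def merge_spans(model_spans, rule_based):
    """Union both span sets; on overlap keep the longer span (model wins ties)."""
    candidates = sorted(
        list(model_spans) + list(rule_based),
        key=lambda s: (-(s.end - s.start), s.source != "model", s.start),
    )
    kept = []
    for s in candidates:
        # Accept a span only if it does not overlap anything already kept.
        if all(s.end <= k.start or s.start >= k.end for k in kept):
            kept.append(s)
    return sorted(kept, key=lambda s: s.start)


def mask(text, spans):
    """Replace each non-overlapping span with a <LABEL> placeholder."""
    out, cursor = [], 0
    for s in spans:
        out.append(text[cursor:s.start])
        out.append(f"<{s.label}>")
        cursor = s.end
    out.append(text[cursor:])
    return "".join(out)


text = "John Smith seen on 03/14/2021, call 555-123-4567."
# Stand-in for the output of a trained NER model (assumed, not computed here).
model_spans = [Span(0, 10, "NAME", "model")]
masked = mask(text, merge_spans(model_spans, rule_spans(text)))
```

In this sketch the rules catch well-formatted identifiers that a model could miss, while the model contributes entities that have no fixed surface form, such as names. The "longest span wins" policy is only one plausible way to resolve conflicts.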

RESULTS: The proposed hybrid system outperformed a neural-network-only model by 10%, while making 50%, 475%, and 575% fewer errors than the AWS, Azure, and GCP services, respectively. The system achieved over 98% coverage of sensitive data across the supported languages, making it a leading solution in terms of coverage and accuracy. The obfuscation approach was successful in retaining name, date, gender, clinical, and format consistency in the anonymized documents.
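The consistency properties named above (name, gender, date, and format) can be sketched with deterministic surrogates. The example below is an illustration under stated assumptions, not the system's actual method: the surrogate name pools, the `secret` key, the per-patient date offset, and the function names are all hypothetical.

```python
import hashlib
from datetime import datetime, timedelta

# Hypothetical surrogate pools, split by gender so that replacements
# stay gender-consistent. A real system would use much larger pools.
SURROGATE_NAMES = {
    "F": ["Alice Meyer", "Clara Novak", "Maria Jansen"],
    "M": ["David Bauer", "Peter Larsen", "Tomas Rossi"],
}


def _stable_int(key, secret):
    """Map a key to a large integer, deterministically for a given secret."""
    digest = hashlib.sha256(f"{secret}:{key}".encode("utf-8")).hexdigest()
    return int(digest, 16)


def surrogate_name(name, gender, secret):
    """Same input name always yields the same surrogate of the same gender."""
    pool = SURROGATE_NAMES[gender]
    return pool[_stable_int(name.lower(), secret) % len(pool)]


def shift_date(date_str, patient_id, secret, fmt="%m/%d/%Y", max_days=30):
    """Shift a date by a fixed per-patient offset of 1..max_days days.

    Using one offset per patient preserves the intervals between that
    patient's dates, and re-serializing with the same format string
    preserves the date format.
    """
    offset = 1 + _stable_int(patient_id, secret) % max_days
    shifted = datetime.strptime(date_str, fmt) + timedelta(days=offset)
    return shifted.strftime(fmt)


secret = "demo-secret"  # assumed key; would be kept private in practice
name_a = surrogate_name("Jane Doe", "F", secret)
name_b = surrogate_name("jane doe", "F", secret)
admit = shift_date("03/14/2021", "patient-001", secret)
discharge = shift_date("03/20/2021", "patient-001", secret)
```

Because the offset depends only on the patient identifier and the secret, the six days between admission and discharge in this example are still six days after shifting, which is the kind of clinical consistency the abstract describes.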

CONCLUSIONS: The study combines state-of-the-art accuracy, robust engineering principles, and independently certified real-world applicability. By addressing challenges beyond accuracy, the system fundamentally changes the landscape of clinical data availability. Despite the need for further enhancements for broader, quicker, and cheaper application, the system is already being widely deployed, unlocking opportunities for the safer and compliant secondary use of medical data.

Conference/Value in Health Info

2023-11, ISPOR Europe 2023, Copenhagen, Denmark

Value in Health, Volume 26, Issue 11, S2 (December 2023)

Code

RWD143

Topic

Real World Data & Information Systems

Topic Subcategory

Data Protection, Integrity, & Quality Assurance

Disease

No Additional Disease & Conditions/Specialized Treatment Areas
