Rare-Xtract: A Hybrid Pipeline for Generating Real-World Data Products From Unstructured Electronic Medical Records (EMRs) of Rare Diseases Leveraging Natural Language Processing (NLP) and Keyword Extraction Techniques

Author(s)

James A1, Has C2, Palucci O2, Mertz E2, Rentsch I2
1Centogene GmbH, Rostock, MV, Germany, 2Centogene GmbH, Berlin, Berlin, Germany

OBJECTIVES: The aim of the study is to develop a hybrid pipeline that generates real-world data products from heterogeneous, unstructured physician notes and EMRs across multiple geographies for real-world evidence studies in rare diseases. RARE-XTRACT is a hybrid pipeline that combines natural language processing (NLP), keyword extraction, and classification techniques.

METHODS: The real-world data (EMRs encompassing physician notes, lab results, dietary information, medication records, and other observational data) are stored in our database in an unstructured format. RARE-XTRACT is designed to handle this complexity and consists of four key modules. 1) Text Extraction: Unstructured text data are extracted using web scraping techniques. 2) Data Classification and Labeling: Extracted data are assigned either to one of a predefined list of 65 disease-specific clinical categories or to a miscellaneous category, using a combination of a customized Named Entity Recognition (NER) model and a keyword extraction technique. 3) Data Extraction: For each disease-specific clinical category, the values, units, and dosages are systematically extracted and compiled into CSV files. 4) Data Preprocessing: Extracted data undergo preprocessing and cleaning to ensure accuracy and consistency.
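As a rough illustration of modules 2 and 3, the keyword-based part of the labeling step and the value and unit extraction could be sketched as follows. This is a minimal sketch under stated assumptions, not the RARE-XTRACT implementation: the category names, keyword lists, regular expression, and function names are all hypothetical, and the customized NER model is not reproduced here.

```python
import csv
import io
import re

# Hypothetical keyword lists; the abstract does not publish the 65 categories.
CATEGORY_KEYWORDS = {
    "body_weight": ["weight", "body weight"],
    "hemoglobin": ["hemoglobin", "hgb"],
    "medication_dose": ["dose", "dosage"],
}

# A number followed by an optional unit token, e.g. "12.5 g/dL" or "70 kg".
VALUE_UNIT = re.compile(
    r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>[A-Za-z%][A-Za-z/%]*)?"
)


def classify(sentence):
    """Return the first category whose keyword occurs as a whole word."""
    text = sentence.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(re.search(r"\b" + re.escape(k) + r"\b", text) for k in keywords):
            return category
    return "miscellaneous"


def extract_value_unit(sentence):
    """Return (value, unit) for the first number found, else (None, None)."""
    match = VALUE_UNIT.search(sentence)
    if match is None:
        return None, None
    return float(match.group("value")), match.group("unit")


def to_csv(sentences):
    """Compile category, value, and unit for each sentence into CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["category", "value", "unit", "text"])
    for sentence in sentences:
        value, unit = extract_value_unit(sentence)
        writer.writerow([classify(sentence), value, unit, sentence])
    return buffer.getvalue()
```

Calling `to_csv(["Body weight: 70 kg", "Hemoglobin 12.5 g/dL"])` would produce one CSV row per sentence. A production pipeline would presumably need far more robust handling of dates, ranges, and multi-value sentences than this first-number heuristic provides.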

RESULTS: We applied RARE-XTRACT to all available PDFs for four rare diseases diagnosed between 2013 and 2023 that were identified in our LIMS database, covering more than 500 rare disease patients from different geographies. The pipeline extracted 65 disease-specific clinical categories and their values, spanning sub-domains under physical measurement, diet, medication, lab results, and progress notes. It achieved a precision of 86% for classification and labeling and a recall of 75% for value and unit extraction across all domains and sub-domains.
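For readers less familiar with the two reported metrics, the sketch below shows how precision and recall are conventionally computed from gold-standard and predicted labels. The function name and the toy labels are hypothetical and are not taken from the study data; the abstract does not describe how its evaluation set was constructed.

```python
def precision_recall(gold, predicted, positive):
    """Compute precision and recall for one label over paired annotations."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if p == positive and g == positive)
    fp = sum(1 for g, p in pairs if p == positive and g != positive)
    fn = sum(1 for g, p in pairs if p != positive and g == positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy labels for illustration only (not study data).
gold = ["lab", "lab", "diet", "lab"]
predicted = ["lab", "diet", "diet", "lab"]
lab_precision, lab_recall = precision_recall(gold, predicted, "lab")
```

Precision answers how many of the items assigned a label were correct, while recall answers how many of the true items were found, which is why the two figures can differ substantially for the same pipeline.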

CONCLUSIONS: RARE-XTRACT yields data that can be integrated directly into predictive models or real-world evidence studies, helping pharmaceutical companies make informed decisions in rare disease research.

Conference/Value in Health Info

2024-11, ISPOR Europe 2024, Barcelona, Spain

Value in Health, Volume 27, Issue 12, S2 (December 2024)

Code

RWD118

Topic

Medical Technologies, Methodological & Statistical Research, Real World Data & Information Systems, Study Approaches

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics, Data Protection, Integrity, & Quality Assurance, Diagnostics & Imaging, Electronic Medical & Health Records

Disease

Rare & Orphan Diseases
