Rare-Xtract: A Hybrid Pipeline for Generating Real-World Data Products From Unstructured Electronic Medical Records (EMRs) of Rare Diseases Leveraging Natural Language Processing (NLP) and Keyword Extraction Techniques

Author(s)

James A¹, Has C², Palucci O², Mertz E², Rentsch I²
¹Centogene GmbH, Rostock, MV, Germany, ²Centogene GmbH, Berlin, Berlin, Germany

Presentation Documents

ISPOR_Poster_211024141798.pdf

OBJECTIVES: This study aimed to develop a hybrid pipeline that generates real-world data products from heterogeneous, unstructured physician notes and EMRs across multiple geographies, for use in real-world evidence studies of rare diseases. RARE-XTRACT is a hybrid pipeline that combines natural language processing (NLP), keyword extraction, and classification techniques.

METHODS: The real-world data consist of EMRs encompassing physician notes, lab results, dietary information, medication records, and other observational data, stored in our database in an unstructured format. RARE-XTRACT is designed to handle this complexity and consists of four key modules. 1) Text Extraction: Unstructured text data are extracted using web scraping techniques. 2) Data Classification and Labeling: Extracted data are assigned either to a miscellaneous category or to predefined lists of 65 disease-specific clinical categories, using a combination of a customized Named Entity Recognition (NER) model and keyword extraction. 3) Data Extraction: Entries for the disease-specific clinical categories, including values, units, and dosages, are systematically extracted and compiled into CSV files. 4) Data Preprocessing: Extracted data undergo preprocessing and cleaning to ensure accuracy and consistency.
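
As an illustration only, modules 2 and 3 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: it uses keyword matching alone and omits the customized NER model, and the category names, keyword lists, regular expression, and the helper names `classify`, `extract_value_unit`, `process_note`, and `rows_to_csv` are all hypothetical, since the abstract does not publish them.

```python
import csv
import io
import re

# Hypothetical keyword lists: the abstract does not publish its 65 categories.
CATEGORY_KEYWORDS = {
    "body_weight": ["body weight", "weight"],
    "phenylalanine": ["phenylalanine", "phe level"],
    "medication": ["sapropterin"],
}

# A number, optionally followed by a unit such as "kg" or "mg/dL" (assumed format).
VALUE_UNIT = re.compile(r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>[A-Za-z%/]+)?")

FIELDS = ["category", "value", "unit", "text"]


def classify(sentence):
    """Return the first category whose keyword occurs, else 'miscellaneous'."""
    text = sentence.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return category
    return "miscellaneous"


def extract_value_unit(sentence):
    """Return (value, unit) for the first number found, or (None, None)."""
    match = VALUE_UNIT.search(sentence)
    if match is None:
        return None, None
    return float(match.group("value")), match.group("unit")


def process_note(note):
    """Split a note into sentences, then label each and extract value and unit."""
    rows = []
    for sentence in re.split(r"(?<=[.;])\s+|\n+", note):
        sentence = sentence.strip()
        if not sentence:
            continue
        value, unit = extract_value_unit(sentence)
        rows.append({"category": classify(sentence), "value": value,
                     "unit": unit, "text": sentence})
    return rows


def rows_to_csv(rows):
    """Serialize the extracted rows as CSV text (missing values become empty)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Invented example note, for illustration only.
NOTE = ("Weight 12.5 kg. Phe level 6.2 mg/dL; sapropterin 100 mg daily.\n"
        "Patient tolerated diet well.")
print(rows_to_csv(process_note(NOTE)))
```

Sentences that match no keyword fall into the miscellaneous bucket, mirroring the split described in module 2. A real pipeline would likely need disease-specific rules for dosages and for sentences containing several numbers, which this sketch ignores.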

RESULTS: We applied RARE-XTRACT to all available PDFs for four rare diseases diagnosed between 2013 and 2023 that were identified within our LIMS database, encompassing data from over 500 rare disease patients from different geographies. The pipeline extracted 65 disease-specific clinical categories and their values, covering sub-domains under physical measurement, diet, medication, lab results, and progress notes. The pipeline achieved a precision of 86% for classification and labeling and a recall of 75% for value and unit extraction across all domains and sub-domains.
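
For readers less familiar with the two metrics, the sketch below shows the standard set-based definitions of precision and recall. The abstract does not state how the reported figures were computed (for example, per category or pooled), so the function name `precision_recall`, the item format, and the example data are assumptions made for illustration.

```python
def precision_recall(predicted, gold):
    """Compare pipeline output with a manually annotated reference set.

    precision = correct predictions / all predictions
    recall    = correct predictions / all reference items
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Invented (note_id, category) pairs, for illustration only.
predicted = {(1, "body_weight"), (1, "medication"), (2, "diet"), (3, "lab")}
gold = {(1, "body_weight"), (1, "medication"), (2, "diet"), (2, "lab"), (3, "diet")}
precision, recall = precision_recall(predicted, gold)
```

Here three of the four predicted pairs appear in the reference set, which contains five pairs in total, so precision is 3/4 and recall is 3/5.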

CONCLUSIONS: RARE-XTRACT yields data that can be readily integrated into predictive models or real-world evidence studies, supporting pharmaceutical companies in making informed decisions in rare disease research.

Conference/Value in Health Info

2024-11, ISPOR Europe 2024, Barcelona, Spain

Value in Health, Volume 27, Issue 12, S2 (December 2024)

Code

RWD118

Topic

Medical Technologies, Methodological & Statistical Research, Real World Data & Information Systems, Study Approaches

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics, Data Protection, Integrity, & Quality Assurance, Diagnostics & Imaging, Electronic Medical & Health Records

Disease

Rare & Orphan Diseases
