Deep Real-World Analysis of Patients With Myelofibrosis Using Natural Language Processing and Machine Learning: A Methods Description

Author(s)

Fulya Sen Nikitas, MSc1, Tim d'Estrube, BSc1, Bella Vo, PharmD2, Paul Juneau, MS3, Julien Graff, PhD, PharmD4, Shiyuan Zhang, BSc, MSc2, Margarita Posso, MD, PhD5, María Lopez, MS5, Marco Garranzo, MS5, Lucia Cabal-Hierro, PhD5, Berta Biescas, PhD5.
1GSK, London, United Kingdom, 2GSK, Collegeville, PA, USA, 3GSK, Boyds, PA, USA, 4GSK, Baar, Zug, Switzerland, 5Savana Research, Madrid, Spain.
OBJECTIVES: Myelofibrosis (MF) is a clinically heterogeneous hematologic malignancy, underscoring the need for comprehensive real-world data (RWD) to inform clinical decision-making. This study employs clinical Natural Language Processing (cNLP) techniques to extract and analyze RWD from electronic health records (EHRs) to characterize patients with MF. Objectives include describing comorbidities, clinical characteristics, treatment patterns, healthcare utilization, clinical outcomes, and factors associated with disease progression and mortality. This protocol serves as an umbrella framework for upcoming studies.
METHODS: This multicenter observational cohort study retrospectively extracts clinical information from structured and unstructured EHR data, covering 2015 to 2027, from hospitals in Spain, the UK, Austria, and France. More than 200 clinical entities will be extracted using EHRead, a dedicated cNLP pipeline that leverages standardized terminologies (SNOMED-CT, ATC, LOINC) to construct predefined clinical variables related to MF. Medical experts will manually annotate EHR text to create a reference standard against which EHRead's performance is evaluated in terms of precision, recall, and F1 score. Data extraction will involve identifying key terms with named-entity recognition and entity-linking models, detecting negation markers, and classifying clinical entities by document section and temporality, along with other specialized cNLP models.
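As an illustration of the evaluation step described in the methods, the following minimal sketch computes precision, recall, and F1 score by comparing pipeline output with expert annotations. It is not EHRead's evaluation code: the annotation format (document ID paired with a variable name), the function name, and the toy data are all assumptions made for this example.

```python
from typing import Set, Tuple

# Hypothetical annotation format: (document_id, clinical_variable).
Annotation = Tuple[str, str]


def precision_recall_f1(gold: Set[Annotation], predicted: Set[Annotation]):
    """Compare predicted annotations with a manually annotated reference set."""
    tp = len(gold & predicted)   # found by both the experts and the pipeline
    fp = len(predicted - gold)   # found by the pipeline only
    fn = len(gold - predicted)   # found by the experts only
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy data for illustration only; not study data.
gold = {
    ("doc1", "myelofibrosis"),
    ("doc1", "splenomegaly"),
    ("doc2", "anemia"),
    ("doc3", "myelofibrosis"),
}
predicted = {
    ("doc1", "myelofibrosis"),
    ("doc2", "anemia"),
    ("doc3", "myelofibrosis"),
    ("doc3", "splenomegaly"),
}

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy case the pipeline misses one expert annotation and adds one spurious annotation, so all three metrics equal 0.75.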
RESULTS: To date, the study protocol has been approved by 6 participating hospitals, with ≈229 patients to be included over an 11-year recruitment period. Preliminary results are expected by December 2025. The precision and F1 score of EHRead in extracting 41 variables used for MF cohort definition are 0.96 and 0.94, respectively. The evaluation and fine-tuning of cNLP models are ongoing.
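The abstract reports precision and F1 score but not recall. Because F1 is the harmonic mean of precision and recall, an approximate recall can be back-calculated from the two reported figures. The sketch below does this under two assumptions that the abstract does not confirm: that both figures describe the same pooled set of annotations (the identity does not hold for metrics averaged per variable), and that rounding to two decimals has little effect. The function name is hypothetical.

```python
def implied_recall(precision: float, f1: float) -> float:
    """Solve F1 = 2PR / (P + R) for R, given P and F1."""
    denominator = 2 * precision - f1
    if denominator <= 0:
        raise ValueError("No valid recall exists for these inputs.")
    return f1 * precision / denominator


# Reported values from the abstract (rounded to two decimals).
recall = implied_recall(precision=0.96, f1=0.94)
print(f"implied recall ~ {recall:.2f}")
```

Under these assumptions the implied recall is approximately 0.92; this is an estimate, not a result reported by the study.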
CONCLUSIONS: The cNLP techniques used in this study will enable large-scale RWD extraction from EHRs, providing insights into the clinical characteristics and management of MF. Although cNLP evaluation is still ongoing, the models tested thus far have demonstrated high performance, supporting their broader applicability in real-world oncology research.
Funding: GSK (study ID: 221275)

Conference/Value in Health Info

2025-11, ISPOR Europe 2025, Glasgow, Scotland

Value in Health, Volume 28, Issue S2

Code

RWD55

Topic

Real World Data & Information Systems

Topic Subcategory

Health & Insurance Records Systems

Disease

Oncology, Rare & Orphan Diseases
