Comparative Evaluation of Sequence-Encoding Strategies for Clustering Inhaled Therapy Pathways in COPD Patients Using Real-World Data

Author(s)

Romane Péan, MSc¹, Marie Génin, MSc¹, Nina Temam, PharmD¹, Diane Vincent, MSc¹, Rachel Nadif, PhD², Sofiane Kab, PharmD, PhD², Nicolas Roche, MD, PhD³, Pauline Guilmin, MSc¹.
¹Quinten Health, Paris, France; ²INSERM, Villejuif, France; ³Assistance Publique–Hôpitaux de Paris (AP-HP), Paris, France.
OBJECTIVES: Identifying and comparing patient treatment pathways is critical to inform healthcare decision-making, yet the high variability and complexity of real-world sequences pose methodological challenges. In clustering, sequence encoding plays a pivotal role, but no consensus exists on the optimal strategy. This study compares three encoding approaches applied to real-world therapeutic sequences in COPD to assess their ability to generate clinically meaningful patient clusters.
METHODS: Data were derived from the French CONSTANCES cohort linked to the national health claims database (SNDS). Participants were classified as having COPD via spirometry or questionnaires, and their five-year sequences of inhaled maintenance therapies (mono-, bi- or triple therapy), mapped using ATC level 7 codes, were extracted. Temporal rules captured therapy overlaps and durations. Three encoding approaches were applied: (A) SeqMining, frequent subsequence extraction via the SPADE algorithm, generating binary feature vectors; (B) SeqToChar, a character string representation of sequences with Jaro distance for pairwise similarity; and (C) Autoencoder, a deep learning model producing continuous embeddings in a reduced latent space. Each encoding was fed into k-medoids clustering, with cluster validity assessed via silhouette scores and UMAP projections. The best-performing method was further examined through trajectory visualizations to assess clinical interpretability.
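As an illustration of the SeqToChar idea described above, the following minimal Python sketch encodes each pathway as a string, computes pairwise Jaro distances, and clusters them with a simplified alternating k-medoids. It is not the authors' implementation: the one-letter alphabet (M, B, T), the example sequences, the function names, and the `init` parameter are all assumptions made for illustration, and the alternating update is a simpler stand-in for whichever k-medoids variant the study used.

```python
from itertools import combinations


def jaro_similarity(s: str, t: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    # Characters match only if they are equal and within this window.
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_matched = [False] * len(s)
    t_matched = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        lo = max(0, i - window)
        hi = min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_matched[j] and t[j] == ch:
                s_matched[i] = t_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    s_seq = [c for c, m in zip(s, s_matched) if m]
    t_seq = [c for c, m in zip(t, t_matched) if m]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


def jaro_distance_matrix(seqs):
    """Symmetric matrix of Jaro distances (1 - similarity)."""
    n = len(seqs)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        dist[i][j] = dist[j][i] = 1.0 - jaro_similarity(seqs[i], seqs[j])
    return dist


def k_medoids(dist, init, max_iter=100):
    """Simplified alternating k-medoids on a precomputed distance matrix.

    `init` lists the indices of the starting medoids (hypothetical parameter).
    """
    n, k = len(dist), len(init)
    medoids = list(init)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest medoid.
        labels = [min(range(k), key=lambda c: dist[i][medoids[c]])
                  for i in range(n)]
        # Update step: the new medoid minimizes total distance within its cluster.
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # keep the old medoid if a cluster empties
                new_medoids.append(medoids[c])
                continue
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][i] for i in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, labels


# Hypothetical pathways, one character per period:
# M = monotherapy, B = bitherapy, T = triple therapy.
sequences = ["MMMMM", "MMMMB", "MMMBB", "TTTTT", "TTTTB", "BTTTT"]
distances = jaro_distance_matrix(sequences)
medoids, labels = k_medoids(distances, init=[0, 3])
```

In this toy example the mostly-monotherapy and mostly-triple-therapy strings should fall into separate clusters. With real data, the choice of alphabet and period length would likely influence the distances as much as the choice of metric.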
RESULTS: Among 4,982 participants with COPD, 1,926 met the five-year follow-up and treatment criteria. They had two therapeutic combinations on average, with 90% receiving inhaled corticosteroids. All encoding methods yielded interpretable clusters. SeqToChar achieved the highest silhouette score (0.63 vs. 0.54 for SeqMining, 0.58 for Autoencoder), and visual inspection suggested coherent clinical patterns.
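For readers unfamiliar with the silhouette score reported above, the sketch below computes the mean silhouette from a precomputed distance matrix and a list of cluster labels. For each point, a is the mean distance to the other members of its own cluster, b is the lowest mean distance to any other cluster, and the point's silhouette is (b - a) / max(a, b). The function name and the toy matrix are assumptions for illustration; the study's actual tooling is not stated in the abstract.

```python
def silhouette_score(dist, labels):
    """Mean silhouette over all points, given a precomputed distance matrix."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # common convention: a singleton contributes 0
        # a: mean distance to the other members of the same cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(dist[i][j] for j in members) / len(members)
                for other, members in clusters.items() if other != lab)
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / len(labels)


# Toy example (hypothetical values): two tight, well-separated clusters.
toy_dist = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.1],
    [0.9, 0.9, 0.1, 0.0],
]
toy_labels = [0, 0, 1, 1]
score = silhouette_score(toy_dist, toy_labels)
```

Values close to 1 indicate compact, well-separated clusters, while values near 0 indicate overlapping ones. Because the score depends on the distance used, silhouettes computed under different encodings are not measured on a strictly common scale, which is consistent with the caution expressed in the conclusions.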
CONCLUSIONS: Although SeqToChar showed a slight advantage, performance differences across encoding methods remained limited. Relying solely on technical metrics may not sufficiently support method selection. Introducing predefined clinical relevance criteria could help assess clustering quality from both methodological and real-world perspectives, offering a more comprehensive basis for interpreting patient trajectories in HTA settings.

Conference/Value in Health Info

2025-11, ISPOR Europe 2025, Glasgow, Scotland

Value in Health, Volume 28, Issue S2

Code

MSR56

Topic

Methodological & Statistical Research, Real World Data & Information Systems

Topic Subcategory

Artificial Intelligence, Machine Learning, Predictive Analytics

Disease

Respiratory-Related Disorders (Allergy, Asthma, Smoking, Other Respiratory)
