Comparative Evaluation of Sequence-Encoding Strategies for Clustering Inhaled Therapy Pathways in COPD Patients Using Real-World Data
Author(s)
Romane Péan, MSc1, Marie Génin, MSc1, Nina Temam, PharmD1, Diane Vincent, MSc1, Rachel Nadif, PhD2, Sofiane Kab, PharmD, PhD2, Nicolas Roche, MD, PhD3, Pauline Guilmin, MSc1.
1Quinten Health, Paris, France, 2INSERM, Villejuif, France, 3Assistance Publique–Hôpitaux de Paris (AP-HP), Paris, France.
OBJECTIVES: Identifying and comparing patient treatment pathways is critical to inform healthcare decision-making, yet the high variability and complexity of real-world sequences pose methodological challenges. In clustering, sequence encoding plays a pivotal role, but no consensus exists on the optimal strategy. This study compares three encoding approaches applied to real-world therapeutic sequences in COPD to assess their ability to generate clinically meaningful patient clusters.
METHODS: Data were derived from the French CONSTANCES cohort linked to the national health claims database (SNDS). Participants were classified as having COPD via spirometry or questionnaires. Their five-year sequences of inhaled maintenance therapies (mono-, bi-, or triple therapy), mapped using ATC level 7 codes, were then extracted, with temporal rules capturing therapy overlaps and durations. Three encoding approaches were applied: (A) SeqMining: frequent subsequence extraction via the SPADE algorithm, generating binary feature vectors; (B) SeqToChar: character string representation of sequences, with Jaro distance for pairwise similarity; (C) Autoencoder: deep learning model producing continuous embeddings in a reduced latent space. Each encoding was fed into k-medoids clustering, with cluster validity assessed via silhouette scores and UMAP projections. The best-performing method was further examined through trajectory visualizations to assess clinical interpretability.
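As an illustration of approach (B), the sketch below strings together Jaro distance, a basic k-medoids loop, and the silhouette score in plain Python. It is a minimal sketch under stated assumptions, not the study's implementation: the one-letter coding (M = mono-, B = bi-, T = triple therapy, one character per period), the toy sequences, the farthest-first initialisation, and all function names are hypothetical.

```python
def jaro_similarity(s, t):
    """Jaro similarity between two strings, in [0, 1]."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit, t_hit = [False] * len(s), [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        # Look for an unused matching character within the window.
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    s_matched = [c for c, h in zip(s, s_hit) if h]
    t_matched = [c for c, h in zip(t, t_hit) if h]
    transpositions = sum(a != b for a, b in zip(s_matched, t_matched)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


def k_medoids(dist, k, max_iter=100):
    """Alternating k-medoids on a precomputed distance matrix."""
    n = len(dist)

    def assign(medoids):
        return [min(range(k), key=lambda c: dist[i][medoids[c]]) for i in range(n)]

    # Assumed initialisation: most central point, then farthest-first.
    medoids = [min(range(n), key=lambda i: sum(dist[i]))]
    while len(medoids) < k:
        medoids.append(max(range(n), key=lambda i: min(dist[i][m] for m in medoids)))
    for _ in range(max_iter):
        labels = assign(medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # keep the old medoid if a cluster empties
                new_medoids.append(medoids[c])
                continue
            new_medoids.append(min(members, key=lambda i: sum(dist[i][j] for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return assign(medoids), medoids


def silhouette(dist, labels):
    """Mean silhouette width; assumes at least two clusters."""
    n = len(dist)
    scores = []
    for i in range(n):
        own = [j for j in range(n) if labels[j] == labels[i] and j != i]
        if not own:  # singleton clusters score 0 by convention
            scores.append(0.0)
            continue
        a = sum(dist[i][j] for j in own) / len(own)
        b = min(
            sum(dist[i][j] for j in range(n) if labels[j] == c) / labels.count(c)
            for c in set(labels) if c != labels[i]
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / n


# Hypothetical toy pathways, one character per period.
sequences = ["MMMMM", "MMMMB", "MMMBM", "TTTTT", "TTTTB", "BTTTT"]
dist = [[1.0 - jaro_similarity(a, b) for b in sequences] for a in sequences]
labels, medoids = k_medoids(dist, k=2)
score = silhouette(dist, labels)
print(labels, [sequences[m] for m in medoids], round(score, 2))
```

On real data, the number of clusters would typically be chosen by comparing silhouette scores across several values of k rather than fixed in advance as it is here.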
RESULTS: Among 4,982 participants with COPD, 1,926 met the five-year follow-up and treatment criteria. They had two therapeutic combinations on average, with 90% receiving inhaled corticosteroids. All encoding methods yielded interpretable clusters. SeqToChar achieved the highest silhouette score (0.63 vs. 0.54 for SeqMining and 0.58 for Autoencoder), and visual inspection suggested coherent clinical patterns.
CONCLUSIONS: Although SeqToChar showed a slight advantage, performance differences across encoding methods remained limited. Relying solely on technical metrics may not sufficiently support method selection. Introducing predefined clinical relevance criteria could help assess clustering quality from both methodological and real-world perspectives, offering a more comprehensive basis for interpreting patient trajectories in HTA settings.
Conference/Value in Health Info
2025-11, ISPOR Europe 2025, Glasgow, Scotland
Value in Health, Volume 28, Issue S2
Code
MSR56
Topic
Methodological & Statistical Research, Real World Data & Information Systems
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
Respiratory-Related Disorders (Allergy, Asthma, Smoking, Other Respiratory)