Win Ratio in the Presence of Censored Data: Can Probabilistic Variants Improve Robustness?
Author(s)
Mateusz Nikodem, PhD1, Julita Janik, B.Sc.2, Michal Kochmanski, B.Sc.2, Sylvaine Barbier, MSc3, Paulina Pierzchala, PhD4.
1Director, RWE & Biostatistics, Putnam, Cracow, Poland, 2AGH University of Science and Technology, Cracow, Poland, 3Putnam, Lyon, France, 4Putnam, Cracow, Poland.
OBJECTIVES: To critically assess the validity of win ratio (WR) methods in the presence of substantial early censoring, and to propose conceptual extensions, particularly probabilistic variants, that may mitigate the resulting biases.
METHODS: We focus on hierarchical composite endpoints in which time-to-event outcomes, subject to censoring, sit at the top of the hierarchy. In such settings, the transitivity of pairwise comparisons (i.e., if patient A wins over B and B wins over C, then A should win over C) may not hold. This loss of transitivity can lead to counterintuitive results, particularly when censoring occurs early or asymmetrically across treatment arms. For instance, subgroup-level WRs may favor treatment X over Y, while the overall population-level WR may indicate the reverse. This may happen even when subgroups are perfectly balanced, highlighting a Simpson’s paradox-like phenomenon specific to WR.
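The loss of transitivity can be illustrated with a minimal sketch. It assumes the commonly described pairwise rule (a patient wins on an outcome only if the opponent's event is observed while the patient is still known to be event-free) and a two-level hierarchy of death, then hospitalization. The three patients, the names `Outcome`, `Patient`, `compare_outcome` and `compare`, and all times are hypothetical and are not taken from the study.

```python
from typing import NamedTuple


class Outcome(NamedTuple):
    time: float   # event time, or censoring time when event is False
    event: bool   # True if the event was observed


class Patient(NamedTuple):
    death: Outcome   # top of the hierarchy
    hosp: Outcome    # second level


def compare_outcome(a: Outcome, b: Outcome) -> int:
    """Return +1 if a wins, -1 if b wins, 0 if the pair is undetermined."""
    if b.event and a.time > b.time:   # b's event seen while a still event-free
        return 1
    if a.event and b.time > a.time:   # a's event seen while b still event-free
        return -1
    return 0                          # censoring leaves the order unknown


def compare(a: Patient, b: Patient) -> int:
    """Walk the hierarchy; the first decided level settles the pair."""
    for level in ("death", "hosp"):
        result = compare_outcome(getattr(a, level), getattr(b, level))
        if result != 0:
            return result
    return 0


# Hypothetical patients (times in arbitrary units).
X = Patient(death=Outcome(10, False), hosp=Outcome(1, True))
Y = Patient(death=Outcome(8, True), hosp=Outcome(8, False))
Z = Patient(death=Outcome(5, False), hosp=Outcome(3, True))

print(compare(X, Y), compare(Y, Z), compare(X, Z))
```

In this constructed case X wins over Y on death and Y wins over Z on hospitalization (Z's early censoring leaves the death comparison undetermined), yet Z wins over X on hospitalization, so the three comparisons form a cycle.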
We conceptually developed and examined several variants of WR, focusing especially on probabilistic approaches. In these, for patient pairs where one or both patients are censored, a probability distribution over win/tie/loss is estimated rather than a single discrete win/tie/loss classification being assigned. To assess robustness, we tested these variants in illustrative scenarios with varying censoring rates and imbalances between study arms. Additionally, we introduced a “subgroup challenge” to evaluate whether paradoxical subgroup-level effects persist under each analytical approach.
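One way such a probabilistic comparison could look is sketched below for a single time-to-event outcome. The abstract does not specify how the win/tie/loss distribution is estimated, so everything here is an assumption made for illustration: event times are taken to be exponential with a known, arm-specific hazard (`rate_a`, `rate_b`), censoring is treated as non-informative, and the function names `pair_probs` and `probabilistic_win_ratio` are hypothetical.

```python
import math


def pair_probs(a, b, rate_a, rate_b):
    """Return (P(win), P(tie), P(loss)) for patient a versus patient b.

    Each patient is (time, event): an observed event time, or a censoring
    time when event is False. Assumption: exponential event times, so a
    patient still event-free at time c survives to t > c with probability
    exp(-rate * (t - c)).
    """
    (ta, ea), (tb, eb) = a, b
    if ea and eb:                       # both events observed: no uncertainty
        if ta > tb:
            return (1.0, 0.0, 0.0)
        if ta < tb:
            return (0.0, 0.0, 1.0)
        return (0.0, 1.0, 0.0)
    if eb and not ea:                   # a censored, b had an event at tb
        if ta >= tb:                    # a known event-free at tb
            return (1.0, 0.0, 0.0)
        w = math.exp(-rate_a * (tb - ta))
        return (w, 0.0, 1.0 - w)
    if ea and not eb:                   # mirror case: a had an event at ta
        if tb >= ta:
            return (0.0, 0.0, 1.0)
        loss = math.exp(-rate_b * (ta - tb))
        return (1.0 - loss, 0.0, loss)
    # Both censored: residual lifetimes are independent exponentials.
    total = rate_a + rate_b
    if ta <= tb:
        w = math.exp(-rate_a * (tb - ta)) * rate_b / total
    else:
        w = 1.0 - math.exp(-rate_b * (ta - tb)) * rate_a / total
    return (w, 0.0, 1.0 - w)


def probabilistic_win_ratio(treated, control, rate_t, rate_c):
    """Expected wins divided by expected losses over all cross-arm pairs."""
    wins = losses = 0.0
    for a in treated:
        for b in control:
            w, _, l = pair_probs(a, b, rate_t, rate_c)
            wins += w
            losses += l
    return wins / losses
```

In a real analysis the hazards would have to be estimated from the data, and the result would inherit any misspecification of that model; the exponential form is used here only because it gives closed-form pair probabilities.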
RESULTS: Across theoretical examples, WR methods showed high sensitivity to censoring, yielding divergent or contradictory conclusions. Probabilistic variants demonstrated greater stability and were less prone to dramatic reversals in interpretation.
CONCLUSIONS: WR should be interpreted with caution in the presence of early or imbalanced censoring. Probabilistic approaches may offer a more robust alternative; however, their validity depends on the assumptions and accuracy of the underlying models. Sensitivity analyses should be conducted to assess the stability of conclusions under alternative analytical approaches, especially in the context of censoring.
Conference/Value in Health Info
2025-11, ISPOR Europe 2025, Glasgow, Scotland
Value in Health, Volume 28, Issue S2
Code
MSR225
Topic
Clinical Outcomes, Health Technology Assessment, Methodological & Statistical Research
Topic Subcategory
Confounding, Selection Bias Correction, Causal Inference, Missing Data
Disease
No Additional Disease & Conditions/Specialized Treatment Areas