DATA-DRIVEN BASELINE MATCHING: ENHANCING INDIRECT COMPARISONS WITH A MACHINE LEARNING-INFORMED FRAMEWORK FOR SELECTING HOMOGENEOUS TRIAL SETS
Author(s)
Saswata Paul Choudhury, MSc, Sekhar K. Dutta, MSc, Subhajit Gupta, MSc
PharmaQuant Insights Private Limited, Kolkata, India
OBJECTIVES: Ensuring baseline comparability across randomized trials is essential for indirect treatment comparisons (ITCs). We present a machine-learning-informed methodology that quantifies similarity across trials and ranks all possible trial combinations, so that investigators can transparently select the most homogeneous pools for downstream comparative work.
METHODS: Ten trial-level baseline characteristics were simulated using appropriate underlying distributions across 8 trials (termed T1-T8). We applied multiple clustering algorithms to identify inherent groupings and computed pairwise dissimilarities between trials in a reduced latent-space representation. For every non-empty subset of trials, we derived complementary subset-level metrics that quantify typical within-group separation (e.g. mean pairwise distance), the maximum internal discordance, and the sample-size-weighted proximity to a pooled centroid. Distance-based metrics were then mapped to bounded similarity indices via a smooth kernel transformation. Penalization was applied to prevent very small trial subsets from being over-favored, as limited pairwise comparisons can exaggerate apparent homogeneity and reduce the robustness of network meta-analysis results.
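The subset-scoring step described in METHODS can be illustrated with a minimal Python sketch. It is not the authors' implementation: the latent-space coordinates, sample sizes, exponential kernel exp(-d/scale), size penalty k/(k+strength), and the equal-weight composite are all assumptions chosen for illustration, since the abstract does not specify them.

```python
import itertools
import math

# Hypothetical latent-space coordinates and sample sizes (illustrative only;
# these are NOT the simulated data from the abstract).
TRIALS = {
    "T1": ((0.0, 2.0), 120), "T2": ((1.1, 0.4), 200),
    "T3": ((1.0, 0.0), 150), "T4": ((3.0, 2.5), 90),
    "T5": ((1.2, 0.1), 180), "T6": ((0.9, 0.3), 210),
    "T7": ((2.8, 0.2), 130), "T8": ((0.2, 3.0), 160),
}

def subset_metrics(names, trials):
    """Return (mean pairwise distance, max pairwise distance,
    sample-size-weighted mean distance to the pooled centroid)."""
    pts = [trials[n][0] for n in names]
    sizes = [trials[n][1] for n in names]
    pair_d = [math.dist(a, b) for a, b in itertools.combinations(pts, 2)]
    mean_d = sum(pair_d) / len(pair_d) if pair_d else 0.0  # singleton: no pairs
    max_d = max(pair_d, default=0.0)
    total = sum(sizes)
    centroid = tuple(
        sum(w * p[i] for p, w in zip(pts, sizes)) / total
        for i in range(len(pts[0]))
    )
    cent_d = sum(w * math.dist(p, centroid) for p, w in zip(pts, sizes)) / total
    return mean_d, max_d, cent_d

def to_similarity(distance, scale=1.0):
    """Assumed smooth kernel: maps a distance in [0, inf) to (0, 1]."""
    return math.exp(-distance / scale)

def size_penalty(k, strength=1.0):
    """Assumed penalty: shrinks scores of very small subsets."""
    return k / (k + strength)

def score_subsets(trials, scale=1.0, strength=1.0):
    """Score every non-empty subset of trials with an assumed composite."""
    rows = []
    names = sorted(trials)
    for k in range(1, len(names) + 1):
        for combo in itertools.combinations(names, k):
            dists = subset_metrics(combo, trials)
            sims = [to_similarity(d, scale) for d in dists]
            composite = size_penalty(k, strength) * sum(sims) / len(sims)
            rows.append({"subset": combo, "size": k,
                         "mean_distance": dists[0], "composite": composite})
    return rows

def best_of_size(rows, k):
    """Highest-scoring subset among those containing exactly k trials."""
    return max((r for r in rows if r["size"] == k),
               key=lambda r: r["composite"])

if __name__ == "__main__":
    rows = score_subsets(TRIALS)
    for k in (3, 4):
        print(k, best_of_size(rows, k)["subset"])
```

Because the assumed penalty depends only on subset size, it changes how subsets of different sizes compare with one another but not the ranking within a given size.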
RESULTS: A total of 255 non-empty trial subsets were analyzed. A dendrogram from the hierarchical clustering was used to visualize trial subset selection pathways. The composite similarity metric suggested 2 optimal trial combinations. Among 3-trial combinations, T3, T5, and T6 were the most cohesive (mean 0.438; similarity 0.566), while among 4-trial combinations, T2, T3, T5, and T6 showed the highest internal consistency (mean 0.433; similarity 0.537).
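As a check on the subset counts, assuming every non-empty subset of the 8 trials was enumerated:

```latex
\[
\sum_{k=1}^{8} \binom{8}{k} = 2^{8} - 1 = 255,
\qquad \binom{8}{3} = 56,
\qquad \binom{8}{4} = 70 .
\]
```

Under this reading, the two reported optima are the top-ranked candidates among 56 three-trial and 70 four-trial combinations, respectively.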
CONCLUSIONS: This methodology provides a framework for scoring and ranking trial combinations and suggesting optimal homogeneous study pools. By providing comparable similarity metrics and visualizations across trial combinations, the approach enables informed pooling decisions and structured sensitivity analyses for indirect comparisons, underscoring the utility of ML-driven methods for balancing trial homogeneity in ITCs. Future validation is required to evaluate effects on bias and precision in comparative effectiveness.
Conference/Value in Health Info
2026-05, ISPOR 2026, Philadelphia, PA, USA
Value in Health, Volume 29, Issue S6
Code
MSR39
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics
Disease
No Additional Disease & Conditions/Specialized Treatment Areas