Comparing Individualized Treatment Effect Inference Methods Through a Simulation Study
Author(s)
Diane Vincent, Antoine Movschin, MSc, Tristan Fauvel, PhD.
Quinten Health, Paris, France.
OBJECTIVES: Randomized controlled trials (RCTs) are the gold standard for estimating the average treatment effect (ATE), based on the equipoise principle and proper trial size and design, but they are typically not geared towards individual effect estimation. Conversely, the richness and abundance of real-world data (RWD) offer an opportunity to estimate the individualized treatment effect (ITE), albeit limited by the biases induced by violated causal assumptions. While a variety of machine learning (ML)-based methods exist to estimate the conditional average treatment effect (CATE) with observational data, no guideline helps practitioners choose the best-suited method, and systematic comparisons using unbiased benchmark datasets remain limited. This simulation study aims to guide method selection by comparing a range of approaches across diverse performance metrics and constraints.
METHODS: A set of representative ML-based CATE estimation methods, including meta-learners, tree-based, deep learning, and Bayesian methods, is evaluated. The simulation study relies on a data-generating process (DGP) in which the treatment effect is controlled and known, and emulates a variety of realistic scenarios by imposing different constraints on sample size, CATE heterogeneity, covariate overlap, and confounding (both observed and unobserved), among others. The methods are assessed with the metrics most frequently mentioned in the literature, including standard ML metrics, the Precision in Estimating Heterogeneous Effect (PEHE), and its approximations.
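As an illustration only, the kind of evaluation described above can be sketched on a toy problem. Everything in this sketch is an assumption made for the example and is not taken from the study: the DGP (a single confounder `x`, a logistic propensity, a linear true CATE `1 + 2x`), the choice of a T-learner with linear outcome models, and the helper names `simulate`, `fit_line`, `t_learner_cate`, and `root_pehe`. PEHE is computed here in its root-mean-squared form, the square root of the mean of (estimated CATE - true CATE)^2, which is only available because the simulation exposes the true effect.

```python
import math
import random


def simulate(n, seed=0):
    """Hypothetical DGP: one confounder x drives both treatment and outcome."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        propensity = 1.0 / (1.0 + math.exp(-x))  # observed confounding
        t = 1 if rng.random() < propensity else 0
        tau = 1.0 + 2.0 * x  # true CATE, known by construction
        y = x + t * tau + rng.gauss(0.0, 0.5)
        rows.append((x, t, y, tau))
    return rows


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def t_learner_cate(rows):
    """T-learner: one outcome model per arm, CATE = mu1(x) - mu0(x)."""
    x0 = [x for x, t, _, _ in rows if t == 0]
    y0 = [y for _, t, y, _ in rows if t == 0]
    x1 = [x for x, t, _, _ in rows if t == 1]
    y1 = [y for _, t, y, _ in rows if t == 1]
    a0, b0 = fit_line(x0, y0)
    a1, b1 = fit_line(x1, y1)
    return [(a1 + b1 * x) - (a0 + b0 * x) for x, _, _, _ in rows]


def root_pehe(estimated, true):
    """Root-mean-squared error between estimated and true CATE."""
    n = len(true)
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(estimated, true)) / n)


rows = simulate(2000, seed=0)
true_cate = [tau for _, _, _, tau in rows]
cate_hat = t_learner_cate(rows)

# ATE baseline: predict the same (average) effect for every individual.
ate_hat = sum(cate_hat) / len(cate_hat)
pehe_cate = root_pehe(cate_hat, true_cate)
pehe_ate = root_pehe([ate_hat] * len(rows), true_cate)

print(f"root PEHE, T-learner:    {pehe_cate:.3f}")
print(f"root PEHE, ATE baseline: {pehe_ate:.3f}")
print(f"reduction factor:        {pehe_ate / pehe_cate:.1f}")
```

The reduction factor printed here depends entirely on the toy DGP, where the outcome models are correctly specified, and should not be compared with the figures reported in the abstract.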
RESULTS: CATE estimation methods are accurate, with the best reducing PEHE by a factor of 27.5 on average compared with the ATE baseline. However, confidence intervals remain wide, representing 25% of CATE values on average, and no single method outperforms the others across all scenarios and metrics. Observable metrics poorly reflect true performance: coverage is uncorrelated with it, and PEHE shows only moderate alignment.
CONCLUSIONS: We provide a comprehensive mapping of the evaluated methods under the defined constraints, offering guidance for method selection in real-world contexts, tailored to specific use cases.
Conference/Value in Health Info
2025-11, ISPOR Europe 2025, Glasgow, Scotland
Value in Health, Volume 28, Issue S2
Code
MSR58
Topic
Methodological & Statistical Research
Topic Subcategory
Artificial Intelligence, Machine Learning, Predictive Analytics, Confounding, Selection Bias Correction, Causal Inference
Disease
Personalized & Precision Medicine