Leveraging External Validation Dataset to Adjust for Missing Confounders in an Enhanced Two-Stage Zero-Inflated Poisson Model Design: A Methodological and Simulation Study

Speaker(s)

Wu DBC1, Lin HW2
1Johnson & Johnson Innovative Medicine, Singapore, Singapore, 2Soochow University, Taipei, Taiwan

OBJECTIVES: Analyzing large administrative claims databases has become popular in medical research because of their cost efficiency and data availability. However, these databases often lack detailed clinical and socioeconomic confounding variables (CVs). This limitation can bias treatment effect estimates in observational studies because unobserved CVs cannot be controlled for. This study developed a statistical method to adjust for missing CVs and improve statistical power.

METHODS: We developed a two-stage-calibration zero-inflated Poisson (TSC-ZIP) model that accounts for an excess of zeros. In stage 1, a ZIP model was fitted to a large dataset (LD) using the observed CVs. In stage 2, another ZIP model was fitted to a second, smaller external dataset containing the CVs missing from the LD. To mitigate the risk of overfitting and potential divergence of the estimated parameters, propensity scores for stages 1 and 2 were calculated from the covariates available at each stage by fitting a multivariate logistic regression model. A series of simulations was performed to compare the performance of the TSC-ZIP model with that of a single ZIP model.
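As a minimal illustration of why a ZIP model suits outcomes with an excess of zeros, the sketch below writes out the standard zero-inflated Poisson probability mass function in plain Python. The abstract does not give the model equations, so the parameterization (a Poisson rate `lam` mixed with a structural-zero probability `pi`), the function names, and the numeric values are assumptions chosen for illustration only; it does not reproduce the two-stage calibration itself. In a regression setting, `lam` and `pi` would typically depend on covariates through log and logit links.

```python
import math


def zip_pmf(y, lam, pi):
    """P(Y = y) under a zero-inflated Poisson distribution.

    lam: rate of the Poisson component (assumed parameterization)
    pi:  probability of a structural (excess) zero
    """
    if y < 0:
        return 0.0
    poisson = math.exp(-lam) * lam ** y / math.factorial(y)
    if y == 0:
        # Zeros arise from the structural-zero part or from the Poisson part.
        return pi + (1.0 - pi) * poisson
    return (1.0 - pi) * poisson


def zip_loglik(counts, lam, pi):
    """Log-likelihood of observed counts for fixed lam and pi."""
    return sum(math.log(zip_pmf(y, lam, pi)) for y in counts)


# Illustrative values only; they are not taken from the study.
lam, pi = 2.0, 0.3
p_zero_poisson = math.exp(-lam)      # zero probability under a plain Poisson
p_zero_zip = zip_pmf(0, lam, pi)     # zero probability under the ZIP
total = sum(zip_pmf(y, lam, pi) for y in range(60))  # should be close to 1

print(f"P(Y=0), Poisson: {p_zero_poisson:.4f}")
print(f"P(Y=0), ZIP:     {p_zero_zip:.4f}")
```

With the same rate, the ZIP assigns more probability to zero than a plain Poisson does, which is the feature the TSC-ZIP model relies on when the outcome contains many zeros.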

RESULTS: First, we proved mathematically that the regression coefficients derived from the TSC-ZIP model were unbiased and consistent, and that their variances were reduced, leading to improved power compared with a single ZIP model. Second, the simulations showed that, under three levels of assumed true treatment effect, the statistical power of the TSC-ZIP model was 0.608, 0.826, and 0.904, respectively, whereas that of the ZIP model was 0.460, 0.678, and 0.878, i.e., an average improvement of 30%. Third, a larger sample size in stage 1 reduced the variances and thereby increased the power of the TSC-ZIP model. Fourth, the regression coefficients of the TSC-ZIP model remained unbiased even when the missing CVs had a stronger association with the outcome.

CONCLUSIONS: The TSC-ZIP model was shown to be a reliable framework that effectively adjusts for missing CVs.

Code

MSR101

Topic

Methodological & Statistical Research, Study Approaches

Topic Subcategory

Confounding, Selection Bias Correction, Causal Inference, Electronic Medical & Health Records

Disease

No Additional Disease & Conditions/Specialized Treatment Areas