The Official News & Technical Journal Of The International Society For Pharmacoeconomics And Outcomes Research

Use of the Propensity Score Matching Method to Reduce Recruitment Bias in Observational Studies: Application to the Estimation of Drotrecogin Alfa's Impact on Intensive Care Units Workloady

Lionel Riou França MSc, Stéphanie Payet MSc, Katell Le Lay MSc, and Robert Launois PhD, REES France, Paris, France


Randomized clinical trials (RCTs) are considered the gold standard in clinical evaluations [1]. The main reason is that, when properly conducted, randomization ensures that treatment groups are comparable. Consequently, any difference detected is attributable to the intervention. As there is no need to control for confounding factors, the analyses are simpler. RCTs have good internal validity and are relevant for adoption decisions.

Sometimes, randomization is unfeasible, unethical or too costly. Moreover, non-randomized data may be already available. Observational studies (OSs) can then be an alternative to RCTs. They allow measuring the real-life practice and producing more generalisable results. Since these studies are expected to have a good external validity, they are relevant for policy decisions. When treatment allocation is done according to the physician's decisions, we can expect some patients to be given preferentially one of the treatments, resulting in non-comparable groups. We are then in presence of recruitment bias (Figure 1). There is a need to correct for this bias when estimating a treatment's effect in a non randomized study.

 

The Propensity Score Methods

The propensity score methodology allows coping with the presence of recruitment bias [2]. The idea is to model, for each patient, the probability of receiving one of the treatments compared, according to a set of baseline characteristics. This figure is called the propensity score. The PS acts as a summary of all available information. If it is equally distributed among the patients of each treatment group, we can consider that the groups share the same characteristics.

Commonly, the PS is estimated using logistic regression. The presence of missing data among the baseline characteristics can therefore be troublesome. For instance, in the hypothetical case of 30 covariates independently missing for 3 % of the subjects, a list-wise deletion of missing cases would lead to a reduction of 60 % in the sample size. To be able to estimate a PS for each patient, regardless of the presence of missing data, we used multiple imputation methods.

The criteria of a good regression model are well known in classical analysis: the model should be parsimonious and include only statistically significant predictors. The quality of the model can be quantitatively assessed using indicators such as Akaike's AIC. When modeling a propensity score, however, the issue is to ensure adequate balance in the patient's baseline characteristics. There is a need to include as much information as possible in the model.

In order to identify the most imbalanced covariates, we need a quantitative indicator. P-values are not the ideal candidate. Their value depends on the test selected (e.g. parametric or non parametric tests for quantitative variables) and on the sample size. Furthermore, the absence of statistical significance does not necessarily imply the absence of imbalance. A more appropriate summary statistic is the standardized difference (d) between treatment groups (Equation 1). This figure relates the difference in the groups' variable means to their observed variance.

We tested three logistic models using different variable selection strategies (Table 1). The first model strategy was the simplest: all measured baseline covariates were included in the model, without adding interaction terms. The second model selected only the most imbalanced or significant covariates. The third model used the same covariates as in the second model, but added the most significant interaction terms.

There are several ways to use the propensity score estimated. It could be used as an adjustment covariate, along with other outcome predictors. Alternatively, it could be used to weight the patients to make them representative of the population of interest. The PS can also be used to perform a stratified analysis. Finally, it can be used to match patients with similar propensity to receive treatment. The treatment groups in the matched sample are expected to share the same distribution of baseline characteristics, as in a randomized trial.

 We chose to use propensity score matching since it leads to simpler analyses. We performed an optimal matching, where we tried to match each treated patient to a control minimizing the distance between the matched groups.

Three different matched samples were obtained, one for each propensity score model tested.

Application to the PREMISS study

Sepsis is a severe syndrome related to infection [3], with high mortality rates. It is managed in France in Intensive Care Units (ICUs). Drotrecogin alfa has been shown to reduce mortality by 20% in the indication of severe sepsis [4], and a medico-economic model lead to the conclusion that this new treatment was cost-effective in France, in the European treatment indication [5].

The PREMISS study is an observational study carried out in France for the ministry of health to assess this new treatment's impact on healthcare. A control group was recruited before the drug's market authorization; the treatment group was recruited once the drug received its authorization. Eighty-eight intensive care units participated in this multi-center, pre/post study. Data was collected in a decentralized fashion, using an online case report form. In order to control for recruitment biases, forty-six baseline characteristics were retained.

Overall, 1096 patients were included in the study, 587 being in the treatment (i.e. drotrecogin alfa) group. There is some evidence of recruitment bias in the study, since the control group tends to have smaller propensity scores than the treated group (Figure 2).

 However, there is satisfying overlap in the groups' propensity scores, indicating that matching is conceivable.

Table 2 summarizes the performance of the three PS models. In the resulting PS matched samples, model M2 keeps 79% of the patients and model M3 keeps 68% of them. The balance ratio is the ratio of the sum of the absolute values of the 46 standardized differences in the initial sample by the same sum in the matched sample. Model M1 performs best in reducing total imbalance, with a ratio of 2.39. However, some  initial covariates remain unbalanced in all PS matched samples. Using a threshold of 10%, 2 baseline characteristics remain unbalanced in model M1, versus 5 in model M3.

As the standardized differences indicate, patients included in the treatment group were younger and had less co-morbidities (as measured by the McCabe severity score). As age was entered in all PS models as a quantitative variable, neither of the PS models succeeded in achieving balance for the proportion of older patients.

Model M1 was selected for the remaining analyses, as it leads to the better balance between treatment groups.

One of the goals of the PREMISS study was to estimate drotrecogin alfa's economic impact on the intensive care units. The study collected a thesaurus of medical acts as defined in the new French common classification of medical acts, the CCAM. Each act is associated with a relative cost index, allowing for the estimation of the global ICU workload. Since this workload is highly skewed, we used a gamma regression model to estimate its increase among drotrecogin alfa treated patients. A random effects model was fitted in order to account for the clustering of the patients among the intensive care units.

Table 3 gives the workload increase estimates among drotrecogin alfa treated patients using four different methods. Without taking into account the presence of recruitment bias, a full sample analysis estimates that treating patients with the new drug will increase workload by 28%. This figure is overestimated, since the patients included in the control group tended to be more severe. When adjusting for the presence of recruitment bias, this estimate lowers to 19% in the full sample, a figure similar to the estimate obtained in the crude analysis of the PS matched sample (18%). However, further adjustments in the PS matched sample reduce this estimate to a 14% increase.

Conclusion

The PS methodology has shown to be at least as good as multivariate adjustment methods. Its ease of use and of communication can make it appealing. However, conducting a good PS analysis requires careful consideration of the initial characteristics to measure (a large number of variables will increase the burden of data collection). If performing PS matching, there is a need for sufficient overlap between the groups, and the sample size may be increased to take into account that the more extreme patients will be excluded.

More essential is the fact that the PS methods only take into account observed variables. There is still a possibility for the presence of hidden bias. Furthermore, the PS methods allow to reduce recruitment bias, but not necessarily to eliminate it.

Finally, PS matching will reduce the study's external validity, since only a subset of the treated patients is used for the analysis.

 In conclusion, the PS is a useful tool for the analysis of observational data, but, as any other tool, it has some limitations that need to be kept in mind.

References

  1. Dunn D, Babiker A, Hooker M, Darbyshire J. The dangers of inferring treatment effects from observational data: a case study in HIV infection. Control Clin Trials 2002;23:106-10.
  2. Rosenbaum PR, Rubin DB. The central role of the propensity score in observational studies for causal effects. Biometrika 1983;70:41-55.
  3. Bone RC, Balk RA, Cerra FB, et al. Definitions for sepsis and organ failure and guidelines for the use of innovative therapies in sepsis. The ACCP/SCCM Consensus Conference Committee. American College of Chest Physicians/Society of Critical Care Medicine. Chest 1992;101:1644-55.
  4. Bernard GR, Vincent JL, Laterre PF, et al. Recombinant human protein C Worldwide Evaluation in Severe Sepsis (PROWESS) study group. Efficacy and safety of recombinant human activated protein C for severe sepsis. N Engl J Med 2001;344:699-709.
  5. Riou França L, Launois R, Le Lay K, et al. Cost-effectiveness of drotrecogin alfa (activated) in the treatment of severe sepsis with multiple organ failure. Intl J Technol Assess Health Care 2006;22. In press.

  Issues Index | 2006 Issues Index  

Contact ISPOR @ info@ispor.org  |  View Legal Disclaimer
©2010 International Society for Pharmacoeconomics and Outcomes Research.
All rights reserved under International and Pan-American Copyright Conventions.
 
Website design by Eagle Systems USA, Inc.