Join ISPOR  | Sign up for mailing List  | Follow Us: LinkedIn Twitter Facebook YouTube

The Official News & Technical Journal Of The International Society For Pharmacoeconomics And Outcomes Research


Onur Baser, MS, PhD, President, STATinMED Research, Ann Arbor, MI, USA, Assistant Professor of Surgery, Department of Surgery, University of Michigan, Ann Arbor, MI, USA (formerly a Thomson Healthcare employee)

Propensity Score Matching With Multi-Level Categories: An Application

The last decade has seen a broad surge of interest in propensity score matching to estimate average treatment based on observational data. Weitzen [1] reviewed 47 studies contained by searching MEDLINE and Science Citation to identify observational studies that addressed clinical questions using propensity score matching. This literature focuses on models with only two potential states, treatment and non-treatment. However, when evaluating certain treatment programs, a more complex framework appears to be necessary because the actual choice set of individuals contains more than just two options. For example, a physician may have more than two treatment methods, or a drug may be applied in differing dosage levels. In these cases, the conventional form of propensity score matching is inadequate and extensions are necessary.

Imbens [2] and Lechner [3] proposed a methodology that accounts for multilevel treatments. They demonstrated that all major properties obtained by Rubin [4] and by Rosenbaum and Rubin [5] for conventional propensity score matching also hold in this extended framework. However, the application to multilevel treatments was missing in these papers. Because the statistical contents of this extended version are relatively sophisticated, from the viewpoint of the practical researcher, application is avoided. Because of the lack of connection between methodology and application since the technique's introduction, there have been only a few applications of the methodology. Foster illustrated this method in an analysis of dose response in the relationship between the volume of services received and treatment outcomes in mental health [6]. Wang et al. used multiple propensity score matching in a study of the dose-response relationship between diclofenac prescriptions and hospitalization for gastrointestinal bleeding and perforation [7].

In this article, we will bridge the methodology with application. Unlike the previous application studies, we will apply both propensity score matching and multivariate analysis to estimate the average treatment effect to see how robust the estimates are, in relation to the choice of model.

In the next section, we will briefly describe how conventional propensity scores can be extended to multi-level treatment analysis and provide step-by-step instructions for application. We will then apply the methodology to MarketScan® data to estimate the treatment effect among asthma patients with use of controller only, reliever only, and combination medication.

Suppose we are interested in the average causal effect of several treatment options on some outcome. Let a random variable Ti take one of the discrete values, which we index 1,2,...,t. In our example, the response is the asthma medication use, and it takes the values controller only, reliever only, and combination, indexed as 1,2,3. Let πij be the propensity score, which is the conditional probability of receiving a particular level of treatment given the pretreatment variables: πij(X)=Pr(Ti=t | X) . In our example, πij(X) is the probability that i-th patient uses controller only medication given at the pretreatment level.

Step 1
The first step is to estimate πij. Multinomial Logit is the most commonly used specification in discrete choice modeling. Its attractiveness is primarily due to its computational simplicity in calculating choice probabilities. Multinomial logit is appropriate for estimating propensity scores when the values of treatment are qualitatively distinct and without logical ordering. For example, in our example, physicians are choosing from three different treatments: controller only, reliever only, or combination therapy. These treatment choices are distinct and without logical ordering. The multinomial logit model depends on the assumption of independence from irrelevant alternatives. Thus, the choice between controller only and reliever only is assumed to be independent of the other choices (controller only, combination and reliever only, combination).

Step 2
After estimating propensity scores for each category , the second and final step is to estimate the conditional expectation of outcome given the treatment level, where Dt is the binary treatment level indicator.

In other words, the average response at treatment level t is estimated as the average of conditional expectations averaged over the empirical distribution of the treatment variables.

Every step in this process can be easily applied using standard commercial software programs, such as SAS and STATA.

Subject and Databases
Data for this study originated in the MarketScan private insurance database for 1998-2000. In 2005, this database contained information on approximately 13 million persons who were covered by private insurance plans. The following five variables were available in this database: age, gender, International Classification of Disease - 9th Revision (ICD-9), plan type, and geographic region.

An analytic sample consisted of persons who satisfied the following characteristics:

  • Had at least two outpatient claims with a primary or secondary diagnosis of asthma; or
  • Had at least one emergency room claim with a primary diagnosis of asthma, and a transaction for an asthma drug 90 days prior to, or 7 days following, the emergency room claim; or
  • Had at least one inpatient claim with a primary diagnosis of asthma; or
  • Had a secondary diagnosis of asthma and a primary diagnosis of respiratory infection in an outpatient or inpatient claim; or
  • Had at least one drug transaction for a(n) anti-inflammatory agent, oral antileukotrienes, long-acting bronchodilator, or inhaled or oral short-acting beta-agonistic.

To ensure that individual records were complete and that the analytic sample would be representative of the population of patients of interest, a number of exclusions were imposed. Individuals were excluded if they:

  • Had a diagnosis of chronic obstructive pulmonary disease, emphysema, or chronic bronchitis;
  • Were pregnant at some stage during the study period;
  • Were not continuously enrolled in the health plan for 24 months;
  • Were in health maintenance organizations (HMOs) and capitated point of service (POS) plans; or
  • Were elderly, defined as ages 65 and over.

The dependent variable was total health expenditure, calculated as the sum of inpatient, outpatient, and pharmaceutical expenditures for all medical care services. This included all services paid for by insurance, as well as co-payments and deductibles paid out-of-pocket. These values are derived from MarketScan data.

Asthma drugs can be envisioned as being primarily reliever medication or as being primarily controller medication. Therefore, we divided treatment categories into three parts: 1) controller patients took medication (such as inhaled antiinflammatory agents, oral corticosteroids, oral anti-leukotrienes, and long-acting bronchodilators) to control pulmonary inflammation and prevent an acute asthmatic exacerbation; 2) Reliever patients took medication to relieve symptoms in an acute asthmatic exacerbation (i.e., drugs categorized as anti-holinergics or inhaled short-acting beta-agonists); and 3) Combination patients used both controller and reliever medications.

Descriptive Statistics Based on the definitions of asthma episode discussed above, we obtained a sample that included 25,124 patients in fee-for-service (FFS) plans and 6,603 patients in non-FFS plans. Table 1 presents descriptive statistics of the sample, stratified by FFS and non-FFS plans, and then stratified by treatment type. Patients in FFS plans averaged 34 years of age, compared to 35 years for non-FFS plans; the FFS-plan patients were also more likely to be female. In addition, patients in FFS plans were more likely than patients in non-FFS plans to reside in the North Central region. Substantial differences in mean income between the FFS and non-FFS plans were evident from county-level U.S. census data linked to claims data. Patients in FFS plans appear to be sicker than those in non-FFS plans. The former have a higher percentage of asthma-specific comorbidities and have a higher number of major diagnostic categories. As expected from these differences, a likelihood test was conducted to examine whether separate models required for FFS and non-FFS samples concluded that we should estimate separate multinomial logistic models for FFS and non-FFS samples to estimate propensity scores.

In terms of treatment types, patients using combination therapy in FFS plans show substantial differences in demographic and clinical factors relative to patients using reliever only medication. Patients using reliever-only medication were younger, healthier, more likely to live in the North Central region, more likely to have a lower income, and less likely to live in the South. In contrast, patients using controller only medication were more likely to have a higher income level, less likely to live in the North Central region, and more likely to live in the South, relative to patients using combination therapy.

Income differences disappeared among the treatment types for patients in non-FFS plans. The rest of the trends were similar to the group in the FFS plans. Because of these observed differences in patient characteristics, adjustment is necessary to compare the total health care expenditure for each type of treatment. Findings may be confounded because of these differences.

Estimation of Propensity Score
We estimated the probability of being in each treatment group using a multinomial logit regression. Coefficients are the log odds of a patient receiving the reliever medication alone, the controller medication alone, or a combination therapy. Overall, both the FFS model and the non-FFS model significantly estimated the variation in the selection. ((Prob> x2)<0.0000).

For the FFS model, older patients were less likely to be treated with relievers alone or controllers alone, as opposed to combination therapy.

Females were significantly more likely to receive a reliever only treatment rather than a combination treatment. Residents of the North East region (reference category) and those living in a county with the highest category of average income had significantly increased odds of receiving combination therapy rather than controller therapy. There were no significant differences between the reliever only and combination only therapies by residential regions or by the county's average income level. The presence of sinusitis was associated with a decreased likelihood of receiving reliever only therapy or controller only therapy relative to combination therapy. Allergic Rhinitis reduced the odds of receiving a reliever only therapy but did not have a significant impact on controller only therapy. The other co-morbidities exerted no significant impact on the choice of drug therapy.

For patients in non-FFS plans, the effects of age and gender were similar to those in the FFS analysis. Living in the West reduced the odds of receiving reliever only therapy or controller only therapy relative to combination therapy. None of the county-level income variables from the census was statistically significant. The presence of allergic rhinitis, migraine, and sinusitis decreased the odds of receiving reliever only medication relative to combination therapy. For patients using controller therapy, the only co-morbidity associated with a significant effect was the presence of allergic rhinitis, which increased the odds of receiving controller only therapy relative to combination therapy. Higher numbers of unique threedigit ICD-9 codes significantly decreased the odds of receiving a controller therapy relative to combination therapy.

After estimating each model, we calculated the probability of being in each treatment type and used these probabilities as weights to analyze the outcome variables.

Analysis of Outcome Variables In Table 3, we provide the outcome estimates for each of the treatment arms for the FFS and non- FFS group.

The first row presents the unadjusted mean for total health care costs. The difference of total health care cost between reliever therapy and combination therapy was $1,471 for the FFS group and $746 for the non-FFS group. The estimates were $1,298 for the FFS group, and a savings of $340 for the non-FFS group between the controller only therapy and the combination therapy. All of these differences were confounded with patients' demographic and clinical characteristics.

The second row presents propensity score adjusted estimates. The difference of the total health care costs between reliever therapy and combination therapy for the FFS group was $728 - a statistically significant difference. As expected, the difference was smaller than the unadjusted mean, because patients in the reliever group were younger and healthier; therefore, the unadjusted mean for this comparison reflected an upward bias. The propensity score adjusted difference between patients receiving controller only medication and combination medication was $1,216. This adjusted difference represented a difference of only $82 from the unadjusted mean, because these groups of people were similar before the propensity score matching. Therefore, we would anticipate little adjustment in price after controlling for confounding factors. We can see similar trends in the non-FFS group. Reliever only therapy totaled $1,266, controller only therapy was $1,959, and combination therapy totaled $1,933. Although the cost difference between reliever only and combination therapy was significant, there was no evidence that combination therapy cost more than controller only therapy.

We also adjusted the heterogeneity in the sample by using multivariate analysis. Multivariate analysis and propensity score matching control for the observed differences in treatment groups. Therefore multivariate analysis serves as a sensitivity analysis in our application. We modeled health care expenditures as a function of patients' demographic and clinical factors used in the multinomial logit, and we added two dummy variables: one for reliever only and one for controller only. Following the principles proposed by Manning and Mullahy [11], we used a generalized linear model with a log-link function and gamma family. Marginal effects from estimated parameters are presented at the last row in Table 3. The differences in total health care expenditure by each of the three treatment arms were similar to the ones we see in propensity score adjusted differences. For the FFS group, comparing combination therapy with reliever only therapy, the difference was $761, according to multivariate analysis ($728 when compared to propensity score matching). The expenditure difference between controller-only therapy and combination therapy was $1,280 ($1,266 in propensity score matching). For the non-FFS group, the estimated cost of combination therapy was $1,265, reliever only therapy was $841, and controller only therapy was $1,161. The differences in cost estimates according to multivariate and propensity score adjustment were not statistically significant.

Discussion and Conclusion
In an environment where researchers have to rely on observational data, propensity score matching is a great tool to control for confounding factors when estimating average treatment effects. The conventional form, matching on only two groups, is now widely used in health services research. Statistical properties have been investigated by many researchers, and guidelines have been provided to choose among the different types of propensity score matching techniques [12]. Although the extension of conventional propensity score matching is straightforward and easy to apply, there are some basic differences in approach. First, if propensity score matching applies to the two groups, a variable indicating treatment variables and the estimated propensity score based on a regression will have a causal interpretation. This is not true for multilevel treatment [2]. Second, with two groups, group A and group B, based on propensity scores, one is able to examine the differences in outcomes reveals treatment effect. However, extending the pattern to three groups provides biased results. Suppose we matched group A and group B and then, using matched observation in group B, suppose we matched group C. This sort of pattern introduces an artificial sequence into the matching process and results in misleading conclusions. Since all three groups are available to patients simultaneously, multinomial distribution should be used on subgroups, rather than a binomial distribution.

Propensity score matching employs predicted probabilities as a weight to adjust differences in different treatment patterns; therefore, it is crucial to estimate these probabilities consistently. The choice of the model to estimate propensity scores should be carefully considered. In nested models, nested multinomial logit, in the situations where the independence of irrelevant alternative assumptions is violated, conditional multinomial logit and for the other cases multinomial logit can be used to estimate predicted probabilities. Propensity score matching and multivariate analysis both control for the observed differences in the sample to isolate treatment effects. The choices between the two are not clear cut. It has been shown that when there are substantial differences between the treatment groups, multivariate analysis might yield unstable results [13].

Two limitations should be noted. First, if the groups do not have substantial overlap as in the conventional model, propensity score techniques may yield substantial errors. Baser applied a method for estimating the average treatment effect for limited overlap [14]. Second, matching may not eliminate unobserved bias. It is quite possible that the choice in treatment arms depends on physician or practice-related prescribing patterns and the failure to control for these factors can yield biased estimates. The bounding approach and the instrumental variable approach can be used to control for possible unobserved selection bias, but these estimators are confounded by their own limitations [15, 16].

The discussion in this article does not provide rigorous treatment of the theory that underlies propensity score matching with more than three categories. These details can be found elsewhere in the literature [2, 3].

Numerous seminar participants provided comments that greatly improved this work.

1. Weitzen S, Lapane KL, Toledano AY, Hume AL, Mor V. Principles
for modeling propensity scores in medical research: a systematic literature
review. Pharmacoepidemiol Drug Saf 2004;13:841-53.
2. Imbens GW. The role of the propensity score in estimating doseresponse
functions. Biometrika 2000;87:706-10.
3. Lechner M. Identification and estimation of causal effects of multiple
treatments under the conditional independence assumption: IZA;
4. Rubin DB. Inference and missing data. Biometrika 1976;63:581.
5. Rosenbaum PR, Rubin DB. The central role of the propensity
score in observational studies for causal effects. Biometrika
6. Foster EM. Propensity score matching: an illustrative analysis of
dose response. Medical Care 2003;41:1183-92.
7. Wang J, Donnan PT, Steinke D, MacDonald TM. The multiple
propensity score for analysis of dose-response relationships in drug
safety studies. Pharmacoepidemiol Drug Saf 2001;10:105-11.
8. McFadden D. Conditional Logit Analysis of Qualitative Choice
Behavior: Institute of Urban and Regional Development, University of
California; 1973.
9. Tse YK. A Diagnostic Test for the Multinomial Logit Model.
Journal of Business & Economic Statistics 1987;5:283-6.
10. Maddala GS. Limited-Dependent and Qualitative Variables in
Econometrics: Cambridge University Press; 1986.
11. Manning WG, Mullahy J. Estimating log models: to transform or
not to transform? J Health Econ 2001;20:461-94.
12. Baser O. Too much ado about propensity score models?
Comparing methods of propensity score matching. Value Health
13. Baser O. Choosing Propensity Score Matching over Regression
Adjustment fo Causal Inference: When, Why and How it makes
sense. Journal of Medical Economics. 2007;10:379-91.
14. Baser O., Propensity Score Matching with Limited Overlap.
Economics Bulletin 2007;9:1-8.
15. McClellan M, McNeil BJ, Newhouse JP. Does more intensive
treatment of acute myocardial infarction in the elderly reduce mortality?
Analysis using instrumental variables. JAMA 1994;272:859-66.
16. Rubin DB. Estimating Causal Effects from Large Data Sets Using
Propensity Scores. the American College of Physicians; 1997.


  Issues Index | 2008 Issues Index  

Contact ISPOR @  |  View Legal Disclaimer
©2016 International Society for Pharmacoeconomics and Outcomes Research. All rights reserved under International and Pan-American Copyright Conventions. 
Website design by Eagle Systems USA, Inc.