Onur Baser, MS, PhD, President, STATinMED Research, Ann Arbor, MI, USA, Assistant Professor of Surgery, Department of
Surgery, University of Michigan, Ann Arbor, MI, USA (formerly a Thomson Healthcare employee)
Propensity Score Matching With Multi-Level Categories:
The last decade has seen a broad surge of interest
in propensity score matching to estimate
average treatment effects based on observational data.
Weitzen et al. [1] reviewed 47 studies obtained by
searching MEDLINE and Science Citation Index to identify
observational studies that addressed clinical
questions using propensity score matching. This
literature focuses on models with only two potential
states, treatment and non-treatment.
However, when evaluating certain treatment programs,
a more complex framework appears to be
necessary because the actual choice set of individuals
contains more than just two options. For
example, a physician may have more than two
treatment methods, or a drug may be applied in
differing dosage levels. In these cases, the conventional
form of propensity score matching is
inadequate and extensions are necessary.
Imbens [2] and Lechner [3] proposed a methodology
that accounts for multilevel treatments.
They demonstrated that all major properties
obtained by Rubin [4] and by Rosenbaum and
Rubin [5] for conventional propensity score
matching also hold in this extended framework.
However, these papers did not include an application
to multilevel treatments. Because the statistical
content of the extended framework is relatively
sophisticated, applied researchers have tended
to avoid it.
Because of this gap between
methodology and application, there have been
only a few applications of the methodology since
the technique's introduction. Foster [6] illustrated
this method in an analysis of dose response in
the relationship between the volume of services
received and treatment outcomes in mental
health. Wang et al. [7] used multiple propensity
score matching in a study of the dose-response
relationship between diclofenac prescriptions
and hospitalization for gastrointestinal bleeding
and perforation.
In this article, we bridge methodology and
application. Unlike the previous application studies,
we apply both propensity score matching
and multivariate analysis to estimate the average
treatment effect and to assess how robust the estimates
are to the choice of model.
In the next section, we will briefly describe how
conventional propensity scores can be extended
to multi-level treatment analysis and provide
step-by-step instructions for application. We will
then apply the methodology to MarketScan®
data to estimate the treatment effect among asthma
patients with use of controller only, reliever
only, and combination medication.
Suppose we are interested in the average causal
effect of several treatment options on some outcome.
Let the random variable $T_i$, the treatment received by patient $i$, take one of the
discrete values indexed $j = 1, 2, \ldots, t$. In our
example, the treatment is asthma medication
use, and it takes the values controller only, reliever
only, and combination, indexed as 1, 2, 3. Let $\pi_{ij}$ be the propensity score, which is the conditional
probability of receiving a particular level of
treatment given the pretreatment variables: $\pi_{ij}(X) = \Pr(T_i = j \mid X)$. In our example, $\pi_{i1}(X)$ is the probability that the i-th patient uses
controller-only medication given the pretreatment variables.
The first step is to estimate πij. Multinomial Logit
is the most commonly used specification in
discrete choice modeling. Its attractiveness is
primarily due to its computational simplicity in
calculating choice probabilities. Multinomial logit
is appropriate for estimating propensity scores
when the values of treatment are qualitatively distinct
and without logical ordering. In
our example, physicians are choosing from three
different treatments: controller only, reliever only,
or combination therapy. These treatment choices
are distinct and without logical ordering. The
multinomial logit model depends on the assumption
of independence from irrelevant alternatives.
Thus, the relative odds of choosing between controller only and
reliever only are assumed to be independent of the
remaining alternative, combination therapy, and the same
holds for the other pairs (controller only versus combination, and
reliever only versus combination).
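The estimation step described above can be sketched in a few lines. This is a minimal illustration, assuming a multinomial logit with the last treatment level as the reference category; the function names, the toy data, and the plain gradient-ascent fit are all hypothetical choices made for the example, and applied work would normally rely on a packaged routine in SAS or Stata.

```python
import math

def choice_probabilities(x, betas):
    """Multinomial logit probabilities for one patient.

    x     -- covariates with a leading 1.0 for the intercept
    betas -- one coefficient list per treatment level; the reference
             level keeps all-zero coefficients
    """
    scores = [sum(b * v for b, v in zip(beta, x)) for beta in betas]
    top = max(scores)  # subtract the max so exp() cannot overflow
    expo = [math.exp(s - top) for s in scores]
    total = sum(expo)
    return [e / total for e in expo]

def fit_multinomial_logit(X, t, n_levels, steps=2000, rate=0.1):
    """Gradient ascent on the mean log-likelihood (illustration only)."""
    k = len(X[0])
    betas = [[0.0] * k for _ in range(n_levels)]
    for _ in range(steps):
        grads = [[0.0] * k for _ in range(n_levels)]
        for x, level in zip(X, t):
            p = choice_probabilities(x, betas)
            for j in range(n_levels - 1):  # last level is the reference
                resid = (1.0 if level == j else 0.0) - p[j]
                for c in range(k):
                    grads[j][c] += resid * x[c]
        for j in range(n_levels - 1):
            for c in range(k):
                betas[j][c] += rate * grads[j][c] / len(X)
    return betas

# Hypothetical toy data: intercept plus age in decades, centered at the mean.
# Levels: 0 = controller only, 1 = reliever only, 2 = combination (reference).
ages = [-1.75, -1.25, -0.75, -0.25, 0.25, 0.75, 1.25, 1.75]
X = [[1.0, a] for a in ages]
t = [1, 1, 0, 2, 1, 0, 2, 2]
betas = fit_multinomial_logit(X, t, n_levels=3)
scores = [choice_probabilities(x, betas) for x in X]
print([round(p, 3) for p in scores[0]])  # estimated scores, youngest patient
```

Each row of `scores` holds one patient's three estimated propensity scores, which sum to one by construction.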
After estimating propensity scores for each category,
the second and final step is to estimate the
conditional expectation of the outcome given the treatment level, where $D_j$ is the
binary indicator of treatment level $j$.
In other words, the average response at treatment
level $j$ is estimated as the average of conditional
expectations, averaged over the empirical
distribution of the pretreatment variables.
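One common way to carry out this second step, consistent with the use of estimated probabilities as weights later in this article, is a normalized inverse-probability-weighted mean, $\hat{\mu}_j = \left(\sum_i D_{ij} Y_i / \hat{\pi}_{ij}\right) / \left(\sum_i D_{ij} / \hat{\pi}_{ij}\right)$. The sketch below illustrates that form under this assumption; it is not necessarily the exact estimator used in the analysis, and the function name and numbers are hypothetical.

```python
def weighted_mean_outcome(y, t, scores, level):
    """Normalized inverse-probability-weighted mean outcome at one level.

    y      -- outcomes (e.g., total expenditure per patient)
    t      -- treatment level actually received by each patient
    scores -- estimated propensity scores, one list per patient
    level  -- treatment level j whose average response is wanted
    """
    num = 0.0
    den = 0.0
    for y_i, t_i, p_i in zip(y, t, scores):
        if t_i == level:          # D_ij = 1 only for patients at level j
            w = 1.0 / p_i[level]  # weight by the inverse of the own-level score
            num += w * y_i
            den += w
    return num / den

# Hypothetical numbers: two patients at level 0 with different scores.
y = [100.0, 300.0, 500.0]
t = [0, 0, 1]
scores = [[0.5, 0.5], [0.25, 0.75], [0.4, 0.6]]
print(weighted_mean_outcome(y, t, scores, level=0))
```

Here the second patient was less likely to receive level 0 (score 0.25) and therefore counts twice as much as the first patient in the level-0 average.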
Every step in this process can be carried out
using standard commercial software programs,
such as SAS and Stata.
Subject and Databases
Data for this study originated in the MarketScan
private insurance database for 1998-2000. In
2005, this database contained information on
approximately 13 million persons who were covered
by private insurance plans. The following
five variables were available in this database:
age, gender, International Classification of
Diseases, 9th Revision (ICD-9) codes, plan type, and
The analytic sample consisted of persons who
met at least one of the following criteria:
- Had at least two outpatient claims with a primary
or secondary diagnosis of asthma; or
- Had at least one emergency room claim with
a primary diagnosis of asthma, and a transaction
for an asthma drug 90 days prior to, or 7
days following, the emergency room claim; or
- Had at least one inpatient claim with a primary
diagnosis of asthma; or
- Had a secondary diagnosis of asthma and a
primary diagnosis of respiratory infection in
an outpatient or inpatient claim; or
- Had at least one drug transaction for an
anti-inflammatory agent, oral antileukotriene,
long-acting bronchodilator, or
inhaled or oral short-acting beta-agonist.
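The inclusion criteria above are alternatives, so a patient qualifies by meeting any one of them. A minimal sketch of that logic follows, assuming a hypothetical per-patient summary dictionary (all field names are invented for illustration) and assuming the 90-day and 7-day limits of the emergency room window are inclusive.

```python
from datetime import date, timedelta

def drug_in_er_window(er_date, drug_dates):
    """True if any asthma drug transaction falls within 90 days before
    or 7 days after the emergency room claim (bounds assumed inclusive)."""
    start = er_date - timedelta(days=90)
    end = er_date + timedelta(days=7)
    return any(start <= d <= end for d in drug_dates)

def meets_inclusion(p):
    """p is a hypothetical per-patient summary; any one criterion suffices."""
    er_ok = any(
        drug_in_er_window(er, p["asthma_drug_dates"])
        for er in p["er_primary_asthma_dates"]
    )
    return (
        p["outpatient_asthma_claims"] >= 2
        or er_ok
        or p["inpatient_primary_asthma_claims"] >= 1
        or p["secondary_asthma_with_resp_infection"]
        or len(p["listed_drug_dates"]) >= 1
    )

# Hypothetical patient: one ER claim, with a drug fill four days later.
patient = {
    "outpatient_asthma_claims": 1,
    "er_primary_asthma_dates": [date(2000, 6, 1)],
    "asthma_drug_dates": [date(2000, 6, 5)],
    "inpatient_primary_asthma_claims": 0,
    "secondary_asthma_with_resp_infection": False,
    "listed_drug_dates": [],
}
print(meets_inclusion(patient))
```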
To ensure that individual records were complete
and that the analytic sample would be representative
of the population of patients of interest, a
number of exclusions were imposed. Individuals
were excluded if they:
- Had a diagnosis of chronic obstructive pulmonary
disease, emphysema, or chronic
- Were pregnant at some stage during the study
- Were not continuously enrolled in the health
plan for 24 months;
- Were in health maintenance organizations
(HMOs) and capitated point of service (POS)
- Were elderly, defined as ages 65 and over.
The dependent variable was total health expenditure,
calculated as the sum of inpatient, outpatient,
and pharmaceutical expenditures for all
medical care services. This included all services
paid for by insurance, as well as co-payments
and deductibles paid out-of-pocket. These values
are derived from MarketScan data.
Asthma drugs can be envisioned as being primarily
reliever medication or as being primarily controller
medication. Therefore, we divided patients
into three treatment categories: 1) controller
patients took medication (such as inhaled anti-inflammatory
agents, oral corticosteroids, oral
antileukotrienes, and long-acting bronchodilators)
to control pulmonary inflammation and prevent
an acute asthmatic exacerbation; 2) reliever
patients took medication to relieve symptoms in
an acute asthmatic exacerbation (i.e., drugs categorized
as anticholinergics or inhaled short-acting
beta-agonists); and 3) combination patients
used both controller and reliever medications.
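The three-way grouping above reduces to a rule on two flags per patient. A minimal sketch, assuming hypothetical boolean inputs that record whether any controller or any reliever medication was used during the study period:

```python
def treatment_group(used_controller, used_reliever):
    """Assign one of the three mutually exclusive treatment categories."""
    if used_controller and used_reliever:
        return "combination"
    if used_controller:
        return "controller only"
    if used_reliever:
        return "reliever only"
    return None  # no asthma medication observed; outside the three groups

print(treatment_group(True, False))
```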
Based on the definitions of asthma episode discussed
above, we obtained a sample that included 25,124 patients in fee-for-service (FFS)
plans and 6,603 patients in non-FFS plans. Table
1 presents descriptive statistics of the sample,
stratified by FFS and non-FFS plans, and then
stratified by treatment type. Patients in FFS plans
averaged 34 years of age, compared to 35 years
for non-FFS plans; the FFS-plan patients were
also more likely to be female. In addition, patients
in FFS plans were more likely than patients in
non-FFS plans to reside in the North Central
region. Substantial differences in mean income
between the FFS and non-FFS plans were evident
from county-level U.S. census data linked to
claims data. Patients in FFS plans appear to be
sicker than those in non-FFS plans. The former
have a higher percentage of asthma-specific comorbidities
and have a higher number of major
diagnostic categories. As expected from these
differences, a likelihood ratio test conducted to
examine whether separate models were required for
the FFS and non-FFS samples concluded that
separate multinomial logistic models should be
estimated for the FFS and non-FFS samples.
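A test of this kind is usually built by comparing a pooled model with the separately estimated models. The expression below is a sketch under the assumption that the test took this standard likelihood ratio form; the symbols $\ln L$ and $q$ are notation introduced here for illustration.

```latex
LR = 2\left[\ln L_{\text{FFS}} + \ln L_{\text{non-FFS}} - \ln L_{\text{pooled}}\right]
\;\sim\; \chi^2_{q} \quad \text{under the null of a common model,}
```

where $q$ is the number of additional parameters estimated when the two samples are modeled separately. A large value of $LR$ leads to rejecting the pooled specification.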
In terms of treatment types, patients using combination
therapy in FFS plans show substantial
differences in demographic and clinical factors
relative to patients using reliever only medication.
Patients using reliever-only medication were
younger, healthier, more likely to live in the North
Central region, more likely to have a lower
income, and less likely to live in the South. In
contrast, patients using controller only medication
were more likely to have a higher income
level, less likely to live in the North Central region,
and more likely to live in the South, relative to
patients using combination therapy.
Income differences disappeared among the treatment
types for patients in non-FFS plans. The rest
of the trends were similar to the group in the FFS
plans. Because of these observed differences in
patient characteristics, adjustment is necessary
to compare the total health care expenditure for
each type of treatment. Findings may be confounded
because of these differences.
Estimation of Propensity Score
We estimated the probability of being in each
treatment group using a multinomial logit regression.
Coefficients are the log odds of a patient
receiving the reliever medication alone or the controller
medication alone, relative to combination therapy.
Overall, both the FFS model and the non-FFS
model explained a statistically significant share of the
variation in treatment selection (Prob > χ² < 0.0001).
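For readers less familiar with the multinomial logit, the interpretation of the coefficients can be written out explicitly. The expression assumes that combination therapy is the base outcome, as the comparisons reported in this section suggest; $\beta_j$ denotes the coefficient vector for treatment $j$.

```latex
\ln\frac{\Pr(T_i = j \mid X_i)}{\Pr(T_i = \text{combination} \mid X_i)} = X_i'\beta_j,
\qquad j \in \{\text{reliever only},\ \text{controller only}\}
```

A negative element of $\beta_j$ therefore means that the corresponding characteristic lowers the odds of receiving treatment $j$ rather than combination therapy, and $\exp(\beta_j)$ gives the associated odds ratio.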
For the FFS model, older patients were less likely
to be treated with relievers alone or controllers
alone, as opposed to combination therapy.
Females were significantly more likely to receive a
reliever only treatment rather than a combination
treatment. Residents of the North East region (reference
category) and those living in a county with
the highest category of average income had significantly
increased odds of receiving combination
therapy rather than controller therapy. There were
no significant differences between reliever-only
and combination therapy by residential
region or by the county's average income level.
The presence of sinusitis was associated with a
decreased likelihood of receiving reliever only therapy
or controller only therapy relative to combination
therapy. Allergic rhinitis reduced the odds of
receiving a reliever only therapy but did not have a
significant impact on controller only therapy. The
other co-morbidities exerted no significant impact
on the choice of drug therapy.
For patients in non-FFS plans, the effects of age
and gender were similar to those in the FFS analysis.
Living in the West reduced the odds of
receiving reliever only therapy or controller only
therapy relative to combination therapy. None of
the county-level income variables from the census
was statistically significant. The presence of
allergic rhinitis, migraine, and sinusitis decreased
the odds of receiving reliever only medication relative
to combination therapy. For patients using
controller therapy, the only co-morbidity associated
with a significant effect was the presence of
allergic rhinitis, which increased the odds of
receiving controller only therapy relative to combination
therapy. Higher numbers of unique three-digit
ICD-9 codes significantly decreased the
odds of receiving a controller therapy relative to
combination therapy.
After estimating each model, we calculated the
probability of being in each treatment type and
used these probabilities as weights to analyze the
outcome variables.
Analysis of Outcome Variables
In Table 3, we provide the outcome estimates for
each of the treatment arms for the FFS and non-FFS groups.
The first row presents the unadjusted mean for
total health care costs. The difference in total
health care costs between reliever therapy and
combination therapy was $1,471 for the FFS
group and $746 for the non-FFS group. Between
controller only therapy and combination therapy,
the difference was $1,298 for the FFS group and
a savings of $340 for the non-FFS group.
All of these differences were confounded by
patients' demographic and clinical characteristics.
The second row presents propensity score
adjusted estimates. The difference in total
health care costs between reliever therapy and
combination therapy for the FFS group was $728,
a statistically significant difference. As expected,
the difference was smaller than the unadjusted
difference, because patients in the reliever group
were younger and healthier; therefore, the unadjusted
difference for this comparison reflected an
upward bias. The propensity score adjusted difference
between patients receiving controller only
medication and combination medication was
$1,216. This adjusted difference differed by
only $82 from the unadjusted difference,
because these groups were similar
before the propensity score matching. Therefore,
we would anticipate little adjustment in cost after
controlling for confounding factors. We can see
similar trends in the non-FFS group. Reliever only
therapy totaled $1,266, controller only therapy
was $1,959, and combination therapy totaled
$1,933. Although the cost difference between
reliever only and combination therapy was significant,
there was no evidence that combination
therapy cost more than controller only therapy.
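The size of each adjustment follows directly from the figures quoted above. The arithmetic below uses only those reported dollar amounts; the non-FFS lines are differences between the reported adjusted cost levels.

```latex
\begin{align*}
\text{FFS, reliever vs. combination:} \quad & 1{,}471 - 728 = 743 \ \text{(reduction after adjustment)} \\
\text{FFS, controller vs. combination:} \quad & 1{,}298 - 1{,}216 = 82 \ \text{(reduction after adjustment)} \\
\text{Non-FFS, combination vs. reliever:} \quad & 1{,}933 - 1{,}266 = 667 \\
\text{Non-FFS, controller vs. combination:} \quad & 1{,}959 - 1{,}933 = 26
\end{align*}
```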
We also adjusted for heterogeneity in the sample
by using multivariate analysis. Multivariate analysis
and propensity score matching both control for the
observed differences in treatment groups.
Therefore, multivariate analysis serves as a sensitivity
analysis in our application. We modeled
health care expenditures as a function of the patients'
demographic and clinical factors used in the
multinomial logit, and we added two dummy variables:
one for reliever only and one for controller
only. Following the principles proposed by
Manning and Mullahy [11], we used a generalized
linear model with a log-link function and gamma
family. Marginal effects from estimated parameters
are presented in the last row of Table 3. The
differences in total health care expenditure across
the three treatment arms were similar to
the propensity score adjusted differences.
For the FFS group, comparing combination
therapy with reliever only therapy, the difference
was $761 according to multivariate analysis
($728 with propensity score
matching). The expenditure difference between
controller only therapy and combination therapy
was $1,280 ($1,216 with propensity score matching).
For the non-FFS group, the estimated cost of
combination therapy was $1,265, reliever only
therapy was $841, and controller only therapy
was $1,161. The differences in cost estimates
between multivariate and propensity score
adjustment were not statistically significant.
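With a log link, the coefficient on a treatment dummy acts multiplicatively on expected cost, so a dollar-scale marginal effect has to be computed from predictions rather than read off the coefficient. A minimal sketch of that calculation follows, assuming the effect is averaged over patients; the function name, coefficient, and predicted costs are hypothetical and are not taken from Table 3.

```python
import math

def dummy_marginal_effect(linear_index, delta):
    """Average dollar effect of switching a treatment dummy from 0 to 1.

    linear_index -- x'b for each patient with the dummy set to 0
    delta        -- coefficient on the treatment dummy (log scale)
    """
    effects = [math.exp(xb + delta) - math.exp(xb) for xb in linear_index]
    return sum(effects) / len(effects)

# Hypothetical values: two patients with predicted costs of $1,000 and
# $2,000 under combination therapy, and a dummy that halves expected cost.
linear_index = [math.log(1000.0), math.log(2000.0)]
delta = math.log(0.5)
print(round(dummy_marginal_effect(linear_index, delta), 2))
```

Because the effect is proportional to each patient's predicted cost, the same coefficient implies a larger dollar difference for patients with higher baseline expenditure.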
Discussion and Conclusion
In an environment where researchers have to rely
on observational data, propensity score matching
is a valuable tool to control for confounding factors
when estimating average treatment effects. The
conventional form, matching on only two groups,
is now widely used in health services research.
Statistical properties have been investigated by
many researchers, and guidelines have been provided
to choose among the different types of
propensity score matching techniques [12].
Although the extension of conventional propensity
score matching is straightforward and easy to
apply, there are some basic differences in
approach. First, when propensity score matching
is applied to two groups, a regression of the outcome
on the treatment indicator and the estimated propensity
score has a causal
interpretation. This is not true for multilevel
treatments. Second, with two groups, group A and
group B, matched on propensity scores, the
difference in outcomes reveals the
treatment effect. However, extending this pattern
sequentially to three groups produces biased results. Suppose
we matched group A to group B and then matched
group C to the matched observations in group B.
This sort of pattern introduces
an artificial sequence into the matching process
and results in misleading conclusions. Since all
three treatments are available to patients simultaneously,
a multinomial distribution should be used, rather than
a binomial distribution applied to subgroups.
Propensity score matching employs predicted
probabilities as weights to adjust for differences
across treatment patterns; therefore, it is crucial
to estimate these probabilities consistently. The
choice of the model to estimate propensity
scores should be carefully considered. Nested
multinomial logit can be used when the choices are nested,
conditional multinomial logit when the independence of
irrelevant alternatives assumption is violated, and
multinomial logit in the other cases, to
estimate predicted probabilities.
Propensity score matching and multivariate
analysis both control for the observed differences
in the sample to isolate treatment effects. The
choice between the two is not clear-cut. It has
been shown that when there are substantial differences between the treatment groups, multivariate
analysis might yield unstable results [13].
Two limitations should be noted. First, as in the
conventional model, if the groups do not have
substantial overlap, propensity score techniques
may yield substantial errors. Baser [14] applied a
method for estimating the average treatment
effect under limited overlap. Second, matching
may not eliminate unobserved bias. It is quite
possible that the choice of treatment arm
depends on physician or practice-related prescribing
patterns, and the failure to control for
these factors can yield biased estimates. The
bounding approach and the instrumental variable
approach can be used to control for possible
unobserved selection bias, but these estimators
have their own limitations [15, 16].
The discussion in this article does not provide a
rigorous treatment of the theory that underlies
propensity score matching with more than two
categories. These details can be found elsewhere
in the literature [2, 3].
Numerous seminar participants provided comments
that greatly improved this work.
1. Weitzen S, Lapane KL, Toledano AY, Hume AL, Mor V. Principles
for modeling propensity scores in medical research: a systematic literature
review. Pharmacoepidemiol Drug Saf 2004;13:841-53.
2. Imbens GW. The role of the propensity score in estimating dose-response
functions. Biometrika 2000;87:706-10.
3. Lechner M. Identification and estimation of causal effects of multiple
treatments under the conditional independence assumption: IZA;
4. Rubin DB. Inference and missing data. Biometrika 1976;63:581.
5. Rosenbaum PR, Rubin DB. The central role of the propensity
score in observational studies for causal effects. Biometrika
6. Foster EM. Propensity score matching: an illustrative analysis of
dose response. Medical Care 2003;41:1183-92.
7. Wang J, Donnan PT, Steinke D, MacDonald TM. The multiple
propensity score for analysis of dose-response relationships in drug
safety studies. Pharmacoepidemiol Drug Saf 2001;10:105-11.
8. McFadden D. Conditional Logit Analysis of Qualitative Choice
Behavior: Institute of Urban and Regional Development, University of
9. Tse YK. A Diagnostic Test for the Multinomial Logit Model.
Journal of Business & Economic Statistics 1987;5:283-6.
10. Maddala GS. Limited-Dependent and Qualitative Variables in
Econometrics: Cambridge University Press; 1986.
11. Manning WG, Mullahy J. Estimating log models: to transform or
not to transform? J Health Econ 2001;20:461-94.
12. Baser O. Too much ado about propensity score models?
Comparing methods of propensity score matching. Value Health
13. Baser O. Choosing Propensity Score Matching over Regression
Adjustment for Causal Inference: When, Why and How it makes
sense. Journal of Medical Economics 2007;10:379-91.
14. Baser O. Propensity Score Matching with Limited Overlap.
Economics Bulletin 2007;9:1-8.
15. McClellan M, McNeil BJ, Newhouse JP. Does more intensive
treatment of acute myocardial infarction in the elderly reduce mortality?
Analysis using instrumental variables. JAMA 1994;272:859-66.
16. Rubin DB. Estimating Causal Effects from Large Data Sets Using
Propensity Scores. the American College of Physicians; 1997.