Considerations in Comparing Groups of People
(The following is taken from the Third Plenary Session, “Patient Reported
Outcomes: Implementing Good Research Practices,” presented at the ISPOR
13th Annual International Meeting, May 6, 2008, Toronto, ON, Canada)
Traditional clinical measures (e.g., vital signs) are very commonly used and
understood. As an example, measures taken from a patient could be 140/90 or
98.6º F (37.0º C), and we would all know we are talking about blood pressure
and temperature. These measures are very important when we are trying to
figure out how a patient is doing, and with vital signs we really do not have to
interact with the patient to take them. But even when a patient does not look
as well as he did, we can get some very useful information that goes beyond
vital signs by asking people how they are feeling.
When dealing with PROs, specifically health-related quality of life, we must go
to the person directly and ask a series of questions. To continue with the
previous example, this patient can tell us he is actually doing quite well: his
health is excellent, he gets along well with his wife, he has a lot of energy, he
is good at his job, and he can walk a block. We could probably find out a lot
more with additional questions. The one negative thing this person reported is
that his vision is bad, but overall he seems to be doing pretty well in terms of
health-related quality of life. Health-related quality of life covers two main
areas: what people are able to do and how they feel about their life. This is
the focus, and it is important information to collect alongside clinically based
information. However, when we look at these measures, it is important to be
skeptical about them.
In fact, any kind of data or measures should be met with a level of skepticism.
We should not just be skeptical about the patient reported outcomes, we also
need to be skeptical about vital signs and other clinical measures; many of
these measures have problems. They rely on clinicians' judgment, perception,
and interpretation; different clinicians often do not agree with one another,
and they do not necessarily agree with themselves over time, so there is some
test-retest unreliability. It is also the case that if clinicians discover they have
underdiagnosed in one situation, they may try to compensate in the future and
go in the other direction, introducing bias the other way.
Therefore, it is important to have a balance in terms of our skepticism about
measures in general. In fact, in 2007, there was an article published in the
Mayo Clinic Proceedings, first-authored by Beth Hahn, which looked at the
precision of health-related quality of life measures in comparison to clinical
measures and showed that they were essentially equal under most circumstances,
so that they are interchangeable in terms of their reliability.
One way that patient-reported outcomes are monitored over time can be
illustrated by an example from the Centers for Disease Control and Prevention
in the United States, which runs the Behavioral Risk Factor Surveillance System.
Every year since 1993, it has conducted telephone surveys of a random sample
of U.S. adults, asking basic health-related quality of life questions. One
question is the most widely used single-item measure: "Would you say that,
in general, your health is excellent, very good, good, fair, or poor?" Among
adults in the United States, the lower two categories, fair or poor health, are
reported by about 16% overall, and this has been pretty consistent from 1993
to 2006.
You can look at the percentage of people who report their health to be fair or
poor by subgroup, for example, by age. What we see is what might be expected:
as people get older, they tend to report worse health. Those 75 and older
report fair or poor health at 33%, versus 9% for those 18 to 24, with a pretty
clear gradient across the age categories (see Graph 1).
|Graph 1. Percentage with Fair or Poor Self-rated Health by Age Group
You could also arrange the data by gender: males versus females.
Females tend to report worse health than males in the United States, so a
greater percentage report fair or poor health, but the difference is very small
(17% versus 15%). The green line from 1993 to 2006 represents females; below
it is males. Is this a real difference, or is it due to something else? Is it a
reporting difference, a different style of reporting, etc.? (See Graph 2)
|Graph 2. Percentage with Fair or Poor Self-rated Health by Gender
Because these are bivariate comparisons, there is no adjustment for anything
else. As previously shown, older age is associated with worse self-reported
health. With adjustment for age, the difference should shrink, because U.S.
women live longer than men, so the women in the comparison are older, on
average, than the men. Even with case-mix adjustment, there may still be a
difference: other datasets show one. For example, Dennis Fryback and others
published national estimates of self-rated health in Medical Care in 2007 and
found gender differences even within age groups. Thus, the data on the whole
suggest very small differences, with females in the United States tending to
report slightly worse health than males.
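The confounding-by-age logic above can be sketched with direct standardization: apply each gender's age-specific rates to a common standard population. The rates and population weights below are hypothetical, purely for illustration; real BRFSS analyses use the actual survey data and weights.

```python
# Percentage reporting fair/poor health by age group, per gender
# (hypothetical numbers, not BRFSS estimates).
rates = {
    "female": {"18-44": 9.0, "45-64": 18.0, "75+": 34.0},
    "male":   {"18-44": 9.0, "45-64": 17.0, "75+": 32.0},
}
# Share of a common standard population in each age group (hypothetical).
std_pop = {"18-44": 0.50, "45-64": 0.35, "75+": 0.15}

def age_adjusted_rate(gender_rates, standard):
    """Direct standardization: weight each age-specific rate
    by the standard population's share of that age group."""
    return sum(gender_rates[g] * w for g, w in standard.items())

for gender, r in rates.items():
    print(gender, round(age_adjusted_rate(r, std_pop), 2))
```

Because both genders' rates are weighted by the same age distribution, whatever difference remains after adjustment cannot be explained by women simply being older in the sample.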
Data of this type still do not reveal whether there really is a difference or not.
It cannot simply be assumed that the data are equivalent across the different
subgroups being compared. Therefore, it is very important to evaluate the
data for equivalence by important individual characteristics, such as age,
gender, race, language, and site. The site could be a micro-site; it could also
be a macro-site, such as a country, and in some cases we are interested in
comparing health-related quality of life by country. In addition, we want to
evaluate equivalence with respect to administrative effects: the order in which
questions are administered; the time of assessment, whether pre or post
(there could be some response shift going on); the mode of administration,
whether mail, phone, IVR, or some other mode; and perhaps the form. There
could be alternative forms, and we want to know whether they are equivalent,
because they are being used as if they were. So there are a lot of things that
need to be evaluated to ensure that we are really getting equivalent data, so
that the comparisons are meaningful. Is this difference by gender a real
difference, or is something else happening?
Item response theory is a powerful way of examining equivalence. Confirmatory
factor analysis is a very useful and closely related approach. In item response
theory there is a focus on category response curves, which show the relationship
between an estimate of where the person is on the concept or construct being
measured and the likelihood of each possible response. Assume there is a
series of items that can align people along a continuum from very severe to
very low depression. We can look at the probability of responding in each
category for the individual items. Here is one of the items; it has five possible
categories (see Graph 3).
|Graph 3. Category Response Curve for Depressive Symptom Item
This is actually a good result: the probability of responding "always" is much
higher if you are at the very severe end of the continuum, and much lower if
you are at the very low end. The ordering of the curves fits the rank order of
the categories that you would expect.
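Category response curves like this can be computed under Samejima's graded response model, a common IRT model for ordered response categories. A minimal sketch follows; the slope and threshold parameters are hypothetical, not taken from any calibrated bank.

```python
import math

def grm_category_probs(theta, a, thresholds):
    """Category response probabilities for one item under the graded
    response model. `a` is the slope; `thresholds` are the ordered
    category boundaries. Returns one probability per category."""
    # Cumulative P(X >= k), with P(X >= lowest) = 1 and P(X > highest) = 0.
    cum = ([1.0]
           + [1 / (1 + math.exp(-a * (theta - b))) for b in thresholds]
           + [0.0])
    # Each category's probability is the difference of adjacent cumulatives.
    return [cum[k] - cum[k + 1] for k in range(len(cum) - 1)]

# A five-category item ("never" ... "always") with hypothetical parameters.
a, bs = 2.0, [-1.5, -0.5, 0.5, 1.5]
for theta in (-2.0, 0.0, 2.0):
    print(theta, [round(p, 2) for p in grm_category_probs(theta, a, bs)])
```

At the severe end of the continuum the "always" category dominates, and at the low end "never" dominates, which is the well-ordered pattern the graph describes.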
The category response curves are useful for looking at differential item
functioning. Because the relationship can be modeled between an estimate of
where someone is on the construct being measured and the probability of
responding in each category, the sample can be segmented into whatever
groups are of interest to see whether the groups show the same pattern of
responding (see Graph 4).
|Graph 4. Examples of Differential Item Functioning (blue represents U.S.
and purple represents Canada)
The further we travel to the right along this continuum, the greater the
probability of saying "yes" to the item for everybody. But in the hypothetical
example shown on the left, Canadians and people from the U.S. who have a
50% probability of saying "yes" are located at different places on the underlying
continuum: it does not take as much of the construct to have an equal
probability of saying "yes" in the U.S. as it does in Canada. That is an indicator
of differential item functioning, or lack of equivalence, represented here by
parallel curves that do not intersect (a "location" difference); the difference
between the groups is similar throughout the continuum. Another type of
differential item functioning would be an interaction, with the curves crossing,
meaning that the difference in how people respond depends on where they are
on the continuum. This "slope" difference is a more complicated lack of
equivalence.
Item response theory can be used to develop questions in the same way as
always, but there is more flexibility in what can be done with the data. You can
take a series of items from different places, and even write some new items,
put them in an item pool, subject them to the same qualitative methods as
always, and then conduct psychometric analyses. When you conduct the
psychometric analyses, you are able to put everything on the same calibration
or scale, so you get an estimate of where every item falls on the continuum
relative to the other items. This allows flexibility in future administrations:
given that all of the items are calibrated together, you can select short forms
or do computer adaptive testing, which administers a subset of items and still
yields an estimate of the underlying construct on the same metric. In addition
to the category response curves we saw, we also have item information
functions, which help with choosing items.
We can look at item information curves for each item in a five-item measure of
depressive symptoms. Information is the IRT analogue of reliability: the higher
the information, the better the reliability or precision. These curves show where
along the continuum the maximum information is obtained for each item (see
Figure 5).
|Figure 5. Depressive symptoms item information functions
At the extremes, there is not as much information. We can compare one of the
items, "I felt unhappy," with another item, "I felt depressed." In this comparison
it is apparent that the latter item provides more information. The shapes of the
item information curves often differ, and information over the continuum
improves with multiple items. A computer adaptive test can be administered
in which an estimate of the person's score is used to select the most informative
items for that person. Each time an item is administered, the estimate is
updated and the next most informative item is chosen. Testing can stop once
whatever level of precision is needed has been reached, given practical
constraints.
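The select-and-update loop just described can be sketched for dichotomous 2PL items, where an item's Fisher information at theta is a²P(1−P). The three-item bank below is hypothetical (the item names echo those in the text, but the parameters are invented), and the score-update step is omitted for brevity.

```python
import math

def p2pl(theta, a, b):
    """2PL probability of endorsing an item."""
    return 1 / (1 + math.exp(-a * (theta - b)))

def information(theta, a, b):
    """Fisher information of a 2PL item at theta: a^2 * P * (1 - P).
    Information peaks near theta = b and falls off at the extremes."""
    p = p2pl(theta, a, b)
    return a * a * p * (1 - p)

# Hypothetical calibrated bank: (item text, slope a, location b).
bank = [("I felt unhappy", 1.2, -0.5),
        ("I felt depressed", 2.2, 0.0),
        ("I felt hopeless", 1.8, 1.0)]

def next_item(theta_hat, remaining):
    """One CAT step: pick the not-yet-administered item with maximum
    information at the current score estimate theta_hat."""
    return max(remaining, key=lambda it: information(theta_hat, it[1], it[2]))

print(next_item(0.0, bank)[0])   # -> I felt depressed
```

With these parameters, "I felt depressed" has the steepest slope near theta = 0, so it is chosen first for someone currently estimated at the mean, mirroring the "more information" comparison in the text.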
The Patient-Reported Outcomes Measurement Information System (PROMIS)
project is an example of how item response theory and computer adaptive
testing are being used. There is a PROMIS depressive symptoms item bank, as
well as several other item banks; the depressive symptoms bank has 28 items
that were adapted from the existing literature or written to fill gaps. The items
have been calibrated together. In a sample of about 800 people, the 28 items
in the bank as a whole and short forms derived from them can be evaluated.
One short form that was derived was based on eight questions spanning the
range of the underlying construct of depressive symptoms, from least to most
severe (Form A). All 28 items have estimates of where they fall on the
continuum; eight were picked that span that continuum, from "I felt disappointed
in myself" to "I felt I had no reason for living" (Seong Choi, personal
communication). The items are administered with a past-seven-days recall
interval using a never-to-always response scale. "I felt disappointed in myself"
is an item to which many people would be likely to give a response other than
"never"; they are more likely to endorse it than other items. It is an easier item
in a sense, so it sits at the least severe end of the continuum. At the other
extreme is "I felt I had no reason for living"; very few people will say anything
other than "never" in response to this item. Items should span the range so
that the most appropriate items are available no matter where someone lies on
the underlying continuum. Eight other items were selected to constitute an
alternative short form (Form B). The items in Form B are similar to, but
different from, the Form A items: "I felt sad" is the easiest item to endorse,
while "I felt I wanted to give up on everything" is the hardest.
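One simple way to build such a spanning short form from a calibrated bank is to sort the items by their location estimates and take evenly spaced ranks. The item locations below are hypothetical illustrations, not the actual PROMIS calibrations.

```python
# Hypothetical item locations on the severity continuum (not PROMIS values).
bank = {
    "I felt disappointed in myself": -1.2,
    "I felt sad": -1.0,
    "I felt worthless": 0.3,
    "I felt hopeless": 0.8,
    "I felt I had no reason for living": 1.9,
}

def spanning_form(item_locations, n):
    """Choose n items (n >= 2) at evenly spaced ranks of the location
    ordering, so the short form runs from least to most severe."""
    ordered = sorted(item_locations, key=item_locations.get)
    if n >= len(ordered):
        return ordered
    step = (len(ordered) - 1) / (n - 1)
    return [ordered[round(i * step)] for i in range(n)]

print(spanning_form(bank, 3))
```

The first and last items selected are always the easiest and hardest to endorse, which is exactly the "disappointed in myself" to "no reason for living" span described for Form A.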
This can be compared with an eight-item computer adaptive test, in which the
best eight items for each individual are picked; each individual could have
different items administered. When this comparison is made within the PROMIS
data, the mean scores are the same regardless of form, whether a short form,
the computer adaptive test, or the full bank of all 28 items (scores are on a
T-score metric with a mean of 50 in the calibration sample). The minimum and
maximum are similar, but the CAT and the full bank represent the range a little
better; there is a larger range in those two, which is what you might expect,
because, all things being equal, the CAT and full bank are better. The
correlations between the alternative forms are very strong, especially between
each short form and the full bank (0.95 and above), and the computer-adaptive
test correlates 0.98 with the full bank. Regardless of which short form or
computer adaptive test is used, the full set of 28 items is represented well.
The distribution of the computer-adaptive test scores is a little better, and the
full pool of items is the best, but there is not much difference in the
distributions regardless of which form you use.
Finally, person fit can also be evaluated; it represents the extreme case of
differential item functioning or lack of equivalence. In IRT, it can be seen
whether a person fits the model being used. If they do not fit, their responses
are not consistent with what the model predicts: they may answer one item in
a way that predicts a high score on another item, yet score low on it. If they do
that often enough, they get a very high index of person misfit, indicating they
are not fitting the model.
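A common person-fit index of this kind is the standardized log-likelihood, lz, sketched here for dichotomous 2PL items with hypothetical parameters; strongly negative values flag response patterns the model cannot explain.

```python
import math

def p2pl(theta, a, b):
    """2PL probability of endorsing an item."""
    return 1 / (1 + math.exp(-a * (theta - b)))

def lz_person_fit(theta, items, responses):
    """Standardized log-likelihood (lz) person-fit index for 0/1 responses
    under the 2PL: (observed log-likelihood - its expectation) / SD.
    Large negative values indicate misfitting response patterns."""
    loglik = mean = var = 0.0
    for (a, b), x in zip(items, responses):
        p = p2pl(theta, a, b)
        loglik += x * math.log(p) + (1 - x) * math.log(1 - p)
        mean += p * math.log(p) + (1 - p) * math.log(1 - p)
        var += p * (1 - p) * math.log(p / (1 - p)) ** 2
    return (loglik - mean) / math.sqrt(var)

# Hypothetical items (a, b), ordered from easiest to hardest to endorse.
items = [(1.5, -1.0), (1.5, 0.0), (1.5, 1.0)]
# Consistent pattern: endorses the easier items but not the hardest.
print(round(lz_person_fit(0.5, items, [1, 1, 0]), 2))
# Inconsistent pattern: endorses only the hardest item.
print(round(lz_person_fit(0.5, items, [0, 0, 1]), 2))
```

The consistent pattern yields an lz near zero, while the reversed pattern, endorsing only the item the model says should be hardest, yields a strongly negative lz, the signature of person misfit.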
Patient-reported outcomes (PROs) are as reliable as other measures of patient
health. But it is important to demonstrate the equivalence of PROs for the
different groups being compared substantively, and IRT provides a very strong
empirical basis for doing this.
*Preparation of the presentation on which this article is based was supported
in part by the National Institutes of Health through the NIH Roadmap for
Medical Research Grant (AG015815), PROMIS Project.