The Official News & Technical Journal Of The International Society For Pharmacoeconomics And Outcomes Research

Considerations in Comparing Groups of People with PROs

Ron Hays PhD, Professor of Medicine, UCLA Department of Medicine, Division of General Internal Medicine & Health Services Research, Los Angeles, CA, USA

(The following is taken from the Third Plenary Session, “Patient Reported Outcomes: Implementing Good Research Practices,” presented at the ISPOR 13th Annual International Meetings, May 6, 2008, Toronto, ON, Canada)

Traditional clinical measures (e.g., vital signs) are commonly used and well understood. If measures taken from a patient are 140/90 or 98.6º F (37.0º C), we all know we are talking about blood pressure and temperature. These measures are very important when we are trying to figure out how a patient is doing, and they can be taken without much interaction with the patient. But we can get very useful information that goes beyond vital signs by asking people how they are feeling.

When dealing with PROs, specifically health-related quality of life, we must go to the person directly and ask a series of questions. To continue with the previous example, this patient can tell us he is actually doing quite well: his health is excellent, he gets along well with his wife, he has a lot of energy, he is good at his job, and he can walk a block. We can probably find out a lot more with additional questions. The one negative thing this person reported is that his vision is bad, but overall he seems to be doing pretty well in terms of health-related quality of life. Health-related quality of life has two main areas: what people are able to do and how they feel about their life. This is the focus, and it is important information to obtain along with other, clinically based information. However, when we look at these measures, it is important to be skeptical about them.

In fact, any kind of data or measure should be met with a level of skepticism. We should not just be skeptical about patient-reported outcomes; we also need to be skeptical about vital signs and other clinical measures, many of which have problems. They rely on clinicians' judgment, perception, and interpretation, and different clinicians often do not agree with one another, nor do they necessarily agree with themselves over time, so there is some test-retest unreliability. It is also the case that if clinicians discover they have underdiagnosed in one situation, they may try to compensate in the future and err in the other direction; then there is bias going the other way. Therefore, it is important to have a balance in our skepticism about measures in general. In fact, a 2007 article in Mayo Clinic Proceedings, first-authored by Beth Hahn, compared the precision of health-related quality of life measures with that of clinical measures and showed that they were essentially equal under most circumstances, so that they are comparable in terms of reliability.

One way that patient-reported outcomes are monitored over time can be illustrated by an example from the Centers for Disease Control and Prevention in the United States: the Behavioral Risk Factor Surveillance System. Every year since 1993, they have conducted telephone surveys of a random sample of U.S. adults, asking basic health-related quality of life questions. One of these is the most widely used single-item question: “Would you say that, in general, your health is excellent, very good, good, fair, or poor?” Among adults in the United States, the lower two categories, fair or poor health, are reported by about 16% overall, and that figure is fairly consistent from 1993 to 2006.

You can look at the percentage of people who report their health to be fair or poor by subgroups – for example, by age. What we see is what might be expected: as people get older, they tend to report worse health. The difference is 33% reporting fair or poor health for those 75 and older versus 9% for those 18 to 24, with a pretty nice gradient at each age category as you get older (see Graph 1).


Graph 1. Percentage with Fair or Poor Self-rated Health by Age Group

You could also arrange the data by gender: males versus females. Females tend to report worse health than males in the United States, so a greater percentage report fair or poor health, but the difference is very small (17% versus 15%). The green line, from 1993 to 2006, is females, and below it is males. Is this a real difference, or is it for some other reason? Is it a reporting difference, a different style of reporting? (See Graph 2)

Graph 2. Percentage with Fair or Poor Self-rated Health by Gender

Because these are bivariate comparisons, there is no adjustment for anything else. As previously shown, older age is associated with worse self-reported health. With adjustment for age, the gender difference should shrink, because U.S. women live longer than men and there are therefore more older women than older men in the comparison. Even with case-mix adjustment, there may still be a difference: in other datasets, for example the national estimates of self-rated health that Dennis Fryback and others published in Medical Care in 2007, gender differences were found even within age groups. Thus, the data on the whole suggest very small differences, with females in the United States tending to report slightly worse health than males.

Data of this type still do not reveal whether there really is a difference or not. It cannot simply be assumed that the data are equivalent for the different subgroups being compared. Therefore, it is very important to evaluate the data for equivalence by important individual characteristics, such as age, gender, race, language, and site. The site could be a micro-site, or it could be a macro-site such as a country; in some cases we are interested in comparing health-related quality of life by country. In addition, we want to evaluate equivalence with respect to administrative effects: the order in which questions are administered; the time, whether pre- or post-assessment (there could be some response shift going on); the mode of administration, whether mail, phone, IVR, or some other mode; and perhaps the form, since alternative forms may be used as if they were equivalent, and we want to know whether they really are. So there are a lot of things that need to be evaluated to ensure that we are getting equivalent data and that the comparisons are meaningful. Is this difference by gender a real difference, or is something else happening?

Item response theory is a powerful way of examining equivalence. (Confirmatory factor analysis is a very useful and closely related approach.) In item response theory, the focus is on category response curves, which show the relationship between an estimate of where the person is on the concept or construct being measured and the likelihood of each possible response. Assume there is a series of items that can align people along a continuum from very severe to very low depression. We can look at the probability of responding in each category for the individual items. Here is one of the items; it has five possible categories (see Graph 3).

Graph 3. Category Response Curve for Depressive Symptom Item

This is actually a good result: the probability of responding “always” is much higher if you are at the very severe end of the continuum and much lower if you are at the very low end. The ordering fits the rank order of the categories that you expect.
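Curves like those in Graph 3 can be sketched numerically. The following is a minimal illustration assuming Samejima's graded response model, one common IRT model for ordered categories; the item parameters are made up for demonstration and are not the actual PROMIS calibrations:

```python
import math

def grm_category_probs(theta, a, thresholds):
    """Category response probabilities under the graded response model.

    theta: person location on the latent continuum (e.g., depression severity)
    a: item discrimination (slope)
    thresholds: ordered category boundaries b_1 < ... < b_{K-1}
    Returns a list of K probabilities, one per response category.
    """
    # P*(k) = probability of responding in category k or higher
    def p_star(b):
        return 1.0 / (1.0 + math.exp(-a * (theta - b)))

    cum = [1.0] + [p_star(b) for b in thresholds] + [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(thresholds) + 1)]

# Hypothetical 5-category item ("never" ... "always"), illustrative parameters
probs_low  = grm_category_probs(theta=-2.0, a=1.7, thresholds=[-1.0, 0.0, 1.0, 2.0])
probs_high = grm_category_probs(theta=3.0,  a=1.7, thresholds=[-1.0, 0.0, 1.0, 2.0])
print([round(p, 3) for p in probs_low])   # "never" is most likely at low severity
print([round(p, 3) for p in probs_high])  # "always" is most likely at high severity
```

Plotting these probabilities over a grid of theta values reproduces the shape of a category response curve: each category is most likely over its own segment of the continuum, in rank order.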

The category response curves are useful for looking at differential item functioning. Because the relationship can be modeled between an estimate of where someone is on the construct being measured and the probability of responding in each category, the sample can be segmented into whatever groups are of interest to see whether the groups show the same pattern of responding (see Graph 4).

Graph 4. Examples of Differential Item Functioning (blue represents U.S. and purple represents Canada)

The further we travel to the right along this continuum, the greater the probability of saying “yes” to the item for everybody. But in the hypothetical example shown on the left, Canadians and people from the U.S. who have a 50% probability of saying yes are located at different places on the underlying continuum. It takes less of the construct to have an equal probability of saying yes in the U.S. than in Canada. That is an indicator of differential item functioning, or lack of equivalence, represented by parallel curves that do not intersect (a “location” difference): the group difference is similar throughout the continuum. Another type of differential item functioning is an interaction between the curves, meaning that the difference in how the groups respond depends on where one is on the continuum. This “slope” difference is a more complicated lack of equivalence.
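The two kinds of differential item functioning just described can be illustrated for a yes/no item under a two-parameter logistic IRT model; the group parameters below are invented for the sketch and do not come from any real U.S.–Canada comparison:

```python
import math

def p_yes(theta, a, b):
    """Probability of endorsing a yes/no item under a 2-parameter logistic model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

thetas = [-3, -2, -1, 0, 1, 2, 3]

# "Location" DIF: same slope, the item is "easier" for one group (lower b),
# so one group's curve sits above the other's everywhere and they never cross.
loc_us = [p_yes(t, a=1.5, b=-0.5) for t in thetas]
loc_ca = [p_yes(t, a=1.5, b=0.5) for t in thetas]

# "Slope" DIF: different discriminations, so the curves cross and the
# direction of the group difference depends on where you are on the continuum.
slp_us = [p_yes(t, a=2.0, b=0.0) for t in thetas]
slp_ca = [p_yes(t, a=0.8, b=0.0) for t in thetas]

print(all(u > c for u, c in zip(loc_us, loc_ca)))  # True: no crossing
print(any(u < c for u, c in zip(slp_us, slp_ca)))  # True: curves cross
```

In the location case the gap between the groups is roughly constant across the continuum; in the slope case the sign of the gap flips, which is why slope DIF is the more complicated lack of equivalence.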

Item response theory can be used to develop questions in the same way as always, but there is more flexibility in what can be done with the data. You can take a series of items from different places, and even write some new items, put them in an item pool, subject them to the same qualitative methods as always, and then conduct psychometric analyses. When you conduct the psychometric analyses, you are able to put everything on the same calibration or scale, so you get an estimate of where every item falls on the continuum relative to the other items. This allows flexibility in future administrations: because all of the items are calibrated together, you can select short forms or do computer adaptive testing, administering a subset of items and still getting an estimate of the underlying construct on the same metric. In addition to the category response curves we saw, we also have item information functions, which help in this process.

We can look at item information curves for each item in a five-item measure of depressive symptoms. Information is the equivalent of reliability: the higher the information, the better the reliability or precision. These curves show where the maximum information is obtained for each item (see Figure 5).

Figure 5. Depressive symptoms item information functions

At the extremes, there is not as much information. Comparing the item “I felt unhappy” with the item “I felt depressed,” it is apparent that the latter provides more information. The shapes of item information curves often differ, and information over the continuum improves with multiple items. In a computer adaptive test, an estimate of the person's score is used to select the most informative items for that person. Each time an item is administered, the estimate is updated and the next most informative item is chosen. Testing can be stopped once whatever level of precision is needed has been reached, given the constraints of the measure.
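The selection-by-information loop just described can be sketched with binary items under a two-parameter logistic model, where item information is a²P(1 − P) and the standard error of measurement is 1/√(total information). The item pool, parameters, and stopping threshold below are all illustrative assumptions, and the person's score estimate is held fixed rather than re-estimated after each response as a real CAT would do:

```python
import math

def info_2pl(theta, a, b):
    """Fisher information of a yes/no item under the 2-parameter logistic model:
    I(theta) = a^2 * P * (1 - P), maximized where theta equals the item location b."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

# Hypothetical item pool: (discrimination a, location b) -- illustrative values
pool = [(1.2, -2.0), (1.8, -1.0), (2.0, 0.0), (1.6, 1.0), (1.1, 2.0)]

def select_items(theta_est, pool, target_se=0.8):
    """Greedy maximum-information selection: keep administering the most
    informative remaining item until the standard error 1/sqrt(total info)
    reaches the target precision."""
    remaining, chosen, total_info = list(pool), [], 0.0
    while remaining and (total_info == 0 or 1 / math.sqrt(total_info) > target_se):
        best = max(remaining, key=lambda it: info_2pl(theta_est, *it))
        remaining.remove(best)
        chosen.append(best)
        total_info += info_2pl(theta_est, *best)
        # A real CAT would re-estimate theta here from the response so far.
    return chosen

print(select_items(0.0, pool))
```

For a person estimated at the middle of the continuum, the item located at 0 is picked first (its information peaks there), and the test stops before exhausting the pool once the precision target is met, which is exactly the economy CAT offers over fixed forms.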

The Patient-Reported Outcomes Measurement Information System (PROMIS) project is an example of how item response theory and computer adaptive testing are being used. Among several PROMIS item banks is a depressive symptoms item bank of 28 items, adapted from the existing literature or written to fill gaps, that have been calibrated together. In a sample of about 800 people, the 28-item bank as a whole and short forms derived from it can be evaluated.

One short form was based on 8 questions spanning the range of the underlying construct of depressive symptoms from least to most severe (Form A). All 28 items have estimates of where they fall on the continuum, and eight were picked to span it, from “I felt disappointed in myself” to “I felt I had no reason for living” (Seong Choi, personal communication). The items are administered with a past-seven-days recall interval using a never-to-always response scale. “I felt disappointed in myself” is an item to which many people would be likely to give a response other than “never”; they are more likely to endorse it than other items. It is an easier item, in a sense, so it sits at the least severe end of the continuum. At the other extreme is “I felt I had no reason for living”; very few people will say anything other than “never” in response to this item. Items should span the range so that the most appropriate items are available depending on where someone happens to lie on the underlying continuum. Eight other items constitute an alternative short form (Form B). The Form B items are similar to, but different from, the Form A items: “I felt sad” is the easiest item to endorse, while “I felt I wanted to give up on everything” is the hardest.

This can be compared with an eight-item computer adaptive test, where the best eight items for each individual are picked; each individual could have different items administered. When these are compared within the PROMIS data, the mean scores (on a T-score metric where the mean is 50 in the calibrated sample) are the same regardless of which short form is used, whether it is the computer adaptive test or the full bank of all 28 items. The minimum and maximum are similar, but the CAT and the full bank represent the range a little better; there is a larger range in those two, which is what you might expect, because the CAT and the full bank are better, all things being equal. The correlations between the alternative forms are very strong: 0.95 and above between each short form and the full bank, and 0.98 between the computer adaptive test and the full bank. Regardless of which short form or computer adaptive test is used, the full set of 28 items is represented well. The distributions of the computer adaptive test scores are a little better, and the full pool of items is best, but there is not much difference in the distributions regardless of which form is used.

Finally, person fit can also be evaluated; it represents the extreme case of differential item functioning or lack of equivalence. In IRT, one can see whether a person fits the model being used. If they do not, their responses are not consistent with what the model predicts: they may answer one item in a way that predicts they would score high on another, but instead they score low, and if this happens often enough, they get a very high index of person misfit, indicating they do not fit the model.
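A crude version of such a misfit index can be sketched as follows. The 2PL model, the item parameters, and the mean-squared-residual statistic are all illustrative assumptions, not the person-fit statistic used in any particular IRT software:

```python
import math

def p_yes(theta, a, b):
    """Probability of endorsing a yes/no item under a 2-parameter logistic model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def person_misfit(theta, items, responses):
    """Crude person-fit index: mean squared standardized residual of the
    observed 0/1 responses against the model's predicted probabilities.
    Larger values mean the responses are less consistent with the model."""
    total = 0.0
    for (a, b), x in zip(items, responses):
        p = p_yes(theta, a, b)
        total += (x - p) ** 2 / (p * (1 - p))
    return total / len(items)

# Five hypothetical items ordered from easiest to hardest to endorse
items = [(1.5, -2.0), (1.5, -1.0), (1.5, 0.0), (1.5, 1.0), (1.5, 2.0)]
consistent   = [1, 1, 1, 0, 0]  # endorses the easy items, not the hard ones
inconsistent = [0, 0, 0, 1, 1]  # the reverse: endorses only the hardest items
print(person_misfit(0.0, items, consistent))
print(person_misfit(0.0, items, inconsistent))
```

The pattern that endorses only the hardest items, while denying the easy ones, yields a far larger index, which is exactly the kind of response pattern a person-misfit flag is meant to catch.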

Patient-reported outcomes (PROs) are as reliable as other measures of patient health. But it is important to demonstrate the equivalence of PROs for the different groups being compared substantively, and IRT provides a very strong empirical basis for doing this.

*Preparation for the presentation from which this article is based was supported in part by the National Institutes of Health through the NIH Roadmap for Medical Research Grant (AG015815), PROMIS Project

