Good Research Practices for Evaluating and Documenting Content Validity for the Use of Existing PRO Instruments and Their Modification: A DRAFT Report of the ISPOR PRO Task Force on the Use of Existing Instruments & Their Modification -- Comments


Comments from 25 Different Reviewers or Combined Groups of Reviewers

1.  This was a very good read and addresses a lot of issues where there has been uncertainty.
One aspect missing is the case where the new population being assessed is similar, but not identical, to the population in which the instrument was developed. For example, if the instrument was developed in an 18-65 age group and the trial will enroll a 44-70 age group, is there a need to do additional focus groups, or will cognitive debriefing be sufficient? More importantly, is there a need to do anything at all?

Similarly, if an instrument was developed for patients with moderate to severe disease and the clinical trial includes 1) mild, moderate, and severe patients, or 2) moderate patients only: for the first situation there appears to be a need for cognitive debriefing, and for the second the instrument can be used as is. Please clarify whether my understanding, based on the content of this report, is correct.

Thanks, AF
---------------------------------
2.  Thank you for the opportunity to review this interesting paper.  On the whole I think this is a good summary of the issues related to content validation.  I do have a couple of concerns though and would be interested in your views on them.

We all agree that qualitative data from patients are a very important driver of content validity.  However, I think the paper would benefit from some more in-depth discussion of the qualitative techniques (including analysis) that can be applied to generate these data.  One concern that I have had for a long time is that the content of the discussion in the focus groups is driven in large part by the structure of the discussion guide.  You ask about X and people tell you about X.  Clearly there needs to be a process whereby you decide what should be asked (and this should reflect important issues for patients rather than important issues for products).  There is a risk that the content validity of the instrument is dependent on the validity of the discussion guide.  Certain steps can be taken to address this.  It would be very useful to describe how the qualitative work should be iterative and, at an early stage, could lean on some of the methods from grounded theory so that concepts emerge from the patients rather than from the discussion guide.  Also, of course, the quality of much of these data is highly dependent on the skills of the moderator or interviewer, and this should not be underestimated.

Related to this, the paper may benefit from some more detailed discussion of how these data should be analysed.  For example, what analytic approaches are needed to help support content validity?

Anyway, other than these issues, I think this is a valuable contribution and the case studies are very useful.

----------------

3. My overall comment is that this is a well-written report, clearly covering the scope of the task performed and presented in a way that is likely to meet the needs of the broad spectrum of ISPOR members having an interest in this topic.
 My suggestions would be as follows: 
1) In the abstract: Include a short definition of 'content validity' i.e. 'addresses the issue of the adequacy of the instrument to measure what it purports to measure'
2) In the introduction, 4th line: should 'IPOR' read 'ISPOR'?
3) In Content Validity - A Definition: Page 4, para commencing 'PRO instruments...'. The term 'intersection', used here and in the following para, gave me an impression of precision. Is this intended, or would a softer term be more appropriate, e.g. 'convergence'?
4) In Data Collection Methods: The skills required to effectively run focus groups have previously been well described. In this section, however, no mention is made of the skills required for 1:1 interviewing. Should reference to this be included?
5) In Data Collection Methods: Para 2, the word 'saturation' is used for the first time. Perhaps the inclusion of '(defined and discussed below)' would be more appropriate here than later, in the Data Analysis section.
6) In Data Collection Methods: Para 4, 'Cognitive debriefing interviews can be performed...' Would 'may' or 'should' be more informative for the reader?
Thank you for the opportunity to review this draft and I hope the above comments are helpful.
Only one mildly negative comment, however, from a reviewer's perspective: a document presented with line numbering is much easier to provide feedback on than the format provided in this case.
I look forward to receiving the finalised report in due course.
--------------------------

… a constructive comment that is related to the content of the document and should be forwarded to the task force for consideration.

Nowhere in the document is the term "face validity" mentioned.   Streiner et al. believe the following:   "The terms 'face validity' and 'content validity' are technical descriptions of a judgment that a scale looks reasonable... These two forms of validity consist of a judgement by experts whether the scale appears appropriate for [its] intended purpose."    In the spirit of better incorporating ISPOR position papers into the biomedical literature and also making them more user friendly to a healthcare professional, I think that "content validity" is synonymous with "face validity".  This is akin to saying that a PRO instrument is a questionnaire, i.e., a health measurement scale.
What I would like to see somewhere in the document is the following statement:
"Content validity is sometimes thought of as face validity, and both may be viewed as similar concepts."
Streiner DL, Norman GR. Health Measurement Scales: A Practical Guide to Their Development and Use. 2nd ed. Oxford: Oxford University Press; 1995. p. 5.

5. To Whom It May Concern:
I've now had an opportunity to read over the manuscript and really don't
have many comments. It's written in a manner that is very clear and to the
point, which I think readers will appreciate. I do have some comments (some
are very minor) below which are in order of appearance in the manuscript:

1. I think you meant ISPOR Health Science Policy Council (rather than IPOR)
in the first paragraph of the INTRO.
2. You mention that scoring is part of the instrument as a whole in the
second bullet point in the definition for Content Validity on Page 3. To me
scoring suggests interpretation, but I'm thinking you meant to say something
more related to how the instrument is scored, including whether there are
subscale scores, total scores, etc. In any event, please clarify.
3. You mention that PRO measures can serve as primary, secondary, or
exploratory endpoints in the first paragraph of Page 5. Perhaps it's as
simple as adding an and/or, but it's possible that a PRO instrument can have
individual subscales, some of which are primary, secondary, or exploratory
endpoints.  Also, you almost always use instrument, but in this sentence you
use measure. Are you making a distinction that I'm not aware of? Please
stick to the use of instrument throughout.
4. You mention that patients should be asked about the importance of any
aspects of a concept that were not addressed, in the first paragraph on Page
9.  However, you don't suggest what researchers should do with this info. For
instance, how important does it need to be, or is there a certain % of
patients that need to mention this aspect before it's considered construct
underrepresentation?
5. At the top of Page 15 you mention that "patients in the study use
different..." It should be revised to read as: "patients in the focus group..."

General Comments:
1. You mention in several places that changes to an existing instrument
should be considered, e.g., the addition of new items, changes to a
scale name, etc. However, the importance of working with the developer, or
at least seeking permission for these changes, should be mentioned somewhere...
2. It is not clear how one would achieve saturation within a focus group
setting without the use of one-on-one interviews and operationally how that
would be handled. For instance, researchers might achieve saturation after
the conduct of a third focus group, yet the 4th one is already scheduled.
3. I'm really not sure what the value of the 3 figures is. (Also, a
reference to Figure 1 appears last within the text.)

Thanks for the opportunity of allowing us to provide comments. And again,
job well done!
----------------------------

6. Many thanks for sending this paper.  I found it very helpful, especially given where project TED is right now, and the scenarios were helpful in illustrating the points.  I noticed a few things as I was reading through as noted below. 
pg 12 -- not sure I totally understand the sentence below.  Can it be clarified?
"For example, in symptom measures, the frequency, intensity, and duration of symptoms are attributes for individual symptoms and those attributes that patients identify as important should be measured for all symptoms."
 Scenario A: typo in remediation 2.  Literature search
 Scenario C: last sentence of the scenario description needs to be reviewed -- doesn't read well
 In addition, did the working group give any thought to outlining the type of information (data package) that needs to be provided from the focus groups or other work to support content validity?  I'm thinking specifically about copies of coded transcripts when I mention this, but there are probably other things that might be relevant that the FDA could ask for.  The paper does a nice job of outlining the steps, but what needs to be submitted to the FDA or other regulatory authorities might be a relevant piece also?
Hope all is well with you, Carol
----------------------------

7. General comments
I like the report's focus on content validity.  This topic is the most important issue to the largest number of people. Researchers can use it to support their design choices.  Clinicians can use it in choosing instruments.  Students can use it in understanding the link between PRO measurement and efficacy of medications.   However, I find that the document is not user-friendly to anyone but PhD scientists versed in psychometrics.
I think that an opportunity is being missed here.  I see that this could be a tool for teaching clinicians and students.  PROs are vague and not well understood.  With some editing, this could be used as a primer to educate individuals without a significant research background.  Content validity is particularly useful for training because it relies more on understanding the clinical impact of drugs than on psychometric measurement.
If this is to be a primer, greater clarity is needed.  The rationale for the steps needs to be laid out in more detail.  The reasons for choices need to be explained.  And terms need to be defined.  Consider a glossary of definitions at the end of the report.

Specific comments
Page 2 Introduction - fourth line "IPOR [ISPOR?] Health Science Policy Council..."
Page 3 Content Validity - A definition  

Figure 1 is not clear to me.  The figure does not stand alone without explanation, and the explanation in the text is minimal.  It looks like a good figure -- it just needs more explanation.
Some of the language may be better if simplified.  For instance, "In summary, the relationship between the intended measurement concept and the methods used to develop and select items and evaluate the content and formulation of the instrument are the core of content validity, with a detailed description of these methods and results used as evidence that the proposed use and interpretation of scores from the instrument represent the intended concept, and therefore, possess content validity (1-5)."  An alternative -- In summary, content validity describes the degree to which a measure represents all relevant dimensions of what it intends to measure.  This is determined by the methods used to develop, select, and evaluate content and the transparency of those methods (1-5).
Page 5 -- Please provide a greater explanation of the Target Product Profile.  In industry, this may be perfectly clear but with clinicians and others it requires further explanation.
Page 7 -- Please explain figure 1.  It needs greater explanation.
Page 8 -- A brief explanation of measurement theory might help readers understand why data is collected the way it is.  The rationale should be made clear.
Page 8 -- Terms like the "universe of content" need to be better defined.
Page 12 and 13 -- I like the case examples of threats to validity.
-------------------------------

8.
This is a very comprehensive coverage of content validity. I honestly could not think of any topic that was not addressed. It is very long and some readers may find it too detailed, but they can always skip to the next section as it is very nicely broken into segments. I am sure I will refer to it in coming months; as you know, this is a topic I am working with all the time. It will also be helpful as we work through the details of our MCD instrument and try to document the development of content validity.  Congrats!  Bonnie
---------------------------------

9.
Consider a different source for the definition of concept
“Concept” is “an abstract or generic idea generalized from particular instances” (http://www.merriam-webster.com/dictionary/concept, accessed March 2, 2009). In the context of a regulated claim, the concept is the claim. (page 3)
-------------------------------------

10. Thank you for allowing me to review and comment on this paper. This is a clear and highly readable description of content validity, and it will be of great practical value.  The explanation of saturation alone is very useful.

My only question is on page 3, the third bullet: 
•“Concept” is “an abstract or generic idea generalized from particular instances” (http://www.merriam-webster.com/dictionary/concept, accessed March 2, 2009).   In the context of a regulated claim, the concept is the claim.
The concept is part of the claim, not the claim itself.  If the claim is that this product will control pain, the concept is pain.  
----------------------------------------

11. I have reviewed the task force report "Patient Reported Outcomes Task Force
on the Use of Existing Instruments & Their Modification".

I have the following comments:

On page 3, line 11, "Instrument" and "PRO measure" are presented as equivalent terms. In fact, an instrument is developed to measure a certain PRO, and the PRO measure should be representative of the universe of content that may describe the concept it intends to measure. The question is then whether a PRO measure obtained with a new instrument is representative of the concept it intends to measure.

The case examples presented on pages 12 to 15 could be improved by presenting real cases. The scenarios present examples of threats to content validity, but not examples of how content validity should be developed.  The authors should present published cases of content validity, including specific references.
-------------------------

12.  Please find comments on the draft report which overall looks very good.  While comprehensive, there could be more discussion on recall period and response options if response options and recall period in the existing instrument do not meet the needs of the new project.

page 2

  • “Instrument” means the instrument as a whole, including the content of the items, response scale, instructions to respondents (including recall period and procedures for administration), and scoring (including domains or subscales and total scale scores); and
  • “Concept” is “an abstract or generic idea generalized from particular instances” (http://www.merriam-webster.com/dictionary/concept, accessed March 2, 2009). In the context of a regulated claim, the concept is the claim.
  • “Representative” indicates that the instrument adequately samples the specified domain of content (2).

 Content validity addresses the issue of the adequacy of the instrument to measure what it purports to measure. The classic text, Psychometric Theory, by Jum Nunnally (2), notes that there are two major standards for ensuring content validity. The first is the representative nature of the collection of items comprising the instrument. Because “random sampling” is not possible, the method used to identify and select the items to represent the concept must be explicit. The second, related standard is the use of “sensible” methods of instrument construction (2), that is, the rigor with which the instrument is constructed or formulated, including item wording, response options (e.g., 
dichotomous, Likert, visual analog), and organization (1). The appropriateness of a given content domain is related to the specific inferences to be made from the instrument scores (1).

PRO instruments are designed to capture data related to the health experiences of individuals, specifically the way patients feel or function in relationship to their condition or disease and/or their treatment. Figure 1 depicts the relationship between disease attributes, including observed signs and laboratory values and patient-reported manifestations of the condition, and patient experiences, including their descriptions of the disease attributes and human experiences unrelated to the disease. For PRO instruments, the focus of content validity (the “universe” from which instrument content is sampled) is represented by the intersection of the disease or condition and the patient’s experience, both in relationship to the concept of interest. The focus of this paper is on measures to support evidence of treatment efficacy that only patients can provide. The adequacy of content validity for measures to assess adverse treatment impact, however, would follow the same principles.

In summary, the relationship between the intended measurement concept and the methods used to develop and select items and evaluate the content and formulation of the instrument in the population of interest are the core of content validity. A detailed description of these methods and results is evidence that the proposed use and interpretation of scores from the instrument represent the intended concept, and therefore, the instrument possesses content validity (1-5). In the context of a PRO instrument to support a claim, there are three critical elements to content validity: identifying and defining the concept of interest; understanding the intersection of the disease attributes and patient experience; and documenting the methods used to develop the instrument to capture and quantify the concept.

Page 3:  If a conceptual framework of the instrument is available, it is examined for consistency with the concept. A conceptual framework is a detailed description or diagram of the relationship among the concepts, domains, and items comprising the instrument (10, 11). If such a framework does not exist, one should be developed, showing how the items, subscales, and total scales are related to one another and to the underlying concept and claim. The names used to describe the concept and subscales should be critically evaluated in light of the content and structure of the items, TPP and the target PRO claim. Adjustments in the name or concept referenced in the PRO instrument may need to be made from the original in order to more accurately reflect the content and link to the claim. Strong 
and clear links between item content, subscale names, concept names, study objectives, and target labeling language, are easier to understand, interpret and communicate.

Page 4: Data from focus groups and/or 1:1 interviews with patients form the basis of a PRO instrument. When evaluating candidate instruments, users should examine the data collection methods used to generate the instrument in order to understand the content validity of the measure. Gathering qualitative data through focus groups is both a rigorous scientific method requiring a well-defined protocol and an art requiring a trained and experienced focus group moderator. To assure representation from all group participants and to assist in data analyses, moderators are trained to engage rapport and elicit comments from all group participants without leading or directing participants. An assistant moderator takes notes with participant initials and key words or quotations to facilitate data transcription and analyses. This individual can also map the discussion, marking the frequency with which various participants contribute comments to the discussion to alert the moderator for the need to query certain participants who have been less active in the discussion.

One-on-one interviews are a second type of qualitative methodology that may be used to elicit information during the instrument development process. This approach is particularly effective for sensitive topics unsuited to group discussion or for patient populations unable to participate or uncomfortable in a group setting. One-on-one interviews are also used for cognitive debriefings in which patients review an existing instrument or item pool and provide the developer with insight into the extent to which their language and interpretations match the intent of the items and any critical content that has been omitted from the measure. A combination of focus group and 1:1 interviews can increase confidence that saturation has been reached. Focus groups, interviews, and cognitive debriefing interviews can also be used to evaluate the

Page 5: Cognitive debriefing interviews can be performed with patients from the target population to evaluate patient understanding of the items relative to the concept of interest. These can complement elicitation focus groups or interviews, providing additional evidence of content validity or serve as independent confirmation of content validity. An example of the latter case would be situations in which the instrument development process was consistent with the FDA Draft PRO Guidance (6), documentation appears to be sufficient for submission, and the cognitive interviews are performed to provide additional evidence of content validity for the specific purpose in mind. Cognitive interviews can also provide an opportunity to query patients about the comprehensiveness of the tool relative to their experiences with the concept of interest. Specifically, at the end of the interview, patients may be asked if there were any aspects of the concept, e.g., experiences, symptoms or sensations that were not addressed in the instrument, and if so, how important these are to the concept. If missing themes emerge across multiple interviews and these themes are clearly related to the underlying concept, it is likely the instrument is missing important content and should be modified. This finding is referred to as construct underrepresentation (1).

 Analyses of focus groups and interviews to evaluate and document the content validity of an existing measure are similar to those used in instrument development, identifying themes that emerge from the data in relationship to the concept of interest. These themes are used as analytical codes that are then mapped to the existing instrument content and words and phrases are compared with the wording used in the measure. Choosing specific verbatim patient quotes that are representative of a code can be very helpful in ensuring that all ideas are included and appropriate language is used to construct items and domains.

Page 6: Saturation
Qualitative data should be gathered to the point of saturation to ensure that the items in an instrument appropriately represent the relevant “universe of content” for the concept. Saturation refers to the point in the data collection process when no new concept-relevant information is being elicited from individual interviews or focus groups, determined through serial analysis of data. There is no fixed rule on the sample size needed to reach saturation. However, it is determined, to some extent, by the number of important variables in the target population (e.g., men/women, age groups/geographical distribution, ethnicity). Evidence is based on empirical observation, where no new concepts or codes emerge after the interview or focus group (10, 14). Saturation can be evaluated and documented through a saturation table structured to show the elicitation of information by successive focus group or interview (individual or by set), organized by concept code. For practical purposes of budgeting projects, it is not uncommon to set a sample size of 20-30 interviews or approximately 10 patients representing each variable, even though saturation may occur earlier than the nth interview. Saturation is then documented for where it occurs in the process, often during the interviewing process or sometimes at the end of all interviews.
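As a purely illustrative aside (not taken from the draft), the saturation table described above could be tabulated roughly along the following lines; the concept codes and group labels are hypothetical, and this is only a minimal sketch in Python:

    # Minimal sketch of a saturation grid: rows are concept codes, columns are
    # successive focus groups, cells mark whether a concept was elicited.
    # Concept codes and group labels are hypothetical.
    import pandas as pd

    elicited = {                      # concepts mentioned in each successive group
        "FG1": {"breathlessness", "fatigue", "cough"},
        "FG2": {"breathlessness", "cough", "sleep disturbance"},
        "FG3": {"fatigue", "cough"},
        "FG4": {"breathlessness", "fatigue"},
    }

    codes = sorted(set().union(*elicited.values()))
    grid = pd.DataFrame(
        {grp: [code in concepts for code in codes] for grp, concepts in elicited.items()},
        index=codes,
    ).astype(int)

    # A concept is "new" in the first group where it appears; saturation is the
    # point after which no further group introduces a new concept.
    first_seen = grid.idxmax(axis=1)                 # first group mentioning each code
    new_per_group = first_seen.value_counts().reindex(grid.columns, fill_value=0)
    print(grid)
    print("New concepts per group:", new_per_group.to_dict())

In this toy example no new concepts emerge after the second group, which is the kind of evidence the saturation table is meant to display.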

 

  • 4. Modification of the Original PRO Instrument.

Modifications to an instrument may include: changes in wording or content; change in response options or recall period, changes in mode of administration; translation and cultural adaptation; and application to a different patient population (10, 15).

Page 7: In Scenario A, a PRO measure was developed and tested to assess dyspnea in COPD and is used routinely by pulmonologists in clinical practice. The question here is whether the tool could be used to evaluate treatment efficacy in clinical trials either involving a new patient or target population, i.e., asthma. In this example, the first threat is that there may be aspects of the concept of “dyspnea” that are experienced uniquely by asthma patients and are not addressed in the instrument. A review of the literature and discussions with the developer may uncover qualitative study reports in which data from patients with asthma are presented as part of the instrument development process or by others interested in using the tool in asthma. The review might also uncover independent qualitative studies examining the concept of dyspnea as experienced by patients with asthma, preferably similar to those to be enrolled in the trials, and results that map to the instrument contents

Page 8: VII. Conclusions

 Content validity refers to the extent to which an instrument contains the relevant and important aspects of the concept(s) it intends to measure in the patient population of interest. This paper discussed the key issues involved in assessing and documenting the content validity of an existing measure, including concept clarification, instrument identification and initial review, as well as qualitative methods. Case examples were used to discuss threats to content validity and various approaches for remediating these threats. Several tools were identified to aid in the evaluation of content validity, including endpoint models that describe the correspondence between concepts, measures and labeling goals; the PRO instrument’s conceptual framework to evaluate and communicate the conceptual match between item content and concepts; and qualitative research methodology that forms the empirical basis for decision-making and documents the methods used to support content validity.
-----------------------------

 13. We would like to thank the Task Force for developing this paper, which will serve as a useful guide to health care manufacturers.  Our comments are below.

  • Among possible PRO endpoints targeted for label claims, there is an important distinction between “HRQoL” scales, such as physical and social functioning, versus symptom scales, which may measure variable and intermittent events.  It would be helpful to provide detail on assessing content validity of symptom measures.  The paper states that frequency, intensity, and duration are attributes of symptoms, and those attributes that patients identify as important should be measured for all symptoms.  However, it seems that if a symptom is considered important, then all of these attributes should be considered for measurement if they are applicable in comprehensively capturing the symptom, and the symptom is being targeted for a label claim. 

Conceptual model development for symptoms has included two key components:  the occurrence of the symptom and the individual's emotional response to it, i.e., distress.  This leads to a general question about content validity in symptom measurement:  if the symptom is deemed important, then should 'bothersomeness' of the symptom [or alternative item(s) that focus on symptom impact] be included in the symptom measure?

1a)  Perhaps the example for the #3 threat to validity (no evidence that the most relevant and important item content is contained in the instrument) that refers to symptom attributes should be replaced?  Given that frequency, intensity, and duration are key features associated with symptoms, they will be important depending on the nature of the symptom itself, and would not be ranked with differential importance ratings from the patient perspective.  Specifically, if both intensity and duration apply for a symptom, a patient would not say that one is important but not the other.

  • On page 8, there are two statements that 'a combination of focus groups and 1:1 interviews can increase confidence that saturation has been reached'.  It is not clear whether the 1:1 interviews referred to here are 'cognitive debriefing interviews' or 1:1 open-ended interviews discussing patient experience.  If the statements are referring to 1:1 open-ended interviews, then this suggests that one obtains different information from focus groups versus 1:1 (open-ended) interviews, and this has not been our experience.  Nor do we know of any evidence that this is the case.   It would be useful to clarify this, because companies who review this paper may believe that they need to do both focus groups and 1:1 interviews to obtain information on patient perceptions/experience to increase confidence that saturation has been reached with respect to identification of concepts.
  • On page 7, it is noted that empirical methods, specifically focus groups and 1:1 interviews, are used to elicit information from patients to inform instrument development.  Should something be said regarding patient-based Web forums to help inform content, particularly for rare conditions? 
  • On page 10, it is noted that, for practical purposes of budgeting projects, it is not uncommon to set a sample size of 20-30 interviews, even though saturation may occur earlier than the nth interview. It may be useful to reference:  Guest G, Bunce A, Johnson L.  How many interviews are enough?  An experiment with data saturation and variability.  Field Methods 2006;18(1):59-82.  Guest and colleagues assessed information saturation based on 1:1 interviews and concluded that 12 were sufficient to attain saturation (this likely translates into approximately 3-4 focus groups, based on our experience).
  • The paper explains that the endpoint model is useful as a first step in constructing the TPP.  We believe that it is actually the reverse of this.  The TPP presents the hypothesized treatment effects, and this profile is then used to inform the target endpoints in the trial, from which the endpoint model is developed.
  • A note regarding the comment that cognitive debriefing interviews can provide an opportunity to query patients about the comprehensiveness of a tool; at the end of the interview, patients may be asked if there were any aspects of the concept, e.g., experiences, symptoms or sensations, that were not addressed in the instrument, and if so, how important these are to the concept.  In our experience, cognitive debriefing interviews can be used to confirm existing content, but they typically do not yield new relevant concepts.
  • With respect to saturation, it may be useful to clarify that the saturation described refers to that attained through evaluation of open-ended focus groups or 1:1 interviews.  It may be useful to discuss how this evaluation of saturation differs from that conducted for cognitive debriefing interviews, in which the objective is to identify item-level problems, not to identify new concepts. 
  • There appears to be an error with respect to the dark/light colors in Figure 3.     
    ------------------------------------------------------- 

14. Many thanks for the opportunity to review this manuscript. Content validity is such a fundamental element to the development and validation of PROs that such a manuscript providing greater clarity to this area is important.  We have reviewed the manuscript within a group of PRO experts working in Outcomes Research here at Pfizer (listed below) and I have summarised collated comments in the email below.

Collated comments from: 5 Pfizer associates

OVERALL MANUSCRIPT:
As PRO experts heavily involved in PRO development, we found the paper interesting; however, for an audience that is not so familiar with the topic, a more practical guide would be very much needed, and there are several loose terms in the paper that could cause confusion (e.g. the definition of "concept" on page 3).  It would therefore be useful to clarify the intended audience for the manuscript.  Greater clarity around the practicalities of content validity could be built very easily into the manuscript by adding more specific examples (for example in the recommendations and the tables and figures) to complement the theoretical discussion nicely.  It would be useful to both novices and PRO experts alike if the recommendations in the paper could be more specific about particular points; although we realise that it is difficult to be too prescriptive, some ranges or examples as guides would be informative (e.g. suggesting a range in the proportion of patients identifying a theme to indicate construct underrepresentation (page 9, first para)).

There is a great deal of repetition in the manuscript, reducing this and focusing the paper would make it very much more readable to the audience.

Greater use of language that is consistent with the draft guidance would help to clarify certain points.  Making reference to conceptual frameworks (e.g. in section II), and adding in citation references to the guidance for concepts such as the endpoint model and TPP, will help to consolidate understanding in the area and highlight consistency across the field.

INTRODUCTION:
It would be useful to many readers to begin with a simple definition of content validity up front in the introduction.  Making reference at the beginning to Laurie's previous statements in her presentations (or at least to this concept) -- namely, that no amount of psychometric validation will make up for a lack of content validity -- and highlighting the reasons why this is integral to PRO development would serve to emphasise the importance of this topic (and therefore of the paper) to the reader from the start.

SECTION IV:
Explicitly stating the need for and benefits of consulting with the original developer of an existing instrument would be useful here.  Not everyone is aware of the need to obtain permissions for some measures, or of the insights that original developers and perhaps other KOLs can bring throughout the process.  This is also relevant to the point raised re considering alternative naming conventions for existing measures (SECTION V, page 9, 2nd para), which could be problematic if a measure has been used and validated previously.  The potential that some developers may not be willing to share data is also worth mentioning in the discussion on case examples (SECTION VI, page 13, 2nd and 4th para).

SECTION V:
Data collection section (page 8) - Discuss the importance of the discussion guide to ensure that all of the relevant topics are covered and the need to minimise bias by starting with general topics and then focusing on salient points for the research.

Page 8 - 3rd para - the definition of saturation is provided at the end of this para, but would be more useful at the end of the 2nd para, where saturation is also discussed.

Page 9 - concept of construct underrepresentation - linking this to the concept of "fit for purpose" would be useful here.  A conceptual model that underlies PRO tools may always have the potential to be considered an underrepresentation in the strictest sense, as it moves away from concrete experiences to abstract concepts.  There will always be individual aspects to some experiences; for example, the language used to describe an experience varies individually, culturally and linguistically.  It is not necessarily possible to achieve a totally comprehensive representation for all patients everywhere all of the time.  The aim should rather be a measure that captures the most salient points, is accepted and understood by the patients who are using it, and is judged to be acceptable by scientific communities and regulators as being fit for the intended purpose.  Given this, it would be more useful to talk of acceptable or unacceptable coverage of patient experiences in conceptual models and then give some broad ranges of what may be examples of those.  This issue of the relevance of differences is discussed in scenario C of the case studies, which is very important, and also in section VI (point 3, page 12); however, it would also be useful if it were referred to in these other discussions within the text of the data analysis section and the section on saturation (page 10).

Data analysis section (page 9) - suggest adding in a statement to the effect that qualitative data analysis should ideally be done on an electronic package such as....  This would enable good record keeping for those doing the analysis, but also greatly facilitate a review of this work by others if this occasion were to arise (such as regulatory review).

Saturation section (page 10) - add in concept of fit for purpose.  Also clarify the consistency with guidance suggesting that interviews should be conducted sequentially for saturation analysis and that repetition of interviews/focus groups is needed to provide comprehensive evidence of saturation.

TABLE 1
Point 3 - suggest rephrasing to focus attention on what we are trying to do e.g. Identify possible PRO measures to evaluate the concept
Point 7 - make it explicit that this is done by reviewing the findings of the qualitative research
Point 8 - bring in the concept of whether the measure is "fit for purpose"
Point 8b - is changing the concept/claim to suit the instrument really what we want to advocate?

Also, a couple of typos:
Running title: Should this read Content validity of existing PRO instruments (rather than PRO existing instruments)
Introduction, page 2, line 4 - S missing out of ISPOR
Page 4, last para - the section # is missing off the title for section III
Table 2 scenario A - approaches to remediation point 2.  The word "Review" is missing in sentence - Do an extensive literature....
Table 2 Scenario C - last sentence, first para - "language" is missing - in the language used by some patients...
Table 2 Scenario C - issues/threats point 3 - should begin with "Are" rather than "Do"

Thanks again for the opportunity to review.  This manuscript addresses an important topic and I hope that these comments are useful.
--------------------

15.  Feedback on: “Good Research Practices for Evaluating and Documenting Content Validity for the use of Existing PRO Instruments and Their Modification.”
This is a superb document containing a very clear definition and explanation of “Content Validity” encompassing the definitions of “intent”, “instrument”, “concept,” “representative.” It succeeds in providing all-inclusive guidelines for tool development and modification. The steps in development and modification are succinct, unambiguous and easy to follow. The use of the case examples provides excellent practical guidelines to help evaluate PRO content validity.
These should prove invaluable to tool developers.
There are very few modifications needed, apart from some edits and possible minor additions.

Edits: On page 2, in the introduction, line 4, IPOR should probably be ISPOR?
On Page 4, in the paragraph beginning "In summary...," in the sentence about the three critical elements to content validity, I suggest that "understanding" be changed to the more acceptable, "harder," and more objective "recognizing."
Table 1. I did not see it referenced in the text. The table itself is excellent and mirrors the guidelines in the text.
In Table 2, Scenario A, under "Approaches to remediation," #2 ("Do an extensive literature.. to determine..."), the word "search" has been omitted.
Scenario C: the last sentence in the first paragraph should read, "there do seem to be some differences, however, in the words/descriptors used by some..."
Change "does" to "do", and include "words/descriptors" to correct the omission.
Also in Scenario C, under Issues/threats to content validity, I suggest including after 2:
“3. If several patients use different words, which imply a different meaning, could this imply a misunderstanding?”
Previous # 3 could become #4.
Figure 1. Under disease attributes…
Instead of “Observed and Laboratory”, I suggest
Results of:
Clinical examination
Laboratory tests
Electrophysiological tests
Scans, X-rays
Figure 2.
“Generated words and Phrases” should include “from Patients”
In this figure, there seem to be some missing items:

  1) Input from a clinical team of specialists comprising a Delphi panel
  2) Ease of reading
  3) Translatability

Maybe 1) could be included under Concept Elicitation and 2) and 3) under Interpretation and Meaning.
--------------------

16. This was one of the best task force reports that I have read.  I actually enjoyed reading it and found that I learned a lot from it even though I thought I knew a great deal about the subject.  The case studies were excellent examples of situations that are likely to occur and give practical recommendations for how to address them.

I had mostly minor comments on confusing sentences and suggestions for moving some concepts earlier in the paper because they are so important.  See attached file for edits and comments.

It will be great to have this report published as it will definitely contribute to the field.

a. The abstract and the first paragraph are almost the same.  Should they be more differentiated?

b. p 3 Is it necessary to define "universe"?  Isn't that a recurring issue, that some measures don't cover everything in the concept? I would include it as a key part of the range of what the content is supposed to cover.

“Concept” is “an abstract or generic idea generalized from particular instances” (http://www.merriam-webster.com/dictionary/concept, accessed March 2, 2009). In the context of a regulated claim, the concept is the claim.
• “Representative” indicates that the instrument adequately samples the specified domain of content (2).
Content validity addresses the issue of the adequacy of the instrument to measure what it purports to measure.

c. p 4  The wording "to develop...to capture" might be confusing because it could be interpreted as meaning that the instrument was developed to capture and quantify the concept, when it is the methods that do this.  Slight rephrasing might reduce the ambiguity: documenting the methods used in developing the instrument in order to capture and quantify the concept.

In summary, the relationship between the intended measurement concept and the methods used to develop and select items and evaluate the content and formulation of the instrument are the core of content validity, with a detailed description of these methods and results used as evidence that the proposed use and interpretation of scores from the instrument represent the intended concept, and therefore, possess content validity (1-5). In the context of a PRO instrument to support a claim, there are three critical elements to content validity: identifying and defining the concept of interest; understanding the intersection of the disease attributes and patient experience; and documenting the methods used to develop the instrument to capture and quantify the concept.

d. p 5. I found this confusing, especially the placement of "from the original", because it refers back to "adjustments" at the beginning of the sentence. Suggest trying to streamline the sentence, or at least the first part.  Perhaps this is clearer:

The name or concept may need to be adjusted from the original PRO instrument in order to more accurately...

Also, I think an example might clarify it too.  Maybe the example of "Fatigue" being the title often used when patients would typically say "tiredness" or "exhaustion".

 If a conceptual framework of the instrument is available, it is examined for consistency with the concept. A conceptual framework is a detailed description or diagram of the relationship among the concepts, domains, and items comprising the instrument (10, 11). If such a framework does not exist, one should be developed, showing how the items, subscales, and total scales are related to one another and to the underlying concept and claim. The names used to describe the concept and subscales should be critically evaluated in light of the content and structure of the items, TPP and the target PRO claim. Adjustments in the name or concept referenced in the PRO instrument may need to be made from the original in order to more accurately reflect the content and link to the claim.

e.  p. 6  Something is strange here; the methods are the essence?  Could we say "documenting" in place of "its"?  That way it means that the methods are the essence of documenting content validity.

The methods used to develop a PRO instrument are the essence of its content validity for a given purpose. A complete understanding of these methods through information available in the published literature and other documentation is essential in order to evaluate the suitability of an existing PRO instrument for any purpose

f. p. 6 Is this meant to modify "modifications"?  I am not sure it is clear what type of modifications these are.

Because qualitative methods are essential to selecting and documenting the content validity of an existing instrument and to performing content valid modifications if necessary, the following section provides an overview of key aspects of the methods particularly relevant to evaluating existing PRO measures to support regulated claims.

g. P8 In reading this I realize my interpretation of Figure 2 might have been off.  Is it meant to represent individual interviews during item generation, or this cognitive debriefing process of reviewing an instrument?  I was thinking of the latter in my comment on Figure 2.

One-on-one interviews are also used for cognitive debriefings in which patients review an existing instrument or item pool and provide the developer with insight into the extent to which their interpretations match the intent of the items and any critical content that has been omitted from the measure. A combination of focus group and 1:1 interviews can increase confidence that saturation has been reached.

h. P9 This is the third time saturation is mentioned. Maybe it should be defined earlier in the paper.

Coding transcripts by participant, using initials or other coding system to protect anonymity, allows the researchers and reviewers to evaluate the representativeness of content across participants and provides assurance of saturation, (defined and discussed below).

i. p10  Consider this earlier in the paper, as it comes up a number of times prior to this point.
Saturation
Qualitative data should be gathered to the point of saturation to ensure that the items in an instrument appropriately represent the relevant “universe of content” for the concept.

J. p13. This is a good point, but it lacks coherence with the previous sentence, which talks about changing from clinical practice.  There needs to be more of a transition to the next sentence.

It is not uncommon for clinical tools to be developed using clinician expert opinion with content that addresses the information needs of the practice setting. For example, if the tool was developed using qualitative research methods with direct input from patients with asthma similar to those to be enrolled in the clinical trial and results of this work is available, the magnitude of the threat declines

K.  p13 Need to add that such modifications would require additional psychometric validation of the instrument in the new target population as well, if you agree it is needed.  I imagine if items are added to address patient input, then it is technically a new instrument. 

This would enable the user to evaluate concept coverage and, if adequate, provide data for documenting the relationship between patient data and instrument content in the target population. If the content is found to be inadequate, this process would provide the sponsor with an opportunity to modify the instrument, with the potential for increasing the sensitivity of the instrument to detect treatment effect in this patient population.

L. p 14 This issue of sampling from the universe of possible content is really important.  I think it should be mentioned earlier and expanded upon, as this is a recurring issue: what is enough, and do you have to include everything patients say to have content validity?  I think one of the benefits of this paper is to clarify points like this.

It is important to note that content validity involves the adequate sampling of content from the universe of all possible content to measure the concept of interest.

M. p 14-15
Need more explanation of why using different words or phrases is a threat to content validity, and therefore why cognitive interviews would be important.

To be certain, the sponsor elects to verify content validity of the instrument in a small number of patients using focus groups. From these groups, the sponsor learns that the concept(s) is relevant and the item content reflects the full range of experience described by the patients. However, patients in the study use different words or phrases than those used in the instrument.

---------------

17. My opinion regarding the document:
 
A comprehensive document with excellent tabular presentations.

A systematic introduction, with clear explanations of content validity -- the core point of PRO concept identification.

An elegant connection between Concept Identification and the Labeling Context.
The methodological approach is systematic and precise regarding sampling and data collection.
The importance of cognitive debriefing is very well expressed.

Data analysis -- the matching representation is original.
Perhaps add more about a mathematical approach to the data analysis, with examples.

The term "saturation" is sympathetic, but I am wondering if it is the best term to explain whether "the items in an instrument appropriately represent the relevant universe of content for the concept".
It might, perhaps, be "realisation point" -- more open than "saturation", which is closed in its dimension -- because "there is no fixed rule on the sample size needed to reach saturation".

Modification of the Original PRO Instrument -- nice case presentations.

Compliments on Table 1.

Figure 1 (Content Validity) -- perhaps, alongside "Observed and Laboratory," also add physicians' observed descriptions and measurements of disease attributes!

Once again, kind regards to you and compliments to the ISPOR PRO Task Force.
-----------------------------------

18. Thank you for sharing this great document.  Although it was a bit theoretical for me (a statistician), I found it very fun to read. I don't have any major comments or suggestions.  The following are my thoughts.

 I agree that a qualitative analysis approach is instrumental.  However, we also use a quantitative approach to test content validity.  

i) For instance, a Content Validity Index can be calculated for each item to justify its inclusion.
I think people love to have some number to make a judgment.
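As an illustration of the kind of number the reviewer has in mind, one common formulation of the item-level CVI is the proportion of expert raters who score an item 3 or 4 on a 4-point relevance scale; the ratings below are hypothetical and the snippet is only a minimal sketch:

    # Item-level Content Validity Index (I-CVI): proportion of expert raters who
    # score an item 3 or 4 on a 4-point relevance scale. Ratings are hypothetical.
    ratings = {                     # item -> one rating per expert (1-4)
        "Item 1": [4, 4, 3, 4, 3],
        "Item 2": [4, 3, 2, 4, 3],
        "Item 3": [2, 3, 2, 1, 3],
    }

    def i_cvi(item_ratings):
        relevant = sum(1 for r in item_ratings if r >= 3)
        return relevant / len(item_ratings)

    item_cvis = {item: i_cvi(r) for item, r in ratings.items()}
    scale_cvi_ave = sum(item_cvis.values()) / len(item_cvis)   # S-CVI/Ave

    for item, cvi in item_cvis.items():
        print(f"{item}: I-CVI = {cvi:.2f}")
    print(f"S-CVI/Ave = {scale_cvi_ave:.2f}")

Cut-offs for retaining items vary by source; some authors suggest an I-CVI of roughly 0.78 or higher when six or more raters are used.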

ii) "Eliminating redundant items is expected to leave the intended measurement concept intact. In this case scenario, qualitative evaluation is recommended, even if results of ---  (page 14)

I agree that qualitative evaluation would be good too, but we frequently (almost always) conduct a factor analysis to see whether all items load significantly on the latent factor.
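A minimal sketch of that kind of check, assuming a single latent factor and using scikit-learn's FactorAnalysis with a conventional loading cut-off (a formal significance test of loadings would typically be done in SEM software), might look like the following; the data are simulated:

    # Rough sketch: fit a one-factor model to simulated item data and flag items
    # whose loadings fall below a conventional cut-off (|loading| < 0.40).
    import numpy as np
    from sklearn.decomposition import FactorAnalysis

    rng = np.random.default_rng(0)
    n = 200
    latent = rng.normal(size=n)                       # simulated latent factor
    items = np.column_stack([
        0.8 * latent + rng.normal(scale=0.5, size=n), # item 1: strong indicator
        0.7 * latent + rng.normal(scale=0.6, size=n), # item 2: strong indicator
        0.1 * latent + rng.normal(scale=1.0, size=n), # item 3: weak indicator
    ])

    fa = FactorAnalysis(n_components=1).fit(items)
    loadings = fa.components_[0]                      # loadings on the single factor

    for i, load in enumerate(loadings, start=1):
        flag = "" if abs(load) >= 0.40 else "  <- weak loading, review item"
        print(f"Item {i}: loading = {load:.2f}{flag}")

Note that the loadings here are on the scale of the simulated data (which are roughly standardized); with real questionnaire data one would standardize the items, or use dedicated factor-analysis or SEM software, before applying such a cut-off.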

19. 1) It would be useful to provide an example of a trial setting and existing instrument that could be used without modification. 

2) Part of the process is to do a gap analysis based on the PRO evidence dossier outline to identify strengths and weaknesses of the existing instrument.  That could be made more explicit.

3) Even though the paper was about content validity only, the next steps that would need to be done to use the existing instrument following the additional validation work should be outlined. For example, does the modified instrument need to be included in a phase 2 study before using it for a potential label claim in phase 3?
---------------------------------------

20.

  • In the abstract, the first sentence should perhaps be more general, e.g., “PRO instruments are important in evaluating how patients feel or function.”
  • Throughout, the use of the term “address” in the phrase “to address good clinical research practices” may strike readers as strange. This may be the stated reason for convening the task force, but “address” sounds like GCRP are a problem that needed to be addressed. Consider “establish”, “develop”, or “identify”.
  • Introduction, first sentence: again, consider going a bit more general by changing the word “efficacy” to “effects”. Your focus on efficacy comes later. You may even want to add that clinical research has begun to expand beyond biomarkers to recognize patient report as the gold standard for many outcomes important to patients’ well-being (and add citation/s to this effect).
  • Consider removing the last sentence of the first paragraph of the intro, to avoid repetitiveness. May leave as is, or add a last sentence narrowing to the context of regulated claims.
  • 3rd bullet under II, the sentence “In the context of a regulated claim, the concept is the claim.” – it is difficult to see what this means and it needs more elaboration. Product claims are statements about efficacy generalized from trends in aggregate data, not particular instances. Consider moving to later and explaining in greater depth. Also, the whole format in this bulleted section of defining the terms “intent”, “instrument”, “concept” and “representative” – even before you use them – is a convention I most commonly see in legal documents I need to sign or terms and conditions I need to agree to. This style issue affects readability and may not convey the tone you want to set. Consider integrating definitions into first use of each term later.
  • Speaking of readability, the paper needs something early on to draw the reader in to the ideas presented. Who do you want to do what as a result of the ideas you are presenting? Recounting that the task force decided to focus the paper on X is not compelling. Maybe something like: “These best practices will provide guidance for PRO instrument selection and modification in the context of regulatory claims. Our objective is to provide industry and contract researchers with a set of practices to implement from the start of their studies, in order to improve the research behind labeling claims.” – that may be a little off here or there, but something like that would help rustle up enthusiasm.
  • 1st sentence under Concept Identification within the Labeling Context: typo, “it’s”
  • Move last sentence of that paragraph to become the new 2nd sentence of that same paragraph.
  • Section V, Qualitative Methods, 1st sentence following the quote – I would change the word “empirical” to “observational” just to be more specific since empirical means derived from experiment or observation.
  • 1st paragraph under Data Collection Methods, consider: “…and an art that depends on group dynamics and on the training, experience, and people (or interpersonal) skills of the focus group moderator” – because I don’t think training and experience are enough, and even a great moderator’s success in garnering good information depends on the group dynamic.
  • 2nd paragraph under Data Collection Methods – after the term “saturation”, may want to add “(the point in qualitative analysis at which no new information is added by new content, discussed more in-depth later on).”
  • Ah, there it is at the end of the next paragraph – maybe move earlier
  • 2nd sentence of the 4th paragraph under Data Collection Methods “These interviews can complement information elicited in focus groups, providing additional evidence of content validity. They may also serve as independent confirmation of content validity.”
  • 3rd sentence of the 1st paragraph under Data Analysis – “Coding transcripts by participant unique identifier or disease characteristic allows researchers…”. The mention of anonymity distracts from the main concept and is really a whole separate issue, since recoding names to initials is not the only thing done to protect patients from privacy violations.
  • Figure 3: it could be made more illustrative by adding example concepts for each scenario.

[Reviewer's diagram: “Instrument content” (left) compared with “Elicited content” (right); original image: http://www.ispor.org/TaskForces/EIM_background_clip_image002_0000.gif]

  • As for benchmarks for what percentage of content matching indicates the quality of the match (e.g., less than 30% being a poor match), it would be good to check for a reference on this – and perhaps to note that the percentage of matching content depends greatly on the size and number of coded content segments, and on whether the concepts are coded as mutually exclusive (a brief numeric sketch of this point follows these comments). In the example above, the match would be even worse if “ability to dress” were segmented in the instrument into “ability to put on shirt” and “ability to put on pants”. On the right of the diagram, if meal preparation were coded under light housework, it could be made to look as though only one area of the elicited content was missing from the instrument. These grey areas merit some discussion.
  • Table 1 is very useful, but it could perhaps be even more useful if references to relevant texts were provided. For example, each of the following texts provides guidance on sample size for focus groups (though each recommends a different ideal focus group size!):
  • Krueger, R. (1994). Focus groups: A practical guide for applied research (2nd ed.). Thousand Oaks, CA: Sage.
  • Morgan, D. (1988). Focus groups as qualitative research. Qualitative research methods (Vol. 16). Newbury Park, CA: Sage.
  • Stewart, D., Shamdasani, P., et al. (2007). Focus groups: Theory and practice. Thousand Oaks, CA: Sage.
  • Bender, D., & Ewbank, D. (1994). The focus group as a tool for health research: Issues in design and analysis. Health Transition Review, 4(1), 63–79.

Also, there are some seminal texts on cognitive interview methods.
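
As a minimal sketch of the matching-percentage point above (the concept labels and the simple set-overlap definition of “percent match” here are hypothetical and purely illustrative), the same underlying content can yield very different headline percentages depending on segmentation granularity and on whether concepts are coded exclusively:

# Minimal sketch (hypothetical concept labels): how "percent match" between
# instrument content and elicited content shifts with segmentation and coding.

def percent_match(instrument, elicited):
    """Share of elicited concepts that also appear in the instrument."""
    matched = set(instrument) & set(elicited)
    return round(100 * len(matched) / len(elicited), 1)

# Coarse segmentation: one concept per broad activity.
instrument_coarse = {"ability to dress", "light housework"}
elicited = {"ability to dress", "light housework", "meal preparation"}
print(percent_match(instrument_coarse, elicited))            # 66.7

# Finer segmentation of the same instrument item lowers the apparent match,
# because "ability to put on shirt/pants" no longer matches "ability to dress".
instrument_fine = {"ability to put on shirt", "ability to put on pants",
                   "light housework"}
print(percent_match(instrument_fine, elicited))               # 33.3

# Non-exclusive coding: folding "meal preparation" into "light housework" on
# the elicited side makes the gap disappear from the statistic entirely.
elicited_collapsed = {"ability to dress", "light housework"}
print(percent_match(instrument_coarse, elicited_collapsed))   # 100.0

The point is only that any benchmark (e.g., 30%) needs to be read against the coding scheme that produced the percentages.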
-----------------------------------

21.  General comments
This very timely commentary is a much needed contribution to the growing literature on the use of PRO measures to support medical product labeling claims. On the whole, I have found this a very well written and thoughtful report, offering answers to many of the questions relating to the evidence required to support decisions about the content validity of PRO measures. I have only a few minor comments to make, which are intended in support of the paper and only as suggestions to improve its usability.

Minor comments

  • Page 4: Section III “Concept identification within the labeling context” is not numbered.
  • Page 5, Section IV, 1st para: Reference to Draft PRO Guidance needed at end of paragraph.
  • Page 5, Section IV, 2nd para: the authors may wish to include mention of the instrument name itself in the following sentence (shown with the addition underlined): “The names used to describe the concept, instrument and subscales should be…”. Very often the name of the instrument is also misleading, not just the subscale names. I think this is an important point, alluded to by Polonsky (2000) and in my editorial (Speight and Shaw, 2007), as researchers very often take the name of an instrument to be indicative of its content, thereby implying that its content validity is suitable. In our recent review of diabetes instruments (Speight et al, 2009), we provided a critical evaluation of several instruments, among them the EuroQol EQ-5D, which many people assume to measure quality of life rather than health status by virtue of its name.
  • Page 6, Section IV, 3rd para, line 5: the authors indicate that “an instrument may be appropriate if… a sample similar to the development program’s target population was used”. How would the authors define “similar”, and what tools can be used to assess the extent to which sufficient similarity is achieved? It might be useful to include a table in which the first column lists “criteria” (such as age, gender, ethnicity, treatment type, etc.) and the second column lists either the “standards” to be achieved or the issues to be considered (if setting standards seems beyond the authors’ remit). I believe that some guidance on what constitutes similar and dissimilar populations is much needed and would add even greater value to this report.
  • Page 6, Section IV, 4th para, line 3: change sentence to read “In these cases, there is little empirical evidence from which the sponsor or reviewer can make…”
  • Pages 7-10: It would be helpful if sub-headings of Section V were numbered, e.g. “V.i Sample”, “V.ii Data Collection Methods”.
  • Page 8, Section V.ii Data Collection Methods, 2nd para, final line: change sentence to include the following underlined phrasing “…can increase confidence that saturation (defined and discussed in Section V.iv below) has been reached”. Otherwise, the reader is left wondering whether this is going to be discussed in further detail or not.
  • Page 9, Section V.iii Data Analysis, 1st para, line 5: change sentence to read “…defined and discussed in Section V.iv below)”.
  • Page 10, Section V.iv Saturation, 1st para: Towards the end of the paragraph, the authors may wish to include mention of the need for larger samples when stratification of the sample is needed, to ensure that saturation has been reached within each sub-sample (a brief sketch of checking saturation within each stratum follows these comments).
  • Page 11, Section VI, bullet 2, final line: define “important” perhaps?
  • Page 12, Case examples: should that be Section VII?
  • Page 13, Case examples, 3rd para: What about the instrument recall period? It has been mentioned earlier that the recall period may be suitable for a research study or clinical use but not necessarily suitable for use in a clinical trial to support a label claim. If a change in the recall period is needed to make the instrument suitable for use to achieve a label claim, what level of evidence is needed to support that change to the instrument? Is full psychometric validation required? Can cognitive debriefing confirm the suitability of the new recall period? It would be useful if the authors could offer guidance on this or reference guidance elsewhere in the literature.
  • Page 16, References: For refs 6 and 7, the FDA as the author does not appear to be properly referenced.
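
As a minimal sketch of the stratified-saturation point raised for Section V.iv above (the strata, concept labels, and two-interview stopping rule are hypothetical, purely illustrative), saturation can be tracked separately within each sub-sample rather than only in the pooled sample:

# Minimal sketch (hypothetical data): check saturation separately within each
# stratum, rather than only across the pooled sample.

def is_saturated(interviews, window=2):
    """Saturated if the last `window` interviews added no new concepts."""
    seen = set()
    new_counts = []
    for concepts in interviews:
        new = set(concepts) - seen
        new_counts.append(len(new))
        seen |= new
    return len(interviews) >= window and all(n == 0 for n in new_counts[-window:])

# Concepts elicited per interview, split by stratum (e.g., disease severity).
interviews_by_stratum = {
    "mild":   [{"fatigue", "sleep"}, {"fatigue"}, {"sleep"}],
    "severe": [{"fatigue", "pain"}, {"mobility"}, {"pain", "appetite"}],
}

for stratum, interviews in interviews_by_stratum.items():
    status = "saturated" if is_saturated(interviews) else "needs more interviews"
    print(f"{stratum}: {status}")
# mild: saturated (the last two interviews added no new concepts)
# severe: needs more interviews (new concepts are still emerging)

Under a rule like this, the pooled sample could appear saturated while one stratum is still producing new concepts, which is why stratification generally implies a larger overall sample.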

References
Polonsky WH (2000) "Understanding and Assessing Diabetes-Specific Quality of Life." Diabetes Spectrum 13(1): 36.
Speight J, Shaw JAM (2007) "Does one size really fit all? Only by considering individual preferences and priorities will the true impact of insulin pump therapy on quality of life be determined." Diabetic Medicine 24(7): 693-95.
Speight J, Reaney MD, Barnard KD (2009) "Not all roads lead to Rome – a review of quality of life measurement in diabetes." Diabetic Medicine 26(4): 315-327.

------------------------------------
NON-ISSUE COMMENTS

22. I have read the draft manuscript and have no comments.  The approach taken and the recommendation are practical and appropriate and the three examples are a useful way to make concrete the issues involved.  Thank you for sending me this material.

23. No further comments.

24. Comments:  Excellent paper!!!

25. I have no comments
