The issue of missing data is a major challenge in
economic evaluations, often impacting a
researcher's ability to draw valid and conclusive
inferences. Reliance on subsets of “complete”
information or incorrect use of imputation techniques
compounds the problem. Lieven
Annemans and Dan Ollendorf report on some of
the issues and solutions discussed in a recent
IMS symposium, “Handling Missing Data in
Economic Evaluations,” held at the ISPOR 11th
Annual International Meeting in Philadelphia, PA,
USA, May 2006. Speaking at this symposium
was Dr. John R. Cook, Director of Health
Economic Statistics at Merck; offering potential
solutions was Mr. Daniel A. Ollendorf, Vice
President, Applied Research at PharMetrics (a
unit of IMS).
Health economic evaluations typically draw on
longitudinal information - from prospective randomised
clinical trials, retrospective databases,
or medical file analyses. Any gaps in the information
available thus present a major challenge
for the successful completion of a meaningful
evaluation that is based on valid assumptions.
An Inevitable Phenomenon
To a certain extent, missing data are an inevitable
issue in economic evaluations. A feature of both
prospective analyses (due to patient dropouts)
and retrospective evaluations (due to the nature
of databases) alike, missing information can
occur in all types of data collection, from postal
surveys to rigorously controlled clinical trials.
The key challenge is to maximise usage of
data that are available while minimising the bias
introduced by the elements that are missing.
Why Do Data Go Missing?
(The term “missing data” essentially refers to observations that were intended to be collected but, for one reason or another, were not.)
The nature of missing information is variable. In
patient-completed surveys and prospectively
controlled trials, for example, data are typically
lost in one of two ways - systematically or randomly.
This can be illustrated using the title “The
Importance of Handling Missing Data”. A systematic
loss of data might take the form of “Th mprtnc f hndlng mssng dt” where the vowels are
missing in every word. A random loss might
appear as “The mprtac o ndin Missing Data”
reflecting the lack of any specific pattern to the
missing information. The important point is that in either of these cases, a different message will be conveyed. This will also be the case with other types of missing data, which include:
Time point missing data: This is typically seen in patient-reported outcome (PRO) studies or clinical trials due to a missed scheduled visit. Extending our title analogy, this would lead to '… Importance of Handling … Data'.
Censoring: Censoring is fairly common in clinical
trials and also within retrospective databases.
Trials, for example, are often truncated before
patients reach the specified endpoint, and in longitudinal
databases, the starting point for patients
can occur after they have commenced treatment,
leaving researchers without an anchor point.
Similarly, patients can leave databases at varying
points in time, leading to gaps. Censoring can occur either at the beginning ('… of Handling Missing Data') or at the end of a trial ('The Importance of Handling …').
Missing data mechanisms are important to understand, as the necessary adjustments will differ. We distinguish the following mechanisms:
Missing Completely at Random (MCAR): In cases where the probability of an observation being missing does not depend on observed or unobserved measurements, the observation is said to be missing completely at random, often abbreviated to MCAR. An example of an MCAR mechanism is a laboratory sample that is lost or damaged before it can be analysed.
Missing at Random (MAR): With MAR, missingness may depend on the observed data; the data are only conditionally missing at random. Formally, if the probability that Y is missing, conditional on both X (a vector of observed variables) and Y itself, depends only on the value of X, then Y is MAR. For instance, non-compliance may depend on adverse events: given a particular adverse experience, whether one person has missing data while another has complete data is a random phenomenon, because the probability of missingness depends only on X, the presence or absence of that adverse experience. Conditional upon X there is a random sub-sample, but data are needed to provide some information on the X's. One of the problems here, noted Dr. Cook, particularly in retrospective studies, is that the complete observed data may not be available to establish that conditional independence.
Not Missing at Random (NMAR): Noted to be one of the most problematic kinds of missing data, NMAR arises where the dropout depends on the missing, unobserved data themselves. An example might be a form sent out to patients, to be returned within a set period of time, collecting information on the number of outpatient care visits they have used. If a patient is in the hospital at that time, the data will not be captured - and the missingness is driven by the very utilisation being measured.
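To summarise in standard notation (our own addition, following the Little and Rubin reference rather than anything presented at the symposium): let R indicate that Y is observed, and let X denote the fully observed variables. Then:

```latex
\begin{aligned}
\text{MCAR:} \quad & \Pr(R = 1 \mid X, Y) = \Pr(R = 1)\\
\text{MAR:}  \quad & \Pr(R = 1 \mid X, Y) = \Pr(R = 1 \mid X)\\
\text{NMAR:} \quad & \Pr(R = 1 \mid X, Y)\ \text{depends on}\ Y\ \text{itself}
\end{aligned}
```

Complete-case analysis is generally unbiased only under MCAR; under MAR, valid inference is still possible but requires conditioning on X.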
An Extensive Problem
There are many instances where insufficient information is available, as shown in several recent papers, including a study by Hennessy et al. that examined the extent of missing data within the US Medicaid claims database in six states. This study specifically set out to capture information on prescriptions over time, hospitalizations, and differences by age group. A key finding was gaps in the prescription information, with some months showing numerous claims and others far fewer - spikes quite possibly due to the incomplete collection of data.
Similarly, a comparison of hospitalization records in the Medicaid claims data with claims data from CMS revealed state-to-state differences ranging from 30 percent lower to 210 percent higher than the CMS figures. Thus, even when the same information is being collected, systematic or large differences can arise between the respective claims databases.
Safe to Ignore?
So, what would happen if we simply ignored those patients who have missing data? As Dr. Cook observed, deleting records with missing values is common practice for dealing with item non-response, producing a reduced-size dataset of complete cases in which all the variables are fully observed. This approach has its advantages: it offers simplicity, since standard statistical packages can then be easily applied, and comparability, as all calculations proceed from a common base. List-wise deletion involves discarding all cases with any missing values, which may be perfectly appropriate in numerous situations, particularly if the number of deleted incomplete cases is relatively small or if the deleted cases are very similar to the complete cases.
In many situations, however, discarding incomplete cases is disadvantageous.
Firstly, if the deleted cases differ from the complete cases, estimates
based on complete cases will be biased. Secondly, the precision of model
estimates will be lower due to the smaller sample size.
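In code, list-wise deletion amounts to a single call; a minimal Python/pandas sketch on hypothetical data (not from the symposium):

```python
import numpy as np
import pandas as pd

# Hypothetical mini-dataset: cost and length of stay, each with a gap
df = pd.DataFrame({
    "cost": [100.0, np.nan, 250.0, 180.0],
    "los":  [3.0,   5.0,    np.nan, 4.0],
})

# List-wise deletion: keep only rows in which every variable is observed
complete_cases = df.dropna()
print(complete_cases)  # rows 0 and 3 survive; half the sample is discarded
```

The simplicity is obvious, but so is the cost: here half of the (hypothetical) sample disappears even though three-quarters of the individual values were observed.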
Documented case studies clearly illustrate the pitfalls of disregarding incomplete
cases. An assessment by Briggs, for example, which examined the cost of transurethral resection of the prostate (TURP) versus contact laser vaporization, found that 10 percent of the chosen data points were missing overall, but that at least 45 percent of patients had some form of missing information. Of particular interest was the finding that excluding individuals with any missing data suggested that TURP was more expensive than laser treatment.
However, when certain missing data strategies were applied, the reverse was
found to be the case.
The case of the United Kingdom Prospective Diabetes Study (UKPDS), looking at average hospital length of stay, also underscores the point. While there was good information on hospitalization overall (around 4,000 patients and 7,500 records of hospitalization for type 2 diabetics), there were gaps in the length of stay, to the extent that about 16 percent of the observations had information missing. Since only patients who were hospitalised could have missing information, ignoring them would introduce a downward bias in the estimated total or actual number of hospital days in the population. And those individuals with multiple hospitalizations would be even more likely to have missing data, which would pull the estimate down further - and bias it even more.
Before embarking on a research project, it is thus important to do as much
as possible in terms of study design to make sure that the required information
is collected and that the databases used have as much of the available
data as possible. However, with the best will in the world, missing data will occur, leading to biased estimates of resource use as well as cost and cost-effectiveness, and significantly compromising the ability to analyze endpoints of interest. We consider below some of the most popular approaches to the problem.
The most commonly used techniques to account for missing information
typically involve its replacement with values derived from non-missing observations
at the cohort level. In the majority of cases, a simple mean or median
replacement is used.
Among the data shown in Table 1, for instance, there is one missing observation on unit 10, variable 2. This is replaced with the arithmetic mean of the observed data for that variable, as shown in the table. However, this approach is clearly inappropriate for data that consist of only a small number of values, each corresponding to a specific category value or label. Nor does it lead to proper estimates of measures of association or regression coefficients - rather, associations tend to be diluted. Furthermore, if the imputed values are treated as real, variances will typically be underestimated.
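Since Table 1 itself is not reproduced here, the following Python sketch uses hypothetical values to illustrate the same operation; only the position of the missing observation (unit 10, variable 2) comes from the text:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for Table 1: ten units, variable 2 missing on unit 10
df = pd.DataFrame({
    "var1": [3.1, 2.8, 3.5, 2.9, 3.3, 3.0, 2.7, 3.2, 3.4, 2.6],
    "var2": [5.2, 4.9, 5.5, 5.1, 4.8, 5.0, 5.3, 4.7, 5.4, np.nan],
}, index=range(1, 11))

# Replace the missing value with the arithmetic mean of the observed values
df["var2"] = df["var2"].fillna(df["var2"].mean())
print(df)
```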
One potential application of this relatively simple approach involves the reimbursement fields that typically show up in US claims databases. The economic fields in US claims data are generally interrelated, hence
the feasibility of this technique. The “charged” amount represents the
amount billed by the provider or facility for the service. The “allowed” is the
amount that is contractually agreed between the payer and provider/facility.
The “paid” amount is the sum reimbursed by the payer, net of patient contributions such as co-payments.
Paid amounts, which are critical to any US economic evaluation from the
direct medical cost perspective, may be missing at rates of 5-15% depending
on the payer. This could be attributable to the capping of reimbursement for that particular payer, i.e. the provider/facility receives a fixed amount that is not adjusted for individual service volume, meaning that payment for those individual services is not captured. The payment field may also be missing because of billing or data-entry errors, or claim denials. In an internal study comparing a simple imputation based on the median ratio of paid to charged amounts with more sophisticated multivariate techniques, simple imputation was found to outperform linear regression and to be essentially comparable to weighted regression when results were compared with those from a validation sample in which all data were present.
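A minimal sketch of this median-ratio imputation (hypothetical figures; the field names simply mirror the charged/paid terminology above):

```python
import numpy as np
import pandas as pd

# Hypothetical claims: 'charged' is always present, 'paid' is partly missing
claims = pd.DataFrame({
    "charged": [120.0, 250.0, 80.0, 410.0, 95.0],
    "paid":    [90.0,  np.nan, 60.0, np.nan, 70.0],
})

# Median ratio of paid to charged among claims where both are observed
observed = claims.dropna(subset=["paid"])
median_ratio = (observed["paid"] / observed["charged"]).median()

# Impute each missing paid amount as charged amount * median ratio
missing = claims["paid"].isna()
claims.loc[missing, "paid"] = claims.loc[missing, "charged"] * median_ratio
print(claims)
```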
Other approaches to account for missing longitudinal data include baseline-based approaches and patient history-based approaches.
Baseline-Based Approaches
Simple Class Imputation: Similar to the simple population-based imputation methods, but applied within subgroups that are more or less homogeneous.
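A minimal sketch on hypothetical data, with `imputation_class` standing in for whatever subgrouping variable is used:

```python
import numpy as np
import pandas as pd

# Hypothetical scores; 'imputation_class' groups broadly similar patients
data = pd.DataFrame({
    "imputation_class": ["A", "A", "A", "B", "B"],
    "score":            [12.0, np.nan, 14.0, 20.0, 22.0],
})

# Class-mean imputation: fill each gap with the mean of the patient's
# own class rather than the mean of the whole cohort
class_mean = (
    data.groupby("imputation_class")["score"]
        .transform(lambda s: s.fillna(s.mean()))
)
print(class_mean)  # the gap in class A is filled with 13.0, not the cohort mean
```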
Hot Deck: The hot deck imputation method involves a recipient (the person with missing data) and a donor (another person with similar characteristics whose value is known). The recipient's missing value is then replaced with the donor's value at the appropriate time point. These class imputations imply that persons with missing data are a random sample of the persons in their class. The advantage of the hot deck method is that it is fairly simple, uses concomitant information, and can capture crude missing-data mechanisms. One of hot deck's disadvantages is the difficulty of defining patient "similarity", along with the assumption that the missing-data subset is generalizable to the class.
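A hot deck sketch, reusing the hypothetical `data` frame from the class-imputation example and drawing the donor at random from within the recipient's class (one simple variant; real implementations often match donors on several characteristics):

```python
# Hot deck: replace each missing value with a randomly drawn donor value
# from the same imputation class (assumes each class has at least one donor)
rng = np.random.default_rng(0)

def hot_deck(s: pd.Series) -> pd.Series:
    donors = s.dropna().to_numpy()  # observed values within this class
    return s.apply(lambda v: rng.choice(donors) if pd.isna(v) else v)

data["score"] = data.groupby("imputation_class")["score"].transform(hot_deck)
print(data)
```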
Iteratively Re-weighted Least-Squares Estimation: The IRLS method weights least-squares estimates with regression residuals from within the model. As the name suggests, a weighted least-squares fit is carried out inside an iteration loop. For each iteration, a set of weights for the observations is used in the least-squares fit, where the initial weights are based on residuals from an initial fit. IRLS can use both weighted and un-weighted least squares.
The advantage is that the computations are more tractable than other residual-based methods and the statistical properties easier to obtain. On the downside, however, weights are used to influence how each imputed value affects model parameters, with an implicit assumption that the researcher knows which values are likely to be of high versus low quality (or high versus low variability). Secondly, the missing data are expected to follow a particular form that can guide the weighting and re-weighting process. Where good information does not exist on either of these two points, the exercise will be no better than simpler methods.
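A minimal numerical sketch of the iteration itself (our own illustration, using Huber-type weights in the spirit of the Holland and Welsch reference; the function name and constants are ours):

```python
import numpy as np

def irls(X: np.ndarray, y: np.ndarray, n_iter: int = 25, delta: float = 1.345):
    """Robust regression via iteratively re-weighted least squares:
    observations with large residuals are progressively down-weighted."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]            # initial un-weighted fit
    for _ in range(n_iter):
        resid = y - X @ beta
        scale = np.median(np.abs(resid)) / 0.6745 + 1e-9   # robust residual scale
        u = np.abs(resid) / scale
        w = np.where(u <= delta, 1.0, delta / u)           # Huber weights
        XtW = X.T * w                                      # X' diag(w), row-wise
        beta = np.linalg.solve(XtW @ X, XtW @ y)           # weighted LS step
    return beta

# Example: a straight-line fit that shrugs off a handful of outliers
rng = np.random.default_rng(1)
x = rng.normal(size=50)
X = np.column_stack([np.ones(50), x])                      # intercept + slope
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=50)
y[:3] += 10.0                                              # contaminate three points
print(irls(X, y))                                          # close to [2, 3]
```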
Patient History-Based Approaches
Last Observation Carried Forward: This method is specific to
longitudinal data problems. For each individual, missing values
are replaced by the last observed value of that variable. As can be seen from Table 2, the three missing values for unit 1, at times 4, 5 and 6, are replaced by the value at time 3, namely 7.32. Likewise, the missing value for unit 3, at time 6, has been replaced by the value at time 5, which is 5.34.
Using LOCF, once the data set has been completed in this way it is analyzed
as if it were fully observed. The advantage of this technique is clearly that
the patient's own information is used to account for missing data. However,
for full longitudinal data analyses this is clearly disastrous, distorting means
and covariance structure. For single time point analyses the means are still
likely to be distorted, measures of precision are biased, and inferences are
therefore incorrect. This holds true even if the mechanism that causes the
data to be missing is completely random.
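A LOCF sketch in Python: only the two values quoted in the text (7.32 for unit 1 at time 3, and 5.34 for unit 3 at time 5) come from Table 2; everything else is hypothetical filler:

```python
import numpy as np
import pandas as pd

# Stand-in for Table 2: rows are time points, columns are units
wide = pd.DataFrame(
    {"unit1": [7.10, 7.25, 7.32, np.nan, np.nan, np.nan],
     "unit2": [5.00, 5.10, 5.20, 5.15,   5.25,   5.30],
     "unit3": [5.40, 5.38, 5.36, 5.35,   5.34,   np.nan]},
    index=pd.Index(range(1, 7), name="time"),
)

# LOCF: carry each unit's last observed value forward through the gaps
locf = wide.ffill()
print(locf)
```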
Another approach is called next observation carried backward (NOCB), which simply means that missing values are replaced by the next observed value of that variable. NOCB has the same advantages and limitations as LOCF.
It is important to note that unless the proportion of missing values is so small as to be unlikely to affect inferences, these simple ad-hoc methods should be avoided. However, 'small' is very hard to define: estimates of the likelihood of rare events can be highly sensitive to just a few missing observations. Likewise, a sample mean can be sensitive to missing observations that fall in the tails of the distribution.
Last and Next: Last and next is a method in which missing values, if they are preceded and followed by non-missing values, are replaced by the average or median of the preceding and following values. The availability of multiple estimates to inform a single missing time point is attractive; however, this technique has the same disadvantages as LOCF/NOCB.
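Continuing the LOCF sketch above, NOCB and last-and-next (here the average variant) are equally short:

```python
# NOCB: next observation carried backward
nocb = wide.bfill()

# "Last and next": interior gaps get the average of the preceding and
# following observed values; leading/trailing gaps remain missing,
# since they lack one of the two neighbours
last_next = wide.where(wide.notna(), (wide.ffill() + wide.bfill()) / 2)
print(last_next)
```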
How Do Approaches To Missing Data Compare?
We should consider how the various approaches to imputation stack up
against each other, based on a study by Engels and Diehr. This study evaluated 14 methods of imputation using data on self-reported depression levels from the Cardiovascular Health Study (CHS). The authors identified situations where a person had a known value following one or more missing values (i.e., where the probability of that element being missing was high), and treated the known value as a “missing value.” This “missing value” was imputed using each method and compared to the observed value. Two measures (root mean-square deviation and mean absolute deviation) were used to assess how close the imputed values were to the actual results.
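Both accuracy measures are straightforward to compute; a minimal sketch (the function names are ours):

```python
import numpy as np

def rmsd(imputed: np.ndarray, actual: np.ndarray) -> float:
    """Root mean-square deviation between imputed and observed values."""
    return float(np.sqrt(np.mean((imputed - actual) ** 2)))

def mad(imputed: np.ndarray, actual: np.ndarray) -> float:
    """Mean absolute deviation between imputed and observed values."""
    return float(np.mean(np.abs(imputed - actual)))
```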
The results of these 14 methods of imputation are shown in Figure 1. All the
techniques to the left of “Regression” are patient data-driven techniques. The
lower the root mean-square deviation, the closer the approximated values
match the actual values. To the right of LOCF there is a large variation in the
root mean-square values, which are also generally higher themselves.
The analysis revealed the power of using a patient's own profile to impute missing data, leading the authors to conclude that, when available, data from a person's own longitudinal history are superior for imputation purposes to the other techniques discussed.
In conclusion, we can say that missing data are a fact of life in economic evaluations; they arise for many different reasons and may lead to biased estimates of resource use and cost. Decisions on the best approach to handling them should be based on the nature, cause and mechanism of the missing information, as well as a good understanding of the observed and unobserved dynamics in the database of interest. Wherever feasible, data from the individual's own characteristics and/or longitudinal profile should be used in imputation.
Little RJA, Rubin DB. Statistical Analysis with Missing Data. 2nd ed. Hoboken, NJ: John Wiley & Sons; 2002.
Hennessy S, et al. Descriptive analyses of the integrity of a US Medicaid claims database. Pharmacoepidemiol Drug Saf 2003;12:103-11.
Wood AM, White IR, Thompson SG. Are missing outcome data adequately handled? A review of published randomized controlled trials in major medical journals. Clin Trials 2004;1:368-76.
Madow WG, Olkin I, Rubin DB (Eds.). Incomplete Data in Sample Surveys: Theory and Bibliographies. New York: Academic Press; 1983.
Brick JM, Kalton G. Handling missing data in survey research. Stat Methods Med Res 1996;5:215-38.
Holland PW, Welsch RE. Robust regression using iteratively reweighted least-squares. Comm Statist Theory Methods 1977;A6:813-27.
Engels JM, Diehr P. Imputation of missing longitudinal data: a comparison of methods. J Clin Epidemiol 2003;56:968-76.