The Official News & Technical Journal Of The International Society For Pharmacoeconomics And Outcomes Research

The Missing Link: Managing Missing Data in Economic Evaluations

Lieven Annemans PhD, MSc, Health Economist, Ghent University, Senior Consultant, Global Health Economics, HEDM and IMS Health, Brussels, Belgium, and Dan Ollendorf MPH, Vice President, Applied Research at PharMetrics (a unit of IMS), Watertown, MA, USA

The issue of missing data is a major challenge in economic evaluations, often impacting a researcher's ability to draw valid and conclusive inferences. Reliance on subsets of “complete” information or incorrect use of imputation techniques compounds the problem. Lieven Annemans and Dan Ollendorf report on some of the issues and solutions discussed in a recent IMS symposium, “Handling Missing Data in Economic Evaluations,” held at the ISPOR 11th Annual International Meeting in Philadelphia, PA, USA, May 2006. Speaking at this symposium was Dr. John R. Cook, Director of Health Economic Statistics at Merck; offering potential solutions was Mr. Daniel A. Ollendorf, Vice President, Applied Research at PharMetrics (a unit of IMS).

Health economic evaluations typically draw on longitudinal information - from prospective randomised clinical trials, retrospective databases, or medical file analyses. Any gaps in the information available thus present a major challenge for the successful completion of a meaningful evaluation that is based on valid assumptions.

An Inevitable Phenomenon
To a certain extent, missing data are an inevitable issue in economic evaluations. A feature of both prospective analyses (due to patient dropouts) and retrospective evaluations (due to the nature of databases) alike, missing information can occur in all types of data collection, from postal surveys to rigorously controlled clinical trials. The key challenge is to maximise usage of data that are available while minimising the bias introduced by the elements that are missing.

Why Does Data Go Missing?
(The term “missing data” essentially refers to observations which were intended to be collected but which for one reason or another were not.)

The nature of missing information is variable. In patient-completed surveys and prospectively controlled trials, for example, data are typically lost in one of two ways - systematically or randomly.

This can be illustrated using the title “The Importance of Handling Missing Data”. A systematic loss of data might take the form of “Th mprtnc f hndlng mssng dt” where the vowels are missing in every word. A random loss might appear as “The mprtac o ndin Missing Data” reflecting the lack of any specific pattern to the missing information. The important point is that in either of these cases, different messages will potentially emerge.

This will also be the case with other types of missing data which include:

  • Time point missing data: This is typically seen in patient-reported outcome (PRO) studies or clinical trials due to a missed scheduled visit. Extending our title analogy, this would lead to '… Importance of Handling … Data'.

  • Censoring: Censoring is fairly common in clinical trials and also within retrospective databases. Trials, for example, are often truncated before patients reach the specified endpoint, and in longitudinal databases, the starting point for patients can occur after they have commenced treatment, leaving researchers without an anchor point. Similarly, patients can leave databases at varying points in time leading to gaps. Censoring can occur at both the beginning ('… of Handling Missing Data') or at the end of a trial ('The Importance of Handling …').

    Missing data mechanisms are important to understand, as the necessary adjustments will differ. We distinguish different mechanisms as follows:

  • Missing Completely at Random (MCAR): In cases where the probability of an observation being missing does not depend on observed or unobserved measurements, it can be said that the observation is missing completely at random, which is often abbreviated to MCAR. An example of a MCAR mechanism is a laboratory sample being dropped.

  • Missing at Random (MAR): With MAR, the probability that a value is missing may depend on the observed data, but not on the unobserved value itself; the data are conditionally missing at random. Formally, if X is the vector of observed variables and Y is the variable with missing values, the probability that Y is missing depends only on the value of X. For instance, non-compliance may depend on adverse events: given that adverse experience, whether one person has missing data while another has complete data is a random phenomenon. Conditional upon X, the observed cases form a random sub-sample, but data are needed to provide some information on the X's. One of the problems here, noted Dr. Cook, particularly in retrospective studies, is that the complete observed data may not be available to establish that conditional independence.

  • Not Missing at Random (NMAR): Noted to be one of the most problematic kinds of missing data, this is where the dropout depends on the missing, unobserved data themselves. An example might be a form sent out to patients asking them to respond within a set period of time, to collect information on the number of outpatient care visits they have utilized. Where a patient is in the hospital at that time, the data will not be captured - and the very reason for the gap is tied to the unobserved outcome.
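The three mechanisms above can be illustrated with a small simulation. Everything below is invented for illustration (the cohort, the adverse-event covariate, and the dropout probabilities); the point is simply that the complete-case mean stays unbiased only under MCAR.

```python
import random

random.seed(42)

# Hypothetical cohort: x is an observed covariate (adverse event yes/no),
# and the outcome y (e.g. a cost) tends to be higher when x is present.
cohort = []
for _ in range(20_000):
    x = random.random() < 0.3
    cohort.append({"x": x, "y": random.gauss(130 if x else 100, 20)})

def is_missing(rec, mechanism):
    """Decide whether y goes missing under each mechanism (illustrative rates)."""
    if mechanism == "MCAR":   # independent of observed and unobserved data
        return random.random() < 0.2
    if mechanism == "MAR":    # depends only on the observed covariate x
        return random.random() < (0.4 if rec["x"] else 0.1)
    # NMAR: depends on the unobserved outcome y itself
    return random.random() < (0.4 if rec["y"] > 115 else 0.05)

true_mean = sum(r["y"] for r in cohort) / len(cohort)

complete_case_mean = {}
for mech in ("MCAR", "MAR", "NMAR"):
    observed = [r["y"] for r in cohort if not is_missing(r, mech)]
    complete_case_mean[mech] = sum(observed) / len(observed)
    print(mech, round(complete_case_mean[mech], 1))
# Only the MCAR complete-case mean stays close to the true cohort mean;
# under MAR and NMAR the high-cost patients are the ones who go missing.
```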

An Extensive Problem
There are many instances where insufficient information is available, as the findings of several recent papers demonstrate. A study by Hennessy et al., for example, examined the extent of missing data within the US Medicaid claims databases of six states [2]. It specifically set out to capture information on prescriptions over time, hospitalizations, and differences by age group. A key finding was gaps in the prescription information, with some months showing numerous claims and others far fewer - spikes and troughs quite possibly due to incomplete collection of the data.

Similarly, a comparison of hospitalization records in the Medicaid claims data against claims data from CMS revealed state-to-state differences ranging from 30 percent lower to 210 percent higher than the CMS figures. Thus, even when the same information is being collected, large or systematic differences can appear between the respective claims databases.

Safe to Ignore?
So, what would happen if we simply ignored those patients who have missing data? As Dr. Cook observed, deleting records with missing values is common practice for dealing with item non-response, producing a reduced-size dataset of complete cases in which all the variables are fully observed. This approach has its advantages: it offers simplicity, since standard statistical packages can be applied directly, and comparability, as all calculations proceed from a common base [1]. List-wise deletion involves discarding all cases with any missing values, which may be perfectly appropriate in numerous situations, particularly if the number of deleted incomplete cases is relatively small or if the deleted cases are very similar to the complete cases.

In many situations, however, discarding incomplete cases is disadvantageous. Firstly, if the deleted cases differ from the complete cases, estimates based on complete cases will be biased. Secondly, the precision of model estimates will be lower due to the smaller sample size.

Documented case studies clearly illustrate the pitfalls of disregarding incomplete cases. An assessment by Briggs, for example, comparing the costs of transurethral resection of the prostate (TURP) and contact laser vaporization, found that while only 10 percent of the chosen data points were missing overall, at least 45 percent of patients had some form of missing information. Of particular interest was the finding that excluding individuals with any missing data suggested TURP was more expensive than laser treatment; when appropriate missing data strategies were applied, the reverse was found to be the case.

The case of the United Kingdom Prospective Diabetes Study (UKPDS), looking at average hospital length of stay, also underscores the point. While there was good information on hospitalization overall (around 4,000 patients and 7,500 records of hospitalization for type 2 diabetics), there were gaps in the length of stay, to the extent that about 16% of the observations had information missing. Since only patients who were hospitalised had missing information, ignoring them would introduce a downward bias in the estimated total number of hospital days in the population. And individuals with multiple hospitalizations would be even more likely to have missing data, which would pull the estimate down further - and bias it even more.

Before embarking on a research project, it is thus important to do as much as possible in terms of study design to make sure that the required information is collected and that the databases used have as much of the available data as possible. However, with the best will in the world, missing data will occur, leading to biased estimates of resource use as well as cost and cost effectiveness.

Potential Solutions
Missing data can significantly compromise the ability to analyze endpoints of interest. We consider some of the most popular approaches to the problem.

The most commonly used techniques to account for missing information typically involve its replacement with values derived from non-missing observations at the cohort level. In the majority of cases, a simple mean or median replacement is used.

Among the data shown in Table 1, for instance, there is one missing observation on unit 10, variable 2. This is replaced with the arithmetic mean of the observed data for that variable as shown in italics.

However, this approach is clearly inappropriate for categorical data, where each value corresponds to a specific category or label. Nor does it lead to proper estimates of measures of association or regression coefficients - rather, associations tend to be diluted. Furthermore, if the imputed values are treated as real, variances will typically be underestimated.
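A minimal sketch of mean replacement, using hypothetical values in place of Table 1 (only the structure - one missing entry for unit 10 on variable 2 - is taken from the text), which also shows the variance shrinkage just described:

```python
import statistics

# Variable 2 from a table like Table 1: the entry for unit 10 is missing.
# These values are invented for illustration.
var2 = [4.1, 3.8, 5.0, 4.6, 3.9, 4.4, 5.2, 4.0, 4.7, None]

observed = [v for v in var2 if v is not None]
mean_obs = statistics.mean(observed)

# Replace the missing entry with the arithmetic mean of the observed data.
imputed = [v if v is not None else round(mean_obs, 2) for v in var2]

# Treating the imputed value as real shrinks the variance, because a point
# sitting exactly at the mean adds no spread but inflates the sample size.
print(statistics.variance(observed))
print(statistics.variance(imputed))   # smaller than the observed-data variance
```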

One potential application of this relatively simple approach is in comparing the reimbursement fields that typically appear in US claims databases. The economic fields in US claims data are generally interrelated, hence the feasibility of this technique. The “charged” amount is the amount billed by the provider or facility for the service; the “allowed” amount is the amount contractually agreed between the payer and provider/facility; and the “paid” amount is the sum reimbursed by the payer, net of patient contributions such as co-payments.

Paid amounts, which are critical to any US economic evaluation from the direct medical cost perspective, may be missing at rates of 5-15% depending on the payer. This could be attributable to capped reimbursement for that particular payer, i.e. the provider/facility receives a fixed amount that is not adjusted for individual service volume, so payment for those individual services is never captured. There could also be billing or data-entry errors, or claim denials, leaving the payment field empty. In an internal study comparing simple imputation using the median ratio of paid to charged amounts with more sophisticated multivariate techniques, simple imputation was found to outperform linear regression and to be essentially comparable to weighted regression when results were compared against a validation sample in which all data were available.
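The ratio-based imputation described above might be sketched as follows. The claim amounts are invented; the median paid-to-charged ratio from the complete claims fills the missing paid fields:

```python
import statistics

# Hypothetical claim lines: "charged" is always present, "paid" is sometimes
# missing (denied claim, capped reimbursement, data-entry gap, ...).
claims = [
    {"charged": 200.0, "paid": 120.0},
    {"charged": 150.0, "paid": 95.0},
    {"charged": 400.0, "paid": 230.0},
    {"charged": 100.0, "paid": None},
    {"charged": 320.0, "paid": 190.0},
    {"charged": 250.0, "paid": None},
]

# Median paid-to-charged ratio, taken from the complete claims only.
ratios = [c["paid"] / c["charged"] for c in claims if c["paid"] is not None]
median_ratio = statistics.median(ratios)

# Impute each missing paid amount as charged * median ratio.
for c in claims:
    if c["paid"] is None:
        c["paid"] = round(c["charged"] * median_ratio, 2)

print([c["paid"] for c in claims])
```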

Other approaches to account for missing longitudinal data include baseline based approaches and patient history-based approaches.

Baseline Based Approaches
Simple class imputation: Similar to the simple population-based imputation methods, but applied within subgroups that are more or less homogeneous.

Hot Deck: The hot deck imputation method involves a recipient (the person with missing data) and a donor (another person with similar characteristics, whose value is known) [3]. The recipient's missing value is then replaced with the donor's value at the appropriate time point. These class imputations imply that persons with missing data are a random sample of the persons in their class [4]. The advantage of the hot deck method is that it is fairly simple, uses concomitant information, and can capture crude missing data mechanisms. Its disadvantages lie in the difficulty of defining patient "similarity", as well as in the assumption that the missing data subset is generalizable to the class.
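A hot deck draw might be sketched as below. The imputation class here (age band and sex) and all values are hypothetical; defining the class variables is exactly the "similarity" problem noted above.

```python
import random

random.seed(1)

# Hypothetical donor pool and a recipient with a missing cost value.
donors = [
    {"age_band": "65+",   "sex": "F", "cost": 1100.0},
    {"age_band": "65+",   "sex": "F", "cost": 950.0},
    {"age_band": "40-64", "sex": "M", "cost": 400.0},
]
recipient = {"age_band": "65+", "sex": "F", "cost": None}

def hot_deck(recipient, donors):
    """Replace the missing value with one drawn at random from a donor
    in the same imputation class (same age band and sex)."""
    pool = [d for d in donors
            if d["age_band"] == recipient["age_band"]
            and d["sex"] == recipient["sex"]
            and d["cost"] is not None]
    recipient["cost"] = random.choice(pool)["cost"]
    return recipient

hot_deck(recipient, donors)
print(recipient["cost"])  # one of the matching donors' values
```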

Iteratively Re-weighted Least-Squares Estimation: The IRLS method weights least-squares estimates using regression residuals from within the model [6]. As the name suggests, a weighted least-squares fit is carried out inside an iteration loop: at each iteration, a set of weights for the observations is used in the least-squares fit, where the initial weights are based on residuals from an initial fit. IRLS can use both weighted and un-weighted least squares.

The advantage is that the computations are more tractable than other residual methods and the statistical properties easier to obtain. On the downside, the weights determine how much each value influences the model parameters, with the implicit assumption that the analyst knows which values are likely to be of high versus low quality (or high versus low variability). Secondly, the missing data are expected to follow a particular form that can guide the weighting and re-weighting process. Where good information does not exist on either of these points, the exercise will be no better than the simple methods.
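The iteration loop can be sketched as follows for a simple straight-line fit. The data are invented, and the weighting scheme (inverse absolute residual, which roughly approximates a least-absolute-deviations fit) is just one of several possible choices:

```python
def wls(x, y, w):
    """Closed-form weighted least-squares fit of y = a + b*x."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    b = (sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
         / sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x)))
    return my - b * mx, b

x = [1, 2, 3, 4, 5, 6]
y = [2.1, 4.0, 6.2, 7.9, 10.1, 30.0]      # the last point is a gross outlier

_, ols_slope = wls(x, y, [1.0] * len(x))  # initial, un-weighted fit

# Iterate: re-weight each observation by the inverse of its absolute residual,
# so poorly fitted (high-variability) points carry less influence.
w = [1.0] * len(x)
for _ in range(25):
    a, b = wls(x, y, w)
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    w = [1.0 / max(abs(r), 1e-6) for r in resid]

irls_slope = b
print(round(ols_slope, 2), round(irls_slope, 2))
# The outlier inflates the un-weighted slope; the re-weighted fit resists it.
```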

Patient History-Based Approaches
Last Observation Carried Forward: This method is specific to longitudinal data problems. For each individual, missing values are replaced by the last observed value of that variable. As can be seen from Table 2, the three missing values for unit 1, at times 4, 5 and 6, are replaced by the value at time 3, namely 7.32. Likewise the missing value for unit 3, at time 6, is replaced by the value at time 5, which is 5.34.

Using LOCF, once the data set has been completed in this way it is analyzed as if it were fully observed. The advantage of this technique is clearly that the patient's own information is used to account for missing data. However, for full longitudinal data analyses this is clearly disastrous, distorting means and covariance structure. For single time point analyses the means are still likely to be distorted, measures of precision are biased, and inferences are therefore incorrect. This holds true even if the mechanism that causes the data to be missing is completely random.

Another approach is called next observation carried backward (NOCB), which simply means that missing values are replaced by the next observed value of that variable. NOCB has the same advantages and limitations as LOCF.

It is important to note that unless the proportion of missing values is so small as to be unlikely to affect inferences, these simple ad-hoc methods should be avoided. However, 'small' is very hard to define: estimates of the likelihood of rare events can be highly sensitive to just a few missing observations, and a sample mean can likewise be sensitive to missing observations in the tails of the distribution.

Last and Next: Last and next is a method where missing values, if they are preceded and followed by non-missing values, will be replaced by the average or median of the preceding and following values. The availability of multiple estimates to inform a single missing data time-point is attractive; however, this technique has the same disadvantages as LOCF/NOCB.
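The three history-based fills just described (LOCF, NOCB, last-and-next) can be sketched together. Only the values 7.32 and 5.34 come from the Table 2 discussion; the remaining observations are hypothetical.

```python
def locf(series):
    """Last observation carried forward (None marks a missing value)."""
    out, last = [], None
    for v in series:
        last = v if v is not None else last
        out.append(last)
    return out

def nocb(series):
    """Next observation carried backward: LOCF run in reverse."""
    return locf(series[::-1])[::-1]

def last_and_next(series):
    """Interior gaps get the average of the carried-forward and carried-back
    values; leading/trailing gaps fall back to whichever side exists."""
    fwd, back = locf(series), nocb(series)
    return [v if v is not None
            else (fwd[i] + back[i]) / 2 if fwd[i] is not None and back[i] is not None
            else fwd[i] if fwd[i] is not None else back[i]
            for i, v in enumerate(series)]

unit1 = [7.87, 7.93, 7.32, None, None, None]   # trailing gap, as for unit 1
unit3 = [5.2, None, 5.3, None, 5.34, None]     # hypothetical interior gaps

print(locf(unit1))           # the value at time 3 (7.32) fills times 4-6
print(last_and_next(unit3))  # interior gaps averaged; trailing gap carried forward
```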

How Do Approaches To Missing Data Compare?
We should consider how the various approaches to imputation stack up against each other, based on a study by Engels and Diehr, which evaluated 14 methods of imputation using data on self-reported depression levels from the Cardiovascular Health Study (CHS) [7]. The authors identified situations where a person had a known value following one or more missing values (i.e., a value whose probability of being missing was high), and treated that known value as a “missing value.” This “missing value” was imputed using each method and compared to the observed value. Two measures (root mean-square deviation and mean absolute deviation) were used to assess how close the imputed values came to the actual results.
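The two closeness measures are straightforward to compute; a sketch with invented imputed/actual pairs:

```python
import math

# Hypothetical pairs: actual observed values vs. what an imputation produced.
actual  = [12.0, 15.5, 9.8, 14.2, 11.1]
imputed = [11.5, 16.0, 10.4, 13.0, 11.3]

errors = [i - a for i, a in zip(imputed, actual)]

rmsd = math.sqrt(sum(e * e for e in errors) / len(errors))  # root mean-square deviation
mad  = sum(abs(e) for e in errors) / len(errors)            # mean absolute deviation

print(round(rmsd, 3), round(mad, 3))
```

RMSD penalises large individual misses more heavily than MAD, which is why the two measures can rank imputation methods differently.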

The results of these 14 methods of imputation are shown in Figure 1. All the techniques to the left of “Regression” are patient data-driven techniques. The lower the root mean-square deviation, the more closely the imputed values match the actual values. To the right of LOCF there is large variation in the root mean-square values, which are also generally higher.

The analysis revealed the power of using a patient's own profile to impute missing data, leading the authors to conclude that, when available, data from a person's own longitudinal history are superior for imputation purposes to the other techniques discussed.

In conclusion, we can say that missing data are a fact of life in economic evaluations; they arise for many different reasons and may lead to biased estimates of resource use and cost. Decisions on the best approach to handling them should be based on the nature, cause and mechanism of the missing information, as well as a good understanding of the observed and unobserved dynamics in the database of interest. Wherever feasible, data from the individual's characteristics and/or longitudinal profile should be used in imputation.


  1. Little RJA, Rubin DB. Statistical Analysis with Missing Data. Hoboken, NJ: John Wiley & Sons, Inc., 2002.

  2. Hennessy S, et al. Descriptive analyses of the integrity of a US Medicaid claims database. Pharmacoepidemiol Drug Saf 2003;12:103-11.

  3. Wood AM, White IR, Thompson SG. Are missing outcome data adequately handled? A review of published randomized controlled trials in major medical journals. Clin Trials 2004;1:368-76.

  4. Madow WG, Olkin I, Rubin D (Eds.) Incomplete data in sample surveys (1st Ed.), Theory and Bibliographies. New York: Academic Press; 1983.

  5. Brick JM, Kalton G. Handling missing data in survey research. Stat Methods Med Res 1996;5:215-

  6. Holland PW, Welsch RE. Robust regression using iteratively reweighted least-squares. Comm Statist Theory Methods 1977;A6:A813-27.

  7. Engels JM, Diehr P. Imputation of missing longitudinal data: a comparison of methods. J Clin Epidemiol 2003;56:968-76.
