8 Handling Missing Data

Missing data are not rare in real data sets. In fact, the chance that at least one data point is missing increases as the data set size increases. Missing data can occur in any number of ways, some of which include the following.

Merging of source data sets: A simple example commonly occurs when two data sets are merged by a sample identifier (ID). If an ID is present in only the first data set, then the merged data will contain missing values for that ID for all of the predictors in the second data set.
Random events: Any measurement process is vulnerable to random events that prevent data collection. Consider the setting where data are collected in a medical diagnostic lab. Accidental misplacement or damage of a biological sample (like blood or serum) would prevent measurements from being made on the sample, thus inducing missing values. Devices that collect actigraphy data can also be affected by random events. For example, if a battery dies or the collection device is damaged then measurements cannot be collected and will be missing in the final data.
Failures of measurement: Measurements based on images require that an image be in focus. Images that are not in focus or are damaged can induce missing values. Another example of a failure of measurement occurs when a patient in a clinical study misses a scheduled physician visit. Measurements that would have been taken for the patient at that visit would then be missing in the final data.
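The merging case above can be sketched in a few lines. This is a minimal illustration, not a prescribed method: the two tables, the sample IDs, the column names, and the helper `left_merge` are all hypothetical, and `None` is used to stand in for a missing value.

```python
# Two hypothetical source tables keyed by sample ID.
lab_results = {
    "S1": {"glucose": 5.1},
    "S2": {"glucose": 6.3},
    "S3": {"glucose": 4.8},
}
imaging = {
    "S1": {"lesion_area": 12.0},
    "S3": {"lesion_area": 7.5},
}


def left_merge(left, right, right_columns):
    """Keep every ID in `left`; use None for `right` columns when the ID is absent."""
    merged = {}
    for sample_id, left_row in left.items():
        right_row = right.get(sample_id, {})
        row = dict(left_row)
        for column in right_columns:
            # An ID found only in the first table yields None here.
            row[column] = right_row.get(column)
        merged[sample_id] = row
    return merged


merged = left_merge(lab_results, imaging, ["lesion_area"])
for sample_id, row in merged.items():
    print(sample_id, row)
```

Sample "S2" appears only in the first table, so its `lesion_area` entry is missing after the merge.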

The goal of feature engineering is to get the predictors into a form that models can better utilize in relating the predictors to the response. For example, projection methods (Section 6.3) or autoencoder transformations (Section 6.3.2) for continuous predictors can lead to a significant improvement in predictive performance. Likewise, likelihood encoding for categorical predictors (Section 5.4) could be predictively beneficial. These feature engineering techniques, as well as many others discussed throughout this work, require that the data have no missing values. Moreover, missing values in the original predictors, regardless of any feature engineering, are intolerable in many kinds of predictive models. Therefore, to utilize predictors or feature engineering techniques, we must first address the missingness in the data. In addition, the missingness itself may be an important predictor of the response.
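One simple way to let a model use the missingness itself as a predictor is to record it in a separate indicator column before the missing values are resolved. The sketch below assumes rows are stored as dictionaries with `None` for missing values; the helper `add_missing_indicator` and the column names are hypothetical.

```python
def add_missing_indicator(rows, column):
    """Return copies of `rows` with a 0/1 flag recording whether `column` is missing."""
    flagged = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left unchanged
        new_row[f"{column}_missing"] = 1 if row.get(column) is None else 0
        flagged.append(new_row)
    return flagged


# Hypothetical records; None marks a missing measurement.
records = [
    {"id": 1, "serum_level": 3.4},
    {"id": 2, "serum_level": None},
    {"id": 3, "serum_level": 2.9},
]
flagged = add_missing_indicator(records, "serum_level")
for row in flagged:
    print(row)
```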

In addition to measurements being missing within the predictors, measurements may also be missing within the response. Most modeling techniques cannot utilize samples that have missing response values in the training data. However, an approach known as semi-supervised learning has the ability to utilize samples with unknown response values in the training set. Semi-supervised learning methods are beyond the scope of this chapter; for further reference see Zhu and Goldberg (2009). Instead, this chapter will focus on methods for resolving missing values within the predictors.

Types of Missing Data

The first and most important question when encountering missing data is “why are these values missing?” Sometimes the answer might already be known or could be easily inferred from studying the data. If the data stem from a scientific experiment or clinical study, information from laboratory notebooks or clinical study logs may provide a direct connection to the samples collected or to the patients studied that will reveal why measurements are missing. But for many other data sets, the cause of missing data may be impossible to determine. In cases like this, we need a framework for understanding missing data. This framework will, in turn, lead to appropriate techniques for handling the missing information.

One framework to view missing values is through the lens of the mechanisms of missing data. Three common mechanisms are:

Structural deficiencies in the data,
Random occurrences, and
Specific causes.

A structural deficiency can be defined as a missing component of a predictor that was omitted from the data. This type of missingness is often the easiest to resolve once the necessary component is identified. The Ames housing data provide an example of this type of missingness in the Alley predictor. This predictor takes values of “gravel” or “paved”, or is missing. Here, 93.2 percent of homes have a missing value for alley. It may be tempting to simply remove this predictor because most of the values are missing. However, doing this would throw away valuable predictive information about home price since missing, in this case, means that the property has no alley. A better encoding for the Alley predictor might be to replace the missing values with “No Alley Access”.
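The recoding described for the Alley predictor amounts to treating the missing value as its own category. A minimal sketch follows; the five values are made up for illustration and are not taken from the Ames data.

```python
from collections import Counter

# Hypothetical Alley values; None marks a missing entry.
alley = ["gravel", None, None, "paved", None]

# Replace the structural missing values with an explicit category.
alley_recoded = ["No Alley Access" if value is None else value for value in alley]

print(Counter(alley_recoded))
```

After recoding, the predictor has three levels and no missing values, so the information that a property has no alley is retained.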

A second reason for missing values is due to random occurrences. Little and Rubin (2014) subdivide this type of randomness into two categories:

Missing completely at random (MCAR): the likelihood of a missing result is equal for all data points (observed or unobserved). In other words, the missing values are independent of the data. This is the best case situation.
Missing at random (MAR): the likelihood of a missing result is not equal for all data points (observed or unobserved). In this scenario, the probability of a missing result depends on the observed data but not on the unobserved data.
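A small simulation can make the distinction between these two categories concrete. The sketch below is illustrative only: the predictor `age`, the 20 percent MCAR rate, and the 5 and 35 percent MAR rates are assumptions chosen for the example, and the function names are hypothetical.

```python
import random


def simulate(n=10_000, seed=0):
    """Generate (age, mcar_missing, mar_missing) tuples for n samples."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        age = rng.uniform(20, 80)  # a fully observed predictor
        # MCAR: every sample has the same chance of a missing result.
        mcar_missing = rng.random() < 0.20
        # MAR: the chance of a missing result depends only on the observed age.
        mar_missing = rng.random() < (0.05 if age < 50 else 0.35)
        rows.append((age, mcar_missing, mar_missing))
    return rows


def missing_rate(rows, index, keep):
    """Fraction of missing results among the rows whose age satisfies `keep`."""
    selected = [row for row in rows if keep(row[0])]
    return sum(row[index] for row in selected) / len(selected)


rows = simulate()
for label, index in [("MCAR", 1), ("MAR", 2)]:
    younger = missing_rate(rows, index, lambda age: age < 50)
    older = missing_rate(rows, index, lambda age: age >= 50)
    print(f"{label}: age < 50 -> {younger:.3f}, age >= 50 -> {older:.3f}")
```

Under MCAR the two age groups should show roughly the same missing rate, while under MAR the rate should differ by group even though it does not depend on the unobserved value itself.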

In practice, it can be very difficult or impossible to distinguish whether the missing values have the same likelihood of occurrence for all data, or have different likelihoods for observed or unobserved data. For the purposes of this text, the methods described herein can apply to either case. We refer the reader to Little and Rubin (2014) for a deeper understanding of these nuanced cases.

A third mechanism of missing data is missingness due to a specific cause, also called not missing at random (NMAR) (Little and Rubin 2014). This type of missingness often occurs in clinical studies where patients are measured periodically over time. For example, a patient may drop out of a study due to an adverse side effect of a treatment or due to death. For this patient, no measurements will be recorded after the time of drop-out. Data that are not missing at random are the most challenging to handle. The techniques presented here may or may not be appropriate for this type of missingness. Therefore, we must make a good effort to understand the nature of the missing data prior to implementing any of the techniques described below.
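The drop-out example produces a recognizable pattern: every scheduled measurement from the drop-out point onward is missing. The sketch below only illustrates that pattern; the visit values and the helper `apply_dropout` are hypothetical, and the code does not model why the patient dropped out.

```python
def apply_dropout(measurements, dropout_visit):
    """Blank out every measurement from `dropout_visit` (0-based index) onward."""
    return [
        value if visit < dropout_visit else None
        for visit, value in enumerate(measurements)
    ]


# Hypothetical values that would have been recorded at five scheduled visits.
scheduled = [7.2, 6.9, 6.4, 6.1, 5.8]

# The patient leaves the study before the fourth visit (index 3).
observed = apply_dropout(scheduled, dropout_visit=3)
print(observed)
```

Because the reason for leaving may be related to the values that were never recorded, the observed visits alone cannot be assumed to represent the missing ones.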

This chapter will illustrate ways of assessing the nature and severity of missing values in the data, highlight models that can be used when missing values are present, and review techniques for removing or imputing missing data.

For illustrations, we will use the Chicago train ridership data (Chapter 4) and the scat data from Reid (2015) (previously seen in Section 6.3.3). The latter data set contains information on animal droppings that were collected in the wild. A variety of measurements were made on each sample including morphological observations (i.e., shape), location/time information, and laboratory tests. The DNA in each sample was genotyped to determine the species of the sample (gray fox, coyote, or bobcat). The goal for these data is to build a predictive relationship between the scat measurements and the species. After gathering a new scat sample, the model would be used to predict the species that produced the sample. Out of the 110 collected scat samples, 19 had one or more missing predictor values.
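A count such as "samples with one or more missing predictor values" can be obtained with a short summary. The sketch below assumes rows stored as dictionaries with `None` for missing values; the four rows and their column names are invented for illustration and are not the scat data.

```python
def summarize_missing(rows):
    """Count missing values per column and rows with at least one missing value."""
    columns = rows[0].keys()  # assumes every row has the same columns
    per_column = {
        column: sum(row[column] is None for row in rows) for column in columns
    }
    incomplete_rows = sum(
        any(value is None for value in row.values()) for row in rows
    )
    return per_column, incomplete_rows


# Hypothetical samples; None marks a missing predictor value.
samples = [
    {"mass": 15.9, "diameter": 25.7, "taper": 41.9},
    {"mass": 17.6, "diameter": None, "taper": 37.1},
    {"mass": None, "diameter": None, "taper": 16.5},
    {"mass": 8.4, "diameter": 18.1, "taper": 24.7},
]
per_column, incomplete_rows = summarize_missing(samples)
print(per_column)
print(f"{incomplete_rows} of {len(samples)} samples have a missing value")
```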