8.2 Models that are Resistant to Missing Values

Many popular predictive models such as support vector machines, the glmnet, and neural networks, cannot tolerate any amount of missing values. However, there are a few predictive models that can internally handle incomplete data71. Certain implementations of tree-based models have clever procedures to accommodate incomplete data. The CART methodology (Breiman et al. 1984) uses the idea of surrogate splits. When creating a tree, a separate set of splits are cataloged (using alternate predictors than the current predictor being split) that can approximate the original split logic if that predictor value is missing. Figure 8.6 displays the recursive partitioning model for the animal scat data. All three predictors selected by the tree contain missing values as illustrated in Figure 8.1. The initial split is based on the carbon/nitrogen ratio (CN < 8.7). When a sample has a missing value for CN, then the CART model uses an alternative split based on the indicator for whether the scat was flat or not. These two splits result in the same partitions for 80.6% of the data of the data and could be used if the carbon/nitrogen ratio is missing. Moving further down the tree, the surrogate predictors for d13C and Mass are Mass and d13C, respectively. This is possible since these predictors are not simultaneously missing.

A recursive partitioning tree for the animal scat data.  All three predictors in this tree contain missing values but can be used due to surrogate predictors.

Figure 8.6: A recursive partitioning tree for the animal scat data. All three predictors in this tree contain missing values but can be used due to surrogate predictors.

C5.0 (Quinlan 1993; Kuhn and Johnson 2013) takes a different approach. Based on the distribution of the predictor with missing data, fractional counts are used in subsequent splits. For example, this model’s first split for the scat data is d13C > 24.55. When this statement is true, all 13 training set samples are coyotes. However, there is a single missing value for this predictor. The counts of each species in subsequent nodes are then fractional due to adjusting for the number of missing values for the split variable. This allows the model to keep a running account of where the missing values might have landed in the partitioning.

Another method that can tolerate missing data is Naive Bayes. This method models the class-specific distributions of each predictor separately. Again, if the missing data mechanism is not pathological, then these distribution summaries can use the complete observations for each individual predictor and will avoid case-wise deletion.


  1. This is true provided that the mechanism that causes the missing data are not pathological.