8.4 Encoding Missingness

When a predictor is discrete in nature, missingness can be directly encoded into the predictor as if it were a naturally occurring category. This makes sense for structurally missing values such as the example of alleys in the Ames housing data. Here, it is sensible to change the missing values to a category of “no alley.” In other cases, the missing values could simply be encoded as “missing” or “unknown.” For example, Kuhn and Johnson (2013) use a data set where the goal is to predict the acceptance or rejection of grant proposals. One of the categorical predictors was grant sponsor, which took values such as “Australian competitive grants”, “cooperative research centre”, “industry”, etc. In total, there were more than 245 possible values for this predictor, with roughly 10% of the grant applications having an empty sponsor value. To enable the applications that had an empty sponsor to be used in modeling, empty sponsor values were encoded as “unknown”. For many of the models that were investigated, the indicator for an unknown sponsor was one of the most important predictors of grant success. The odds ratio contrasting an unknown sponsor with a known one was greater than 6, meaning that a grant was much more likely to be successfully funded if the sponsor predictor was unknown. Indeed, in the training set the grant success rate associated with an unknown sponsor was 82.2% versus 42.1% for a known sponsor.
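As a rough illustration of the two ideas above, the following sketch recodes empty sponsor values as an explicit “unknown” category and then recomputes the odds ratio from the two training set success rates quoted in the text. The function names and the small `sponsors` list are hypothetical and are not taken from the original analysis; only the two rates (82.2% and 42.1%) come from the text.

```python
def encode_missing(values, label="unknown"):
    """Replace None or empty strings with an explicit category label."""
    return [label if v is None or v == "" else v for v in values]


def odds(p):
    """Convert a probability into odds, p / (1 - p)."""
    return p / (1.0 - p)


def odds_ratio(p_a, p_b):
    """Odds of group A divided by odds of group B."""
    return odds(p_a) / odds(p_b)


# Hypothetical sponsor values; None and "" stand in for empty entries.
sponsors = ["industry", None, "cooperative research centre", "", "industry"]
encoded = encode_missing(sponsors)

# Success rates quoted in the text: unknown sponsor vs. known sponsor.
or_unknown_vs_known = odds_ratio(0.822, 0.421)
print(encoded)
print(round(or_unknown_vs_known, 2))
```

Under these assumptions the odds ratio works out to about 6.35, which agrees with the statement that it was greater than 6. A model-based odds ratio (for example, from a logistic regression with other predictors included) could differ from this unadjusted value.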

Was encoding the missing values a successful strategy? Clearly, the mechanism that led to the missing sponsor label was strongly associated with grant acceptance. Unfortunately, it is impossible to know why this association is so strong. The encoding was valuable in that it revealed that something was going on here. However, it would be troublesome to accept this analysis as final and imply some sort of cause-and-effect relationship⁷². A guiding principle for deciding whether encoding missingness is a good idea is to think about how the results would be interpreted if that piece of information becomes important to the model.

⁷² A perhaps more difficult situation would be explaining to the consumers of the model that “We know that this is important but we don’t know why!”