5.3 Approaches for Novel Categories

Suppose that a model is built to predict the probability that an individual works in a STEM profession and that this model depends on geographic location (e.g., city). The model will be able to predict the probability of a STEM profession if a new individual lives in one of the 20 cities in the original data. But what happens to the model prediction when a new individual lives in a city that is not represented in the original data? If the model is based solely on dummy variables, then it will not have seen this information and will not be able to generate a prediction.

If there is a possibility of encountering a new category in the future, one strategy would be to use the previously mentioned “other” category to capture new values. While this approach may not be the most effective at extracting predictive information relative to the response for this specific category, it does enable the original model to be applied to new data without completely refitting it. This approach can also be used with feature hashing, since the new categories are converted into a single “other” category that is hashed in the same manner as the rest of the predictor values; however, we do need to ensure that the “other” category is present in the training/testing data. Alternatively, we could ensure that the “other” category collides with another hashed category so that the model can be used to predict a new sample. More approaches to novel categories are given in the next section.
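The “other” strategy can be illustrated with a minimal sketch. It is not a definitive implementation: the city names, the number of hash buckets, and the helper functions (`collapse_to_other`, `dummy_encode`, `hashed_encode`) are all hypothetical and chosen only for illustration, and a full set of indicators is used for readability. MD5 is used here simply as a convenient deterministic hash.

```python
import hashlib

# Hypothetical levels seen when the model was built (illustrative names only).
TRAINING_CITIES = ["austin", "boston", "chicago"]
LEVELS = TRAINING_CITIES + ["other"]
N_BUCKETS = 8  # assumed number of hashed features


def collapse_to_other(city, known=TRAINING_CITIES):
    """Return the city if it was seen in training, else the pooled 'other' level."""
    return city if city in known else "other"


def dummy_encode(city, levels=LEVELS):
    """Encode a city as 0/1 indicators, one per level (including 'other')."""
    level = collapse_to_other(city)
    return [1 if level == name else 0 for name in levels]


def hashed_encode(city, n_buckets=N_BUCKETS):
    """Hash the (possibly collapsed) level into one of n_buckets indicator columns."""
    level = collapse_to_other(city)
    digest = hashlib.md5(level.encode("utf-8")).hexdigest()
    index = int(digest, 16) % n_buckets
    return [1 if i == index else 0 for i in range(n_buckets)]


print(dummy_encode("boston"))  # [0, 1, 0, 0]
print(dummy_encode("denver"))  # [0, 0, 0, 1]  (novel city falls into 'other')
print(hashed_encode("denver") == hashed_encode("seattle"))  # True
```

Under these assumptions, every novel city produces the same encoding as the “other” level, so a model fit with that level present can still generate a prediction.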

Note that the concept of a predictor with novel categories is not an issue in the training/testing phase of a model. In this phase, the categories for all predictors in the existing data are known at the time of modeling. If a specific predictor category is present in only the training set (or vice versa), a dummy variable can still be created for both data sets, although it will be a zero-variance predictor in one of them.
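As a small illustration of the zero-variance case, the sketch below builds dummy variables for a training and a test set from their combined levels. The data values and the helper names (`dummy_matrix`, `zero_variance_columns`) are hypothetical.

```python
# Hypothetical data: "chicago" appears only in the training set.
train = ["austin", "boston", "chicago", "boston"]
test = ["austin", "boston", "austin"]

# All categories in the existing data are known at modeling time.
levels = sorted(set(train) | set(test))


def dummy_matrix(values, levels):
    """One row per sample, one 0/1 indicator column per level."""
    return [[1 if value == level else 0 for level in levels] for value in values]


def zero_variance_columns(matrix, levels):
    """Return the levels whose indicator column takes a single value."""
    return [
        level
        for j, level in enumerate(levels)
        if len({row[j] for row in matrix}) == 1
    ]


train_dummies = dummy_matrix(train, levels)
test_dummies = dummy_matrix(test, levels)

print(zero_variance_columns(train_dummies, levels))  # []
print(zero_variance_columns(test_dummies, levels))   # ['chicago']
```

Both data sets end up with the same columns, but the indicator for the category seen only in training is constant (all zeros) in the test set.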