5 Encoding Categorical Predictors
Categorical or nominal predictors are those that contain qualitative data. For the OkCupid data, examples include education (e.g., high school, two-year college, college, etc.) and diet (e.g., anything, vegan, vegetarian, etc.). In the Ames data, the type of house and neighborhood are predictors that have no numeric scale. The ZIP code, while numeric, also qualifies as a qualitative predictor because its numeric values have no continuous meaning. Simply put, there is no reason that a ZIP code of 21212 is 14,827 "more" than a ZIP code of 06385.
Categorical predictors can also be derived from unstructured or open text. The OkCupid data contain sets of optional essays that describe individuals' interests and other personal information. Predictive models that depend on numeric inputs cannot directly handle open text fields. Instead, these information-rich data need to be processed prior to presenting the information to a model. Text information can be processed in different ways. For example, if keywords are extracted from text to use as predictors, should these include single words or strings of words? Should words be stemmed so that a root word is used (e.g., "comput" being short for "computer", "computers", "computing", "computed", etc.), or should a regular expression pattern-matching approach be used (e.g., ^comput)?
Simple categorical variables can also be classified as ordered or unordered. A variable with values “Bad”, “Good”, and “Better” shows a clear progression of values. While the difference between these categories may not be precisely numerically quantifiable, there is a meaningful ordering. To contrast, consider another variable that takes values of “French”, “Indian”, or “Peruvian”. These categories have no meaningful ordering. Ordered and unordered factors might require different approaches for including the embedded information in a model.
As with other preprocessing steps, the approach to encoding the predictors depends on the type of model. A large majority of models require that all predictors be numeric. There are, however, some exceptions. Algorithms for tree-based models can naturally handle splits on either numeric or categorical predictors. These algorithms employ a series of if/then statements that sequentially split the data into groups. For example, in the Chicago data, the day of the week is a strong predictor, and a tree-based model would likely include a model component such as
if day in {Sun, Sat} then ridership = 4.4K
else ridership = 17.3K
As another example, a naive Bayes model can create a cross-tabulation between a categorical predictor and the outcome class, and this frequency distribution is factored into the model's probability calculations. In this situation, the categories can be processed in their natural format. The final section of this chapter investigates how categorical predictors interact with tree-based models.
Tree-based and naive Bayes models are exceptions; most models require that the predictors take numeric form. This chapter focuses primarily on methods that encode categorical data as numeric values.
Many of the analyses in this chapter use the OkCupid data that were introduced in Section 3.1 and discussed in the previous chapter. In this chapter, we will explore specific variables in more detail. One issue in doing this is related to overfitting. Compared to other data sets discussed in this book, the OkCupid data set contains a large number of potential categorical predictors, many of which have a low prevalence. It would be problematic to take the entire training set, examine specific trends and variables, and then add features to the model based on these analyses. Even though we cross-validate the models on these data, we may overfit and find predictive relationships that are not generalizable to other data. The data set is not small; the training set consists of 38,809 data points. However, for the analyses here, a subsample of 5,000 STEM profiles and 5,000 non-STEM profiles was selected from the training set for the purpose of discovering new predictors for the models.
5.1 Creating Dummy Variables for Unordered Categories
The most basic approach to representing categorical values as numeric data is to create dummy variables. These are artificial numeric variables that capture some aspect of one (or more) of the categorical values. There are many methods for doing this and, to illustrate, let’s first consider a simple example for the day of the week. If we take the seven possible values and convert them into binary dummy variables, the mathematical function required to make the translation is often referred to as a contrast or parameterization function. An example of a contrast function is called the “reference cell” or “treatment” contrast where one of the values of the predictor is left unaccounted for in the resulting dummy variables. Using Sunday as the reference cell, the contrast function would create six dummy variables:
        Mon  Tues  Wed  Thurs  Fri  Sat
Sun      0    0     0     0     0    0
Mon      1    0     0     0     0    0
Tues     0    1     0     0     0    0
Wed      0    0     1     0     0    0
Thurs    0    0     0     1     0    0
Fri      0    0     0     0     1    0
Sat      0    0     0     0     0    1
These six numeric predictors would take the place of the original categorical variable.
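To make the contrast function concrete, here is a minimal Python sketch of the reference-cell encoding; the function name and use of plain lists are our own illustration, not a specific library's API:

```python
DAYS = ["Sun", "Mon", "Tues", "Wed", "Thurs", "Fri", "Sat"]

def treatment_contrast(value, levels, reference=None):
    """Reference-cell ("treatment") contrast: C-1 binary dummy variables,
    with the reference level encoded as all zeros."""
    if reference is None:
        reference = levels[0]
    non_reference = [lvl for lvl in levels if lvl != reference]
    return [int(value == lvl) for lvl in non_reference]

print(treatment_contrast("Sun", DAYS))  # reference cell: [0, 0, 0, 0, 0, 0]
print(treatment_contrast("Mon", DAYS))  # [1, 0, 0, 0, 0, 0]
```

In practice, modeling software generates these columns automatically; the sketch simply shows that the reference level is recoverable as the all-zeros row.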
Why only six? There are two related reasons. First, if the values of the six dummy variables are known, then the seventh can be directly inferred. The second reason is more technical. When fitting linear models, a design matrix \(X\) is created. When the model has an intercept, an additional initial column of ones for all rows is included. Estimating the parameters for a linear model (as well as other similar models) involves inverting the matrix \((X'X)\). If the model includes an intercept and contains dummy variables for all seven days, then the seven day columns would add up (row-wise) to the intercept column, and this linear combination would prevent the matrix inverse from being computed (since the matrix is singular). When this occurs, the design matrix is said to be less than full rank or overdetermined. When there are \(C\) possible values of the predictor and only \(C-1\) dummy variables are used, the matrix inverse can be computed and the contrast method is said to be a full-rank parameterization (Haase 2011).
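The linear dependence described above is easy to verify directly; a small sketch (the day labels follow the Chicago example):

```python
# With an intercept plus all seven day-of-week dummy variables, the dummy
# columns sum row-wise to the intercept column, so X is rank deficient.
days = ["Sun", "Mon", "Tues", "Wed", "Thurs", "Fri", "Sat"]
X = [[1] + [int(day == d) for d in days] for day in days]

for row in X:
    intercept, dummies = row[0], row[1:]
    assert sum(dummies) == intercept  # exact linear combination -> singular X'X

print("all", len(X), "rows satisfy the dependency")
```

Dropping any one dummy column breaks this dependency, which is why the full-rank parameterization uses \(C-1\) columns.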
How would we interpret these dummy variables? That depends on the type of model being used, but consider a linear model for the Chicago transit data that uses only the day of the week with the reference cell parameterization above. Using the training set to fit the model, the intercept estimates the mean of the reference cell, which is the average number of Sunday riders in the training set, estimated to be 3.84K people. The second model parameter, for Monday, is estimated to be 12.61K. In the reference cell model, the dummy variables represent the mean value above and beyond the reference cell mean. In this case, the estimate indicates that there were 12.61K more riders on Monday than on Sunday. The overall estimate of Monday ridership adds the estimates from the intercept and the dummy variable (16.45K rides).
When there is more than one categorical predictor, the reference cell becomes multidimensional. Suppose there were a weather predictor with only a few values: "clear", "cloudy", "rain", and "snow", with "clear" as the reference cell. If this variable were included in the model, the intercept would correspond to the mean of the Sundays with a clear sky. The interpretation of each set of dummy variables does not change. The average ridership for a cloudy Monday would augment the average clear-Sunday ridership with the average incremental effect of cloudy and the average incremental effect of Monday.
There are other contrast functions for dummy variables. The “cell means” parameterization (Timm and Carlson 1975) would create a dummy variable for each day of the week and would not include an intercept to avoid the problem of singularity of the design matrix. In this case, the estimates for each dummy variable would correspond to the average value for that category. For the Chicago data, the parameter estimate for the Monday dummy variable would simply be 16.45K.
There are several other contrast methods for constructing dummy variables. Another, based on polynomial functions, is discussed below for ordinal data.
5.2 Encoding Predictors with Many Categories
If there are \(C\) categories, what happens when \(C\) becomes very large? For example, ZIP code in the United States may be an important predictor for outcomes that are affected by a geographic component. There are more than 40K possible ZIP codes and, depending on how the data are collected, this might produce an overabundance of dummy variables (relative to the number of data points). As mentioned in the previous section, this can cause the data matrix to be overdetermined and restrict the use of certain models. Also, ZIP codes in highly populated areas may have a higher rate of occurrence in the data, leading to a “long tail” of locations that are infrequently observed.
One potential issue is that resampling might exclude some of the rarer categories from the analysis set. This would lead to dummy variable columns in the data that contain all zeros and, for many models, this would become a numerical issue that causes an error. Moreover, the model will not be able to provide a relevant prediction for new samples that contain this predictor value. When a predictor contains a single value, we call this a zero-variance predictor because there truly is no variation displayed by the predictor.
The first way to handle this issue is to create the full set of dummy variables and simply remove the zero-variance predictors. This is a simple and effective approach, but it may be difficult to know a priori which terms will be in the model. In other words, during resampling, there may be a different number of model parameters across resamples. This can be a good side effect since it captures the variance caused by omitting rarely occurring values and propagates this noise into the resampling estimates of performance.
Prior to producing dummy variables, one might consider determining which (if any) of these variables are near-zero variance predictors, or have the potential to have near-zero variance during the resampling process. These are predictors that have few unique values (such as the two values of binary dummy variables) and occur infrequently in the data. For the training set, we would consider the ratio of the frequency of the most commonly occurring value to that of the second most commonly occurring value. For dummy variables, this is simply the ratio of the numbers of ones and zeros. Suppose that a dummy variable has 990 values of zero and 10 ones. The ratio of these frequencies, 99, indicates that it could easily be converted into all zeros during resampling. We suggest a rough cutoff of 19 to declare such a variable "too rare", although the context may raise or lower this bar for different problems.
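The frequency-ratio check is straightforward to compute; a sketch (the cutoff of 19 comes from the text, while the function name is ours):

```python
from collections import Counter

def freq_ratio(values):
    """Ratio of the most common value's count to the second most common."""
    top = Counter(values).most_common(2)
    if len(top) < 2:
        return float("inf")  # a single unique value: a zero-variance predictor
    return top[0][1] / top[1][1]

dummy = [0] * 990 + [1] * 10
print(freq_ratio(dummy))  # 99.0 -- well above a cutoff of 19, so "too rare"
```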
Although near-zero variance predictors likely contain little valuable predictive information, we may not desire to filter these out. One way to avoid filtering is to redefine the predictor's categories prior to creating dummy variables. Instead, we can create an "other" category that pools the rarely occurring categories, assuming that such a pooling is sensible. A cutoff based on the training set frequency can be specified to indicate which categories should be combined. Again, this should occur during resampling so that the final estimates reflect the variability in which predictor values are included in the "other" group.
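Pooling rare categories can be sketched as follows; the 5% cutoff and the city counts are made up purely for illustration:

```python
from collections import Counter

def pool_rare(values, min_freq=0.05, other="other"):
    """Relabel categories whose training-set frequency falls below a cutoff."""
    counts = Counter(values)
    n = len(values)
    keep = {k for k, c in counts.items() if c / n >= min_freq}
    return [v if v in keep else other for v in values]

cities = ["berkeley"] * 60 + ["san mateo"] * 35 + ["martinez"] * 4 + ["benicia"]
print(sorted(set(pool_rare(cities))))  # ['berkeley', 'other', 'san mateo']
```

Note that the cutoff would be estimated from the analysis set inside each resample, not fixed once on the full training set.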
Another way to combine categories is to use a hashing function (or hashes). Hashes are used to map one set of values to another set of values and are typically used in databases and cryptography (Preneel 2010). The original values are called the keys and are typically mapped to a smaller set of artificial hash values. In our context, a potentially large number of predictor categories are the keys, and we would like to represent them using a smaller number of categories (i.e., the hashes). The number of possible hashes is set by the user and, for numerical purposes, is a power of 2. Some computationally interesting aspects of hash functions are:

- The only data required are the value being hashed and the resulting number of hashes. This is different from the previously described contrast function using a reference cell.
- The translation process is completely deterministic.
- In traditional applications, it is important that the number of keys mapped to each hash is relatively uniform. This is important when hashes are used in cryptography since it increases the difficulty of guessing the result. As discussed below, this might not be a good characteristic for data analysis.
- Hash functions are unidirectional; once the hash values are created, there is no way of knowing the original values. If there is a known and finite set of original values, a table can be created to do the translation but, otherwise, the keys are indeterminable when only the hash value is known.
- There is no free lunch when using this procedure; some of the original categories will be mapped to the same hash value (called a "collision"). The number of collisions will be largely determined by the number of features that are produced.
When using hashes to create dummy variables, the procedure is called “feature hashing” or the “hash trick” (Weinberger et al. 2009). Since there are many different hashing functions, there are different methods for producing the table that maps the original set to the reduced set of hashes.
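To make the mapping concrete, here is a minimal Python sketch of the hash trick. MD5 is used here only as a convenient, stable integer hash; actual feature-hashing implementations typically use faster functions such as MurmurHash:

```python
import hashlib

def hash_feature(key, num_features=16):
    """Map a category string to one of `num_features` dummy variable columns,
    using the (integer mod num_features) + 1 convention from the text."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    integer = int.from_bytes(digest[:4], "little", signed=True)
    return (integer % num_features) + 1  # Python's % is non-negative here

for city in ["berkeley", "martinez", "mountain view"]:
    print(city, "-> column", hash_feature(city))
```

Because the mapping depends only on the key and the number of features, any string, including one never seen in training, lands in a valid column; collisions occur whenever two keys share a remainder.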
As a simple example, the locations from the OkCupid data were hashed^{40}. Table 5.1 shows 7 of these cities to illustrate the details.
Location             Hash Integer Value  16 cols  256 cols
belvedere tiburon             582753783        8       248
berkeley                     1166288024        9       153
martinez                     -157684639        2        98
mountain view               -1267876914       15       207
san leandro                  1219729949       14        30
san mateo                     986716290        3       131
south san francisco          -373608504        9       201
The hash value is an integer derived solely from each city's text value. In order to convert this to a new feature in the data set, modular arithmetic is used. Suppose that 16 features are desired. To create a mapping from the integer to one of the 16 feature columns, the remainder after integer division is used via the formula \(column = (integer \bmod 16) + 1\), with the remainder taken to be non-negative. For example, the string "mountain view" hashes to \(-1267876914\), and \((-1267876914 \bmod 16) + 1 = 15\), so this city is mapped to the fifteenth dummy variable feature. The last column above shows the assignments for the cities when 256 features are requested; \(\bmod\ 256\) has more possible remainders than \(\bmod\ 16\), so more features are possible. With 16 features, the dummy variable assignments for a selection of locations in the data are:
                      1   2   3   4   5   6   9  12  13  14  15  16
alameda               1   0   0   0   1   0   0   0   0   0   0   0
belmont               1   0   0   0   0   0   0   0   0   0   1   0
benicia               1   0   0   1   0   0   0   0   0   0   0   0
berkeley              1   0   0   0   0   0   1   0   0   0   0   0
castro valley         1   0   0   0   0   0   0   1   0   0   0   0
daly city             1   0   0   0   0   0   0   0   0   0   1   0
emeryville            1   0   0   0   0   0   0   0   1   0   0   0
fairfax               1   1   0   0   0   0   0   0   0   0   0   0
martinez              1   1   0   0   0   0   0   0   0   0   0   0
menlo park            1   0   0   0   0   0   0   0   0   1   0   0
mountain view         1   0   0   0   0   0   0   0   0   0   1   0
oakland               1   0   0   0   0   0   0   1   0   0   0   0
other                 1   0   0   0   0   0   1   0   0   0   0   0
palo alto             1   1   0   0   0   0   0   0   0   0   0   0
san francisco         1   0   0   0   0   1   0   0   0   0   0   0
san leandro           1   0   0   0   0   0   0   0   0   1   0   0
san mateo             1   0   1   0   0   0   0   0   0   0   0   0
san rafael            1   0   0   0   0   0   0   0   0   0   0   1
south san francisco   1   0   0   0   0   0   1   0   0   0   0   0
walnut creek          1   1   0   0   0   0   0   0   0   0   0   0
There are several characteristics of this table to notice. First, there are 4 hashes (columns) that are not shown in this table since they contain all zeros. This occurs because none of the hash integer values had those specific mod 16 remainders. This could be a problem for predictors that might take string values not seen in the training data, since such values could be assigned to one of these empty columns. In this case, it would be wise to create an "other" category to ensure that new strings have a hashed representation. Next, notice that there were six hashes with no collisions (i.e., a single 1 in their column). But several columns exhibit collisions. An example of a collision is hash number 14, which encodes both Menlo Park and San Leandro. In statistical terms, these two categories are said to be aliased or confounded, meaning that a parameter estimated from this hash cannot isolate the effect of either city. Alias structures have long been studied in the field of statistical experimental design, although usually in the context of aliasing between variables rather than within variables (Box, Hunter, and Hunter 2005). Regardless, from a statistical perspective, minimizing the amount and degree of aliasing is an important aspect of a contrast scheme.
Some hash functions can also be signed, meaning that instead of producing a binary indicator for the new feature, the possible values are \(-1\), 0, or \(+1\). A zero would indicate that the category is not associated with that specific feature and the \(\pm 1\) values can indicate different values of the original data. Suppose that two categories map to the same hash value. A binary hash would result in a collision, but a signed hash could avoid this by encoding one category as \(+1\) and the other as \(-1\). Here are the same data as above with a signed hash of the same size:
                      1   2   3   4   5   6   9  12  13  14  15  16
alameda               1   0   0   0   1   0   0   0   0   0   0   0
belmont               1   0   0   0   0   0   0   0   0   0   1   0
benicia               1   0   0   1   0   0   0   0   0   0   0   0
berkeley              1   0   0   0   0   0   1   0   0   0   0   0
castro valley         1   0   0   0   0   0   0   1   0   0   0   0
daly city             1   0   0   0   0   0   0   0   0   0   1   0
emeryville            1   0   0   0   0   0   0   0   1   0   0   0
fairfax               1   1   0   0   0   0   0   0   0   0   0   0
martinez              1   1   0   0   0   0   0   0   0   0   0   0
menlo park            1   0   0   0   0   0   0   0   0   1   0   0
mountain view         1   0   0   0   0   0   0   0   0   0   1   0
oakland               1   0   0   0   0   0   0   1   0   0   0   0
other                 1   0   0   0   0   0   1   0   0   0   0   0
palo alto             1   1   0   0   0   0   0   0   0   0   0   0
san francisco         1   0   0   0   0   1   0   0   0   0   0   0
san leandro           1   0   0   0   0   0   0   0   0  -1   0   0
san mateo             1   0   1   0   0   0   0   0   0   0   0   0
san rafael            1   0   0   0   0   0   0   0   0   0   0   1
south san francisco   1   0   0   0   0   0   1   0   0   0   0   0
walnut creek          1   1   0   0   0   0   0   0   0   0   0   0
Notice that the collision in hash 14 is now gone since a value of \(+1\) corresponds to Menlo Park and San Leandro is encoded as \(-1\). For a linear regression model, the coefficient for the 14th hash now represents the positive or negative shift in the response due to Menlo Park or San Leandro, respectively.
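A signed hash can be sketched by drawing the sign from a second, independent piece of the digest; the scheme below is illustrative and not any particular library's exact implementation:

```python
import hashlib

def signed_hash_feature(key, num_features=16):
    """Return (column, sign): one part of the digest picks the column,
    another picks +1 or -1, so two keys colliding on a column may still
    be distinguished by opposite signs."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    column = (int.from_bytes(digest[:4], "little") % num_features) + 1
    sign = 1 if digest[4] % 2 == 0 else -1
    return column, sign

col, sign = signed_hash_feature("san leandro")
print("san leandro -> column", col, "sign", sign)
```

Because the sign is itself hashed, two colliding keys receive opposite signs only about half the time; the benefit is probabilistic rather than guaranteed.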
What level of aliasing should we expect? To investigate, data were simulated so that there were \(10^4\) predictor categories, each consisting of 20 unique, randomly generated character values. Binary and signed hashes of various sizes were generated and the collision rate was determined. This simulation was repeated 10 times and the average percent collision was calculated. Figure 5.1 shows the results, where the y-axis is the rate of any level of collision in the features. As expected, the signed values are associated with fewer collisions than the binary encoding. The green line corresponds to the full set of dummy variables (without using hashes). For these data, using signed features would give the best aliasing results, but the percentage of collisions is still fairly high.
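The simulation can be sketched as follows; Python's built-in string hash stands in for a real hashing function, and the parameters are smaller than the text's for speed:

```python
import random
import string

def collision_rate(num_categories, num_features, seed=0):
    """Fraction of keys that share a feature column with at least one other key."""
    rng = random.Random(seed)
    keys = set()
    while len(keys) < num_categories:
        keys.add("".join(rng.choices(string.ascii_lowercase, k=20)))
    columns = {}
    for key in keys:
        columns.setdefault(hash(key) % num_features, []).append(key)
    collided = sum(len(v) for v in columns.values() if len(v) > 1)
    return collided / num_categories

print(round(collision_rate(1000, 256), 2))      # nearly all keys collide
print(round(collision_rate(1000, 2 ** 20), 2))  # far fewer collisions
```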
Because of how the new features are created, feature hashing is oblivious to the concept of predictor aliasing. Uniformity of the hash function is an important characteristic when hashes are used in their typical applications. However, it may be a liability in data analysis. The reduction of aliasing is an important concept in statistics; we would like to have the most specific representations of categories as often as possible for a few different reasons:
- Less aliasing allows for better interpretation of the results. If a predictor has a significant impact on the model, it is probably a good idea to understand why. When multiple categories are aliased due to a collision, untangling the true effect may be very difficult, although the collision structure may at least help narrow down which categories are affecting the response.
- Categories involved in collisions are not related in any meaningful way. For example, San Leandro and Menlo Park were not aliased because they share a particular similarity or geographic closeness. Because of the arbitrary nature of the collisions, it is possible to have different categories whose true underlying effects are counter to one another. This might have the effect of negating the impact of the hashed feature.
- Hashing functions have no notion of the probability that each key will occur. As such, it is conceivable that a category that occurs with great frequency is aliased with one that is rare. In this case, the more abundant value will have a much larger influence on the effect of that hashing feature.
Although we are not aware of any statistically conscious competitor to feature hashing, there are some characteristics that a contrast function should have when used with a predictor containing a large number of categories^{41}. For example, if there is the capacity to leave some categories unaliased with any others, the most frequently occurring values should be chosen for these roles. The reason for this suggestion is that it would generate a high degree of specificity for the categories that are most likely to be exposed to the contrast function. One side effect of this strategy would be to alias less frequently occurring values to one another. This might be beneficial since it reduces the sparsity of those dummy variables. For example, if two categories that each account for 5% of the training set values are combined, the resulting dummy variable has a twofold increase in the number of ones. This would reduce the chances of being filtered out during resampling and might also lessen the potential effect that such a near-zero variance predictor might have on the analysis.
5.3 Approaches for Novel Categories
Suppose that we have built a model to predict the probability that an individual works in a STEM profession and that this model depends on geographic location (e.g., city). The model will be able to predict the probability of a STEM profession if a new individual lives in one of the 20 cities. But what happens to the model prediction when a new individual lives in a city that is not represented in the original data? If the model is solely based on dummy variables, then it will not have seen this information and will not be able to generate a prediction.
If there is a possibility of encountering a new category in the future, one strategy would be to use the previously mentioned “other” category to capture new values. While this approach may not be the most effective at extracting predictive information relative to the response for this specific category, it does enable the original model to be used without completely refitting it. This approach can also be used with feature hashing since the new categories are converted into a unique “other” category that is hashed in the same manner as the rest of the predictor values; however, we do need to ensure that the “other” category is present in the training/testing data. Alternatively we could ensure that the “other” category collides with another hashed category so that the model can be used to predict a new sample. More approaches to novel categories are given in the next section.
Note that the concept of a predictor with novel categories is not an issue in the training/testing phase of a model. In this phase the categories for all predictors in the existing data are known at the time of modeling. If a specific predictor category is present in only the training set (or vice versa), a dummy variable can still be created for both data sets, although it will be a zero-variance predictor in one of them.
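One concrete way to guard against novel values is to fold them into the "other" level at prediction time; the names below are illustrative:

```python
def encode_with_other(value, known_levels):
    """Dummy-encode a value against the training-set levels plus an explicit
    'other' level, so unseen categories still get a valid representation."""
    level = value if value in known_levels else "other"
    return {lvl: int(lvl == level) for lvl in list(known_levels) + ["other"]}

known = ["berkeley", "martinez", "san mateo"]
print(encode_with_other("berkeley", known)["berkeley"])  # 1
print(encode_with_other("tokyo", known)["other"])        # 1: novel city
```

For this to work, at least some training samples must be mapped to "other" so the fitted model has an estimate for that column.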
5.4 Supervised Encoding Methods
There are several methods of encoding categorical predictors to numeric columns using the outcome data as a guide (so that they are supervised methods). These techniques are well suited to cases where the predictor has many possible values or when new levels appear after model training.
The first method is a simple translation that is sometimes called effect or likelihood encoding (Micci-Barreca 2001; Zumel and Mount 2016). In essence, the effect of the factor level on the outcome is measured and this effect is used as the numeric encoding. For example, for the Ames housing data, we might calculate the mean or median sale price of a house for each neighborhood from the training data and use this statistic to represent the factor level in the model. There are a variety of ways to estimate the effects, and these will be discussed below.
For classification problems, a simple logistic regression model can be used to measure the effect between the categorical outcome and the categorical predictor. If the outcome event occurs with rate \(p\), the odds of that event are defined as \(p/(1-p)\). As an example, with the OkC data, the rate of STEM profiles in Mountain View, California is 0.51, so the odds would be about 1.03. Logistic regression models the log-odds of the outcome as a function of the predictors. If we only include a single categorical predictor in the model, we can estimate the log-odds for each predictor value and use this as the encoding. Using the same locations previously shown in Table 5.1, the simple effect encodings for the OkC data are shown in Table 5.2 under the heading of "Raw".
Location             Rate      n     Raw  Shrunk  Feat 1  Feat 2  Feat 3
belvedere tiburon    0.081    37  -2.428  -2.107   0.007   0.050   0.000
berkeley             0.164  2729  -1.630  -1.635   0.122   0.051   0.052
martinez             0.074   204  -2.534  -2.393   0.010   0.092   0.049
mountain view        0.508   264   0.030  -0.065   0.080   0.325   0.002
san leandro          0.131   427  -1.891  -1.885   0.056   0.048   0.014
san mateo            0.292   861  -0.888  -0.907   0.143   0.285   0.015
south san francisco  0.181   254  -1.509  -1.536   0.006   0.036   0.034
<new location>                           -1.820    0.050   0.039   0.018
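The "Raw" column is just the within-level log-odds; a sketch follows, where the 0.5 continuity correction is our own simple guard against infinite values rather than the estimation method used for the table:

```python
import math

def raw_log_odds(categories, outcomes, eps=0.5):
    """Effect (likelihood) encoding: log-odds of the event for each level."""
    stats = {}
    for cat, y in zip(categories, outcomes):
        events, total = stats.get(cat, (0, 0))
        stats[cat] = (events + y, total + 1)
    return {cat: math.log((events + eps) / (total - events + eps))
            for cat, (events, total) in stats.items()}

# 134 STEM profiles out of 264 for one location, as an illustration
cats = ["mountain view"] * 264
ys = [1] * 134 + [0] * 130
print(round(raw_log_odds(cats, ys)["mountain view"], 2))  # about 0.03
```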
As previously mentioned, there are different methods for estimating the effects. A single generalized linear model (e.g., linear or logistic regression) can be used. While very fast, this approach has drawbacks. For example, what happens when a factor level contains only a single outcome class? In theory, the log-odds should be infinite in the appropriate direction but, numerically, it is usually capped at a large (and inaccurate) value.
One way around this issue is to use some type of shrinkage method. For example, the overall log-odds can be determined and, if the quality of the data within a factor level is poor, that level's effect estimate can be biased towards an overall estimate that disregards the levels of the predictor. "Poor quality" could be due to a small sample size or, for numeric outcomes, a large variance within the data for that level. Shrinkage methods can also move extreme estimates towards the middle of the distribution. For example, a well-replicated factor level might still have an extreme mean or log-odds (as will be seen for the Mountain View data below).
A common method for shrinking parameter estimates is Bayesian analysis (McElreath 2015). In this case, before seeing the data, we can use expert judgment to specify a prior distribution for the estimates (e.g., the means or log-odds). The prior is a theoretical distribution that represents the overall distribution of effects. Almost any distribution can be used. If there is not a strong belief as to what this distribution should be, it may be enough to focus on its shape. If a bell-shaped distribution seems reasonable but little more is known, a very diffuse or wide prior can be used so that it is not overly opinionated. Bayesian methods take the observed data and blend them with the prior distribution to produce a posterior distribution that is a combination of the two. For categorical predictor values with poor data, the posterior estimate is shrunken closer to the center of the prior distribution. This can also occur when the raw estimates are relatively extreme^{42}.
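A full Bayesian fit is beyond a short example, but the qualitative behavior can be sketched with a simple precision-weighted blend toward an overall estimate; the weighting constant below is arbitrary and purely illustrative, a stand-in for the posterior computation:

```python
import math

def shrunken_log_odds(events, total, overall_log_odds, prior_weight=50):
    """Blend a level's raw log-odds with an overall estimate; levels with
    little data are pulled harder toward the overall value."""
    raw = math.log((events + 0.5) / (total - events + 0.5))
    w = total / (total + prior_weight)
    return w * raw + (1 - w) * overall_log_odds

overall = -1.8  # a rough overall log-odds, for illustration
small = shrunken_log_odds(3, 37, overall)      # sparse level: strong pull
large = shrunken_log_odds(447, 2729, overall)  # well-replicated: little pull
print(round(small, 2), round(large, 2))
```

The sparse level moves noticeably toward the overall value, while the well-replicated level barely changes, mirroring the behavior of the posterior estimates in Table 5.2.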
For the OkC data, a normal prior distribution with a large standard deviation (\(\sigma = 10\)) was used for the log-odds.
Table 5.2 shows the results. For most locations, the raw and shrunken estimates were very similar. For Belvedere/Tiburon, there was a relatively small sample size (and a relatively small STEM rate), and the estimate was shrunken towards the center of the prior. Note that there is a row for "<new location>". When a new location is given to the Bayesian model, the procedure used here estimates its effect as the most likely value using the mean of the posterior distribution (a value of \(-1.82\) for these data). A plot of the raw and shrunken effect estimates is also shown in Figure 5.2(a). Panel (b) shows the magnitude of the estimates (x-axis) versus the difference between the two. As the raw effect estimates dip below \(-2\), the shrinkage pulled the values towards the center of the estimates. For example, the location with the largest log-odds value (Mountain View) was reduced from 0.030 to \(-0.065\).
Empirical Bayes methods can also be used, in the form of linear (and generalized linear) mixed models (West and Galecki 2014). These are non-Bayesian estimation methods that also incorporate shrinkage in their estimation procedures. While they tend to be faster than the types of Bayesian models used here, they offer less flexibility than a fully Bayesian approach.
One issue with effect encoding, independent of the estimation method, is that it increases the possibility of overfitting. We are taking the estimated effects from one model and putting them in another model (as variables). If these two models are based on the same data, it is somewhat of a self-fulfilling prophecy. If these encodings are not consistent with future data, overfitting will result that the model cannot detect, since it is not exposed to any other data that might contradict the findings. Also, the use of summary statistics as predictors can drastically underestimate the variation in the data and might give a falsely optimistic opinion of the utility of the new encoding column^{43}. As such, it is strongly recommended that either different data sets be used to estimate the encodings and the predictive model, or that their derivation be conducted inside resampling so that the assessment set can measure the overfitting (if it exists).
Another supervised approach comes from the deep learning literature on the analysis of textual data. In their case, large amounts of text can be cut up into individual words. Rather than making each of these words into its own indicator variable, word embedding or entity embedding approaches have been developed. Similar to the dimension reduction methods described in the next chapter, the idea is to estimate a smaller set of numeric features that can be used to adequately represent the categorical predictors. Guo and Berkhahn (2016) and Chollet and Allaire (2018) describe this technique in more detail. In addition to the dimension reduction, there is the possibility that these methods can estimate semantic relationships between words so that words with similar themes (e.g. “dog”, “pet”, etc.) have similar values in the new encodings. This technique is not limited to text data and can be used to encode any type of qualitative variable.
Once the number of new features is specified, the model takes the traditional indicator variables and randomly assigns them to one of the new features. The model then tries to optimize both the allocation of the indicators to features as well as the parameter coefficients for the features themselves. The outcome in the model can be the same as the predictive model (e.g., sale price or the probability of a STEM profile). Any type of loss function can be optimized, but it is common to use the root mean squared error for numeric outcomes and cross-entropy for categorical outcomes (which are the loss functions for linear and logistic regression, respectively). Once the model is fit, the values of the embedding features are saved for each observed value of the qualitative factor. These values serve as a look-up table that is used for prediction. Additionally, an extra level can be allocated to the original predictor to serve as a placeholder for any new values for the predictor that are encountered after model training.
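The end product of the embedding fit is essentially a look-up table; a sketch of that data structure follows. The vectors here are randomly initialized purely to show the shape, whereas a real model learns them jointly with the outcome:

```python
import random

def init_embedding_table(levels, dim=3, seed=42):
    """Map each level, plus a placeholder for novel values, to a small
    dense vector. After training, these vectors replace the indicators."""
    rng = random.Random(seed)
    all_levels = list(levels) + ["<new location>"]
    return {lvl: [rng.gauss(0.0, 0.1) for _ in range(dim)] for lvl in all_levels}

table = init_embedding_table(["berkeley", "martinez", "san mateo"])

def embed(city):
    # Unseen cities fall back to the placeholder row of the look-up table.
    return table.get(city, table["<new location>"])

print(len(embed("berkeley")), len(embed("tokyo")))  # 3 3
```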
A more typical neural network structure can be used that places one or more sets of nonlinear variables in between the predictors and the outcome (i.e. the hidden layers). This allows the model to make more complex representations of the underlying patterns. While a complete description of neural network models is outside the scope of this book, Chollet and Allaire (2018) is a very accessible guide to fitting these models, and specialized software exists to create the encodings.
For the OkCupid data, there are 52 locations in the training set. We can try to represent these locations, plus a slot for potential new locations, using a set of embedding features. A schematic for the model is:
Here, each green node represents the part of the network that is dedicated to reducing the dimensionality of the location data to a smaller set. In this diagram, three embedding features are used. The model can also include a set of other predictors unrelated to the embedding components. In this diagram, an additional set of indicator variables, derived later in Section 5.6, is shown in orange. This allows the embeddings to be estimated in the presence of other potentially important predictors so that they can be adjusted accordingly. Each line represents a slope coefficient for the model. The connections between the indicator variable layer on the left and the embedding layer, represented by the green circular nodes, are the values that will be used to represent the original locations. The activation function connecting the embeddings and other predictors to the hidden layer used ReLU connections (introduced in Section 6.2.1). This allows the model additional complexity via nonlinear relationships. Here, ten hidden nodes were used during model estimation. In total, the network model estimated 678 parameters in the course of deriving the embedding features. Cross-entropy was used to fit the model and it was trained for 30 iterations, where it was found to converge. These data are not especially complex; there are few unique values, and each data point carries very little textual information since it belongs to a single location (unlike the case where text is cut into individual words).
Table 5.2 shows the results for several locations and Figure 5.3 contains the relationship between each of the resulting encodings (i.e. features) and the raw log-odds. Each of the features has a relationship with the outcome, with rank correlations of 0.46, 0.76, and 0.18. Some of the new features are correlated with the others; absolute correlation values range between 0 and 0.76. This raises the possibility that, for these data, fewer features might be required. Finally, as a reminder, while an entire supervised neural network was used here, the embedding features can be used in other models for the purpose of preprocessing the location values.
5.5 Encodings for Ordered Data
As we saw in the previous sections, an unordered predictor with \(C\) categories can be represented by \(C-1\) binary dummy variables or a hashed version of binary dummy variables. These methods effectively present the categorical information to the models. But now suppose that the \(C\) categories have a relative ordering. For example, consider a predictor that has the categories "low", "medium", and "high." We could convert this information to two binary predictors: \(X_{low} = 1\) for samples categorized as "low" (and 0 otherwise), and \(X_{medium} = 1\) for samples categorized as "medium" (and 0 otherwise). These new predictors would accurately identify low, medium, and high samples, but they would miss the information contained in the relative ordering, which could be very important relative to the response.
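The two binary predictors just described can be sketched in a few lines; this is a plain Python illustration (the book's code is in R), with "high" as the reference level encoded as all zeros.

```python
# Expand an ordered factor into the two indicators X_low and X_medium;
# "high" is the implied reference level (all indicators zero).
def dummy_encode(values, non_reference_levels):
    return [[int(v == lvl) for lvl in non_reference_levels] for v in values]

samples = ["low", "high", "medium", "low"]
print(dummy_encode(samples, ["low", "medium"]))
```

Note that this encoding treats the three categories as unordered; the ordering information is exactly what the polynomial contrasts below are designed to recover.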
Ordered categorical predictors need a different way to be presented to a model to uncover the relationship of the order to the response. Ordered categories may have a linear relationship with the response. For instance, we may see an increase of approximately 5 units in the response as we move from low to medium, and an increase of approximately 5 units as we move from medium to high. To allow a model to uncover this relationship, we must present the model with a numeric encoding of the ordered categories that represents a linear ordering. In statistics, this type of encoding is referred to as a polynomial contrast. A contrast has the characteristic that it is a single comparison (i.e. one degree of freedom) and its coefficients sum to zero. For the "low", "medium", "high" example above, the contrast to uncover a linear trend would be -0.71, 0, 0.71, where low samples encode to -0.71, medium samples to 0, and high samples to 0.71. Polynomial contrasts can extend to nonlinear shapes, too. If the relationship between the predictor and the response were best described by a quadratic trend, then the contrast would be 0.41, -0.82, and 0.41 (Table 5.3). Conveniently, these types of contrasts can be generated for predictors with any number of ordered levels, but the complexity of the contrast is constrained to one less than the number of categories in the original predictor. For instance, we could not explore a cubic relationship between a predictor and the response with a predictor that has only 3 categories.
By employing polynomial contrasts, we can investigate multiple relationships (linear, quadratic, etc.) simultaneously by including them in the same model. When we do this, the form of the numeric representation is important. Specifically, we desire the new predictors to contain unique information. To achieve this, the numeric representations are required to be orthogonal, meaning that the dot product of any two contrast vectors is 0.
|        | Linear | Quadratic |
|--------|--------|-----------|
| low    | -0.71  | 0.41      |
| medium | 0.00   | -0.82     |
| high   | 0.71   | 0.41      |
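The contrasts in Table 5.3 can be derived by orthogonalizing the powers of the level index; the sketch below does this in Python with a QR decomposition (mimicking R's `contr.poly()`; column signs may differ from the table, which does not affect a model fit).

```python
import numpy as np

# Derive orthogonal polynomial contrasts for a 3-level ordered factor.
levels = np.arange(1, 4)                        # low=1, medium=2, high=3
X = np.vander(levels, N=3, increasing=True)     # columns: 1, x, x^2
Q, _ = np.linalg.qr(X)                          # orthonormalize columns
contrasts = Q[:, 1:]                            # drop the intercept column

# Each column sums to zero and the two columns are orthogonal.
print(np.round(contrasts, 2))
```

The first column recovers the linear contrast (±0.71, 0, ∓0.71) and the second the quadratic contrast (0.41, -0.82, 0.41), matching Table 5.3 up to sign.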
It is important to recognize that patterns described by polynomial contrasts may not effectively relate a predictor to the response. For example, in some cases one might expect a trend where "low" and "medium" samples have a roughly equivalent response but "high" samples have a much different response. In this case, polynomial contrasts are unlikely to be effective at modeling this trend. Another downside to polynomial contrasts for ordered categories occurs when there is a moderate to high number of categories. If an ordered predictor has \(C\) levels, the encoding into dummy variables uses polynomials up to degree \(C-1\). It is very unlikely that these higher-degree polynomials are modeling important trends (e.g. octic patterns), and it might make sense to place a limit on the polynomial degree. In practice, we rarely explore the effectiveness of anything more than a quadratic polynomial.
As an alternative to polynomial contrasts, one could:
Treat the predictors as unordered factors. This would allow for patterns that are not covered by the polynomial feature set. Of course, if the true underlying pattern is linear or quadratic, unordered dummy variables will make the model’s work more difficult.
Translate the ordered categories into a single set of numeric scores based on context-specific information. For example, when discussing failure modes of a piece of computer hardware, experts would be able to rank the severity of a type of failure on an integer scale. A minor failure might be scored as a "1" while a catastrophic failure mode could be given a score of "10", and so on.
Simple visualizations and context-specific expertise can be used to understand whether either of these approaches is a good idea.
5.6 Creating Features from Text Data
Often, data contain textual fields that are gathered from questionnaires, articles, reviews, tweets, and other sources. For example, the OkCupid data contains the responses to nine open text questions, such as "my self-summary" and "six things I could never do without". The open responses are likely to contain important information relative to the outcome. Individuals in a STEM field are likely to have a different vocabulary than, say, someone in the humanities. Therefore, the words or phrases used in these open text fields could be very important to predicting the outcome. This kind of data is qualitative and requires more effort to put into a form that models can consume. How then can these data be explored and represented for a model?
For example, some profile text answers contained links to external websites. A reasonable hypothesis is that the presence of a hyperlink could be related to the person's profession. Table 5.4 shows the results for the random subset of profiles. The rate of hyperlinks in the STEM profiles was 21% while 11.6% of the non-STEM profiles contained at least one link. One way to evaluate a difference in two proportions is the odds-ratio (Agresti 2012). First, the odds of an event that occurs with rate \(p\) is defined as \(p/(1-p)\). For the STEM profiles, the odds of containing a hyperlink are relatively small, with a value of \(0.21/0.79 = 0.266\). For the non-STEM profiles, it is even smaller (0.132). The ratio of these two quantities can be used to understand the effect of having a hyperlink between the two professions. In this case, the odds of a profile being STEM are nearly 2.0-fold higher when the profile contains a link. Basic statistics can be used to assign a lower 95% confidence bound on this quantity. For these data the lower bound is 1.8, which indicates that this increase in the odds is unlikely to be due to random noise since the interval does not include a value of 1.0. Given these results, the indicator of a hyperlink will likely benefit a model and should be included.
|         | stem | other |
|---------|------|-------|
| Link    | 1051 | 582   |
| No Link | 3949 | 4418  |
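The odds-ratio and its lower confidence bound can be computed directly from the counts in Table 5.4; the Python sketch below uses the standard large-sample normal approximation on the log scale (a one-sided 95% bound, hence the 1.645 multiplier).

```python
import math

# Counts from Table 5.4. Rows: link / no link; columns: STEM / other.
stem_link, other_link = 1051, 582
stem_none, other_none = 3949, 4418

odds_stem = stem_link / stem_none      # odds of a link in STEM profiles
odds_other = other_link / other_none   # odds of a link in other profiles
odds_ratio = odds_stem / odds_other

# Lower one-sided 95% confidence bound via the log odds-ratio and its
# large-sample standard error.
se_log = math.sqrt(1 / stem_link + 1 / other_link + 1 / stem_none + 1 / other_none)
lower_95 = math.exp(math.log(odds_ratio) - 1.645 * se_log)
print(round(odds_ratio, 2), round(lower_95, 2))
```

This reproduces the values quoted in the text: an odds-ratio of about 2.0 with a lower bound of about 1.8.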
Are there words or phrases that would make good predictors of the outcome? To determine this, the text data must first be processed and cleaned. At this point, there were 63,630 distinct words in the subsample of 10,000 profiles. A variety of features were computed on the data, such as the number of commas, hashtags, mentions, exclamation points, and so on. This resulted in a set of 14 new "text related" features. In these data, there is an abundance of HTML markup tags in the text, such as `<br />` and `<a href="link">`. These were removed from the data, as well as punctuation, line breaks, and other symbols. Additionally, some words were nonsensical repeated characters, such as `***` or `aaaaaaaaaa`, and these were removed as well. Additional types of preprocessing of text are described below; for more information, Christopher, Prabhakar, and Hinrich (2008) is a good introduction to the analysis of text data, while Silge and Robinson (2017) is an excellent reference on the computational aspects of analyzing such data.
Given this set of 63,630 words and their associated outcome classes, odds-ratios can be computed. First, the words were filtered so that only terms with at least 50 occurrences in the 10,000 profiles would be analyzed. This cutoff was determined so that there was a sufficient frequency for modeling. With this constraint, the potential number of keywords was reduced to 4,940 terms. For each of these, the odds-ratio and associated p-value were computed. The p-value tests the hypothesis that the odds of the keyword occurring in either professional group are equal (i.e. an odds-ratio of 1.0). However, the p-values can easily provide misleading results for two reasons:
In isolation, the p-value only relates to the question "Is there a difference in the odds between the two groups?" The more appropriate question is "How much of a difference is there between the groups?" The p-value does not measure the magnitude of the differences.
When testing a single (predefined) hypothesis, the false positive rate for a hypothesis test using \(\alpha = 0.05\) is 5%. However, we are conducting 4,940 tests on the same data set, which drastically raises the collective false positive rate. The false discovery rate (FDR) p-value correction (Efron and Hastie 2016) was developed for just this type of scenario. This procedure uses the entire distribution of p-values and attenuates the false positive rate. If a particular keyword has an FDR value of 0.30, this implies that the collection of keywords with FDR values less than 0.30 has a collective 30% false discovery rate. For this reason, the focus here will be on the FDR values generated using the Benjamini-Hochberg correction (Benjamini and Hochberg 1995).
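The Benjamini-Hochberg adjustment itself is short enough to sketch; the Python version below implements the standard step-up procedure on an illustrative vector of p-values (the values are invented, not keyword results from these data).

```python
import numpy as np

# Benjamini-Hochberg FDR adjustment: sort p-values, scale the i-th
# smallest by n/i, then enforce monotonicity from the largest downward.
def bh_adjust(pvals):
    p = np.asarray(pvals, dtype=float)
    n = len(p)
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)     # p_(i) * n / i
    adjusted = np.minimum.accumulate(ranked[::-1])[::-1]
    out = np.empty(n)
    out[order] = np.clip(adjusted, 0, 1)            # restore input order
    return out

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(bh_adjust(pvals))
```

In practice one would use an existing implementation (e.g. `p.adjust(..., method = "BH")` in R), but the sketch shows why nearby p-values can share the same FDR value after the monotonicity step.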
To characterize these results, a volcano plot is shown in Figure 5.4, where the estimated odds-ratio is shown on the x-axis and the minus log of the FDR values is shown on the y-axis (where larger values indicate higher statistical significance). The size of the points is associated with the number of events found in the sampled data set. Keywords falling in the upper left- and right-hand sides would indicate a strong difference between the classes that is unlikely to be a random result. The plot shows far more keywords that have a higher likelihood of being found in the STEM profiles than in the non-STEM profiles, as more points fall on the upper right-hand side of one on the x-axis. Several of these have extremely high levels of statistical significance, with FDR values that are vanishingly small.
As a rough criterion for "importance", keywords with odds-ratios of at least 2 (in either direction) and an FDR value less than \(10^{-5}\) will be considered for modeling. This results in 50 keywords:
`biotech`, `climbing`, `code`, `coding`, `company`, `computer`, `computers`, `data`, `developer`, `ender`, `engineer`, `engineering`, `feynman`, `firefly`, `fixing`, `geek`, `geeky`, `im`, `internet`, `lab`, `law`, `lol`, `marketing`, `math`, `matrix`, `mechanical`, `neal`, `nerd`, `problems`, `programmer`, `programming`, `robots`, `science`, `scientist`, `silicon`, `software`, `solve`, `solving`, `startup`, `stephenson`, `student`, `systems`, `teacher`, `tech`, `techie`, `technical`, `technology`, `therapist`, `web`, `websites`
Of these keywords, only 7 were enriched in the non-STEM profiles: `im`, `lol`, `law`, `teacher`, `student`, `therapist`, and `marketing`. Of the STEM-enriched keywords, many make sense (and play to stereotypes). The majority are related to occupation (e.g. `engin`, `startup`, and `scienc`), while others are clearly related to popular geek culture, such as `firefli`^{44}, `neal` and `stephenson`^{45}, and `scifi`.
One final set of 9 features was computed related to the sentiment and language of the essays. There are curated collections of words with assigned sentiment values. For example, "horrible" has a fairly negative connotation while "wonderful" is associated with positivity. Words can be assigned qualitative assessments (e.g. "positive", "neutral", etc.) or numeric scores where neutrality is given a value of zero. In some problems, sentiment might be a good predictor of the outcome. This feature set included sentiment-related measures, as well as measures of the point of view (i.e. first-, second-, or third-person text) and other language elements.
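A lexicon-based sentiment score is simply a sum of per-word values; the Python toy below illustrates the idea with an invented four-word lexicon (real analyses use curated lexicons such as AFINN or bing, which were not used to build this sketch).

```python
# Toy lexicon-based sentiment scoring; the word values are invented
# for illustration only.
lexicon = {"horrible": -3, "wonderful": 3, "good": 2, "bad": -2}

def sentiment_score(text):
    # Words absent from the lexicon contribute zero (treated as neutral).
    return sum(lexicon.get(word, 0) for word in text.lower().split())

print(sentiment_score("a wonderful day with horrible traffic"))
```

Here the positive and negative words cancel out, which also hints at a limitation of bag-of-words sentiment: context and negation are ignored.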
Do these features have an impact on a model? A series of logistic regression models were computed using different feature sets:
A basic set of profile characteristics unrelated to the essays (e.g. age, religion, etc.) consisting of 168 predictors (after generating dummy variables). This resulted in a baseline area under the ROC curve of 0.786. The individual resamples for this model are shown in Figure 5.5.
The addition of the simple text features increased the number of predictors to 182. Performance was not appreciably affected, as the AUC value was 0.79. These features were discarded from further use.
The keywords were added to the basic profile features. In this case, performance jumped to 0.847. Recall that the keywords were derived from a smaller portion of these data, so this estimate may be slightly optimistic. However, the entire training set was cross-validated and, if overfitting were severe, the resampled performance estimates would not be very good.
The sentiment and language features were added to this model, producing an AUC of 0.848. This indicates that these aspects of the essays did not provide any predictive value above and beyond what was already captured by the previous features.
Overall, the keyword and basic feature model would be the best version found here^{46}.
The strategy shown here for computing and evaluating features from text is fairly simplistic and is not the only approach that could be taken. Other methods for preprocessing text data include:
 removing commonly used stop words, such as “is”, “the”, “and”, etc.
 stemming the words so that similar words, such as the singular and plural versions, are represented as a single entity.
The SMART lexicon of stop words (Lewis et al. 2004) was filtered out of the current set of words, leaving 63,118 unique results. The words were then "stemmed" using the Porter algorithm (Willett 2006) to convert similar words to a common root. For example, these 7 words are fairly similar: "teach", "teacher", "teachers", "teaches", "teachable", "teaching", and "teachings". Stemming would reduce these to 3 unique values: "teach", "teacher", and "teachabl".
Once these values were stemmed, there were 45,704 unique words remaining. While this does reduce the potential number of terms to analyze, there are potential drawbacks related to a reduced specificity of the text. Schofield and Mimno (2016) and Schofield, Magnusson, and Mimno (2017) demonstrate that there can be harm done using these preprocessors.
One other method for determining relevant features uses the term frequency-inverse document frequency (tf-idf) statistic (Amati and Van R 2002). Here the goal is to find words or terms that are important to individual documents in the collection of documents at hand. For example, if the words in this book were processed, there are some that would have a high frequency (e.g. "predictor", "encode", or "resample") but are unusual in most other contexts.
For a word W in document D, the term frequency, tf, is the number of times that W is contained in D (usually adjusted for the length of D). The inverse document frequency, idf, is a weight that will normalize by how often the word occurs in the current collection of documents. As an example, suppose the term frequency of the word "feynman" in a specific profile was 2. In the sample of profiles used to derive the features, this word occurs in 55 profiles out of 10,000. The inverse document frequency is usually represented as the log of the ratio, or \(\log_2(10000/55)\), which is 7.5. The tf-idf value for "feynman" in this profile would be \(2\times\log_2(10000/55)\) or 15. However, suppose that in the same profile the word "internet" also occurs twice. This word is more prevalent across profiles; it is contained at least once in 1,529 profiles. Its tf-idf value is 5.4; in this way the same raw count is downweighted according to the word's abundance in the overall data set. The word "feynman" occurs the same number of times as "internet" in this hypothetical profile, but "feynman" is more distinctive or important (as measured by tf-idf) because it is rarer overall, and thus possibly more effective as a feature for prediction.
One potential issue with tf-idf in predictive models is the notion of the "current collection of documents." This collection can be defined for the training set, but what should be done when a single new sample is being predicted (i.e. there is no collection)? One approach is to use the overall idf values from the training set to weight the term frequencies found in new samples.
Additionally, all of the previous approaches have considered a single word at a time. Sequences of consecutive words can also be considered and might provide additional information. n-grams are terms built from sequences of n consecutive words. One might imagine that the sequence "I hate computers" would be scored differently than the simple term "computer" when predicting whether a profile corresponds to a STEM profession.
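Extracting n-grams from tokenized text is a one-line sliding window; the Python sketch below shows the bigrams produced from the example phrase in the text.

```python
# Slide a window of length n across the token list and join each window.
def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "i hate computers".split()
print(ngrams(tokens, 2))
```

A feature set could then count occurrences of each bigram per profile, exactly as was done for single keywords above, at the cost of a much larger vocabulary.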
5.7 Factors versus Dummy Variables in Tree-Based Models
As previously mentioned, certain types of models have the ability to use categorical data in its natural form (i.e. without conversions to dummy variables). A simple regression tree (L. Breiman et al. 1984) using the day of the week predictor for the Chicago data resulted in this simple split for prediction:
if day in {Sun, Sat} then ridership = 4.4K
else ridership = 17.3K
Suppose the day of the week had been converted to dummy variables. What would have occurred? In this case, the model is slightly more complex since it can only create rules as a function of a single dummy variable at a time:
if day = Sun then ridership = 3.84K
else if day = Sat then ridership = 4.96K
else ridership = 17.30K
The non-dummy variable model could have resulted in the same structure if it had found that the results for Saturday and Sunday were sufficiently different to merit an extra set of splits. This leads to a question about using categorical predictors in tree-based models: does it matter how the predictors are encoded?
To answer this question, a series of experiments were conducted. Several classification data sets were used to make the comparison between the different encodings. These are summarized in Table 5.5. Two data sets contained both ordered and unordered factors. As described below, the ordered factors were treated in different ways.
|                        | Attrition | Cars | Churn | German Credit | HPC  |
|------------------------|-----------|------|-------|---------------|------|
| n                      | 1470      | 1728 | 5000  | 1000          | 4331 |
| p                      | 30        | 6    | 19    | 20            | 7    |
| Classes                | 2         | 4    | 2     | 2             | 4    |
| Numeric Predictors     | 16        | 0    | 15    | 7             | 5    |
| Factor Predictors      | 14        | 6    | 4     | 13            | 2    |
| Ordered Factors        | 7         | 0    | 0     | 1             | 0    |
| Factors with 3+ Levels | 12        | 7    | 2     | 11            | 3    |
| Factors with 5+ Levels | 3         | 0    | 1     | 5             | 2    |
For each iteration of the simulation, 75% of the data was used for the training set and 10-fold cross-validation was used to tune the models. When the outcome variable had two levels, the area under the ROC curve was maximized. For the other data sets, the multinomial log-likelihood was optimized. The same number of tuning parameter values was evaluated although, in some cases, the values of these parameters were different due to the number of predictors in the data before and after dummy variable generation. The same resamples and random numbers were used for the models with and without dummy variables. When the data contained a significant class imbalance, the data were downsampled to compensate.
For each data set, models were fit with the data in its original form as well as (unordered) dummy variables. For the two data sets with ordinal data, an additional model was created using ordinal dummy variables (i.e. polynomial contrasts).
Several models were fit to the data sets. Unless otherwise stated, the default software parameters were used for each model.
 Single CART trees (L. Breiman et al. 1984). The complexity parameter was chosen using the "one-standard-error rule" that is internal to the recursive partitioning algorithm.
 Bagged CART trees (Breiman 1996). Each model contained 50 constituent trees.
 Single C5.0 trees and single C5.0 rulesets (Kuhn and Johnson 2013)
 Single conditional inference trees (Hothorn, Hornik, and Zeileis 2006). A grid of 10 values for the p-value threshold for splitting was evaluated.
 Boosted CART trees (a.k.a. stochastic gradient boosting; Friedman (2002)). The models were optimized over the number of trees, the learning rate, the tree depth, and the number of samples required to make additional splits. Twenty-five parameter combinations were evaluated using random search.
 Boosted C5.0 trees. These were tuned for the number of iterations.
 Boosted C5.0 rules. These were also tuned for the number of iterations.
 Random forests using CART trees (Breiman 2001). Each forest contained 1,500 trees and the number of variables selected for a split was tuned over a grid of 10 values.
 Random forests using conditional (unbiased) inference trees (Strobl et al. 2007). Each forest contained 100 trees and the number of variables selected for a split was tuned over a grid of 10 values.
For each model, a variety of performance metrics were estimated using resampling, as well as the total time to train and tune the model. The programs used in the simulation are contained in the GitHub repository topepo/dummiesvsfactors.
For the three data sets with two classes, Figure 5.6 shows a summary of the results for the area under the ROC curve. In the plot, the percent difference in performance is calculated using
\[ \%Difference = \frac{Factor - Dummy}{Factor}\times 100 \]
In this way, positive values indicate that the factor encodings have better performance. The image shows the median in addition to the lower and upper 5% percentiles of the distribution. The gray dotted line indicates no difference between the encodings.
The results of these simulations show that, for these data sets, there is no real difference in the area under the ROC curve between the encoding methods. This also appears to be true when simple factor encodings are compared to dummy variables generated from polynomial contrasts for ordered predictors^{47}. Of the ten models, there are only two cases out of 40 scenarios where the mainstream of the distribution did not cover zero. Stochastic gradient boosting and bagged CART trees are both ensemble methods, and these showed a 2%-4% drop in the area under the ROC curve when using factors instead of dummy variables for a single data set.
Another metric, overall accuracy, can also be assessed. These results are shown in Figure 5.7 where all of the models can be taken into account. In this case, the results are mixed. When comparing factors to unordered dummy variables, two of the models show differences in encodings. The churn data shows similar results to the ROC curve metrics. The car evaluation data demonstrates a nearly uniform effect where factor encodings do better than dummy variables. Recall that in the car data, which has four classes, all of the predictors are categorical. For this reason, it is likely to show the most effect of all of the data sets.
In the case of dummy variables generated using polynomial contrasts, neither of the data sets shows a difference between the two encodings. However, the car evaluation data shows a pattern where the factor encodings had no difference compared to polynomial contrasts, but, when compared to unordered dummy variables, the factor encoding is superior. This indicates that the underlying trend in the data follows a polynomial pattern.
In terms of performance, it appears that differences between the two encodings are rare (but can occur). One might infer that, since the car data contains all categorical variables, this situation would be a good indicator for when to use factors instead of dummy variables. However, two of the data sets (Attrition and German Credit) have a high proportion of categorical predictors and show no difference. In some data sets, the effect of the encoding will depend on whether the categorical predictor(s) are important to the outcome and in what way.
In summary, while few differences were seen, it is very difficult to predict when a difference will occur.
However, one other statistic was computed for each of the simulations: the time to train the models. Figure 5.8 shows the speed-up gained by using factors instead of dummy variables (i.e. a value of 2.5 indicates that dummy variable models are two and a half times slower than factor encoding models). Here, there is a very strong trend that factor-based models are more efficiently trained than their dummy variable counterparts. The reason is likely that the expanded number of predictors (caused by generating dummy variables) requires more computational time than the method for determining the optimal split of factor levels. The exceptions to this trend are the models using conditional inference trees.
For a guideline, we suggest using the predictors without converting to dummy variables and, if the model appears promising, to also try refitting using dummy variables.
5.8 Computing
R programs for reproducing these analyses can be found at https://github.com/topepo/TBD.
References
Haase, R. 2011. Multivariate General Linear Models. Sage.
Timm, N, and J Carlson. 1975. “Analysis of Variance Through Full Rank Models.” Multivariate Behavioral Research Monographs. Society of Multivariate Experimental Psychology.
Preneel, B. 2010. “Cryptographic Hash Functions: Theory and Practice.” In ICICS, 1–3.
Weinberger, K, A Dasgupta, J Langford, A Smola, and J Attenberg. 2009. “Feature Hashing for Large Scale Multitask Learning.” In Proceedings of the 26th Annual International Conference on Machine Learning, 1113–20. ACM.
Box, GEP, W Hunter, and J Hunter. 2005. Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building. Wiley.
MicciBarreca, D. 2001. “A Preprocessing Scheme for HighCardinality Categorical Attributes in Classification and Prediction Problems.” ACM SIGKDD Explorations Newsletter 3 (1):27–32.
Zumel, N., and J. Mount. 2016. "vtreat: A data.frame Processor for Predictive Modeling." arXiv.org.
McElreath, R. 2015. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Chapman; Hall/CRC.
West, B, K Welch, and A Galecki. 2014. Linear Mixed Models: A Practical Guide Using Statistical Software. CRC Press.
Guo, C, and F Berkhahn. 2016. “Entity embeddings of categorical variables.” arXiv.org.
Chollet, F, and JJ Allaire. 2018. Deep Learning with R. Manning.
Agresti, A. 2012. Categorical Data Analysis. WileyInterscience.
Christopher, D, R Prabhakar, and S Hinrich. 2008. Introduction to Information Retrieval. Cambridge University Press.
Silge, J, and D Robinson. 2017. Text Mining with R: A Tidy Approach. O’Reilly.
Efron, B, and T Hastie. 2016. Computer Age Statistical Inference. Cambridge University Press.
Benjamini, Y, and Y Hochberg. 1995. “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” Journal of the Royal Statistical Society. Series B (Methodological). JSTOR, 289–300.
Lewis, D, Y Yang, T Rose, and F Li. 2004. “Rcv1: A New Benchmark Collection for Text Categorization Research.” Journal of Machine Learning Research 5:361–97.
Willett, P. 2006. “The Porter Stemming Algorithm: Then and Now.” Program 40 (3):219–23.
Schofield, A, and D Mimno. 2016. “Comparing Apples to Apple: The Effects of Stemmers on Topic Models.” Transactions of the Association for Computational Linguistics 4:287–300.
Schofield, A, M Magnusson, and D Mimno. 2017. “Understanding Text PreProcessing for Latent Dirichlet Allocation.” In Proceedings of the 15th Conference of the European chapter of the Association for Computational Linguistics, 2:432–36.
Amati, G, and Cornelis J Van R. 2002. “Probabilistic Models of Information Retrieval Based on Measuring the Divergence from Randomness.” ACM Transactions on Information Systems 20 (4):357–89.
Breiman, L., J. Friedman, R. Olshen, and C. Stone. 1984. Classification and Regression Trees. New York: Chapman; Hall.
Breiman, L. 1996. “Bagging Predictors.” Machine Learning 24 (2). Springer:123–40.
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Vol. 26. Springer.
Hothorn, T, K Hornik, and A Zeileis. 2006. “Unbiased Recursive Partitioning: A Conditional Inference Framework.” Journal of Computational and Graphical Statistics 15 (3). Taylor & Francis:651–74.
Friedman, J. 2002. “Stochastic Gradient Boosting.” Computational Statistics & Data Analysis 38 (4). Elsevier:367–78.
Breiman, L. 2001. “Random Forests.” Machine Learning 45 (1). Springer:5–32.
Strobl, C, AL Boulesteix, A Zeileis, and T Hothorn. 2007. “Bias in Random Forest Variable Importance Measures: Illustrations, Sources and a Solution.” BMC Bioinformatics 8 (1):25.
These computations use the "MurmurHash3" hash found at https://github.com/aappleby/smhasher and implemented in the FeatureHashing R package.↩
Traditional analysis of alias structures could be applied to this problem. However, more research would be needed since those methods are usually applied to balanced designs where the values of the predictors have the same prevalence. Also, as previously mentioned, they tend to focus on aliasing between multiple variables. In any case, there is much room for improvement on this front.↩
In this situation, the shrinkage of the values towards the mean is called partial pooling of the estimates.↩
This was the feature set that was used in Chapter 3 when resampling and model comparisons were discussed.↩
Note that many of the nonensemble methods, such as C5.0 trees/rules, CART, and conditional inference trees, show a significant amount of variation. This is due to the fact that they are unstable models with high variance.↩