5 Encoding Categorical Predictors

Categorical or nominal predictors are those that contain qualitative data. For the OkCupid data, examples include education (e.g. high school, two year college, college, etc.) and diet (e.g. anything, vegan, vegetarian, etc.). In the Ames data, the type of house and neighborhood are predictors that have no numeric scale. However, while numeric, the ZIP code also qualifies as a qualitative predictor because the numeric values have no continuous meaning. Simply put, there is no reason that a zip code of 21212 is 14,827 “more” than a zip code of 06385.

Categorical predictors also can be derived from unstructured or open text. The OkCupid data contains sets of optional essays that describe individuals’ interests and other personal information. Predictive models that depend on numeric inputs cannot directly handle open text fields. Instead, these information rich data need to be processed prior to presenting the information to a model. Text information can be processed in different ways. For example, if keywords are extracted from text to use as predictors, then should these include single words or strings of words? Should words be stemmed so that a root word is used (e.g. “comput” being short for “computer”, “computers”, “computing”, “computed” etc.) or should a regular expression pattern matching approach be used (e.g. ^comput)?

Simple categorical variables can also be classified as ordered or unordered. A variable with values “Bad”, “Good”, and “Better” shows a clear progression of values. While the difference between these categories may not be precisely numerically quantifiable, there is a meaningful ordering. To contrast, consider another variable that takes values of “French”, “Indian”, or “Peruvian”. These categories have no meaningful ordering. Ordered and unordered factors might require different approaches for including the embedded information in a model.

As with other preprocessing steps, the approach to including the predictors depends on the type of model. A large majority of models require that all predictors be numeric. There are, however, some exceptions. Algorithms for tree-based models can naturally handle splitting numeric or categorical predictors. These algorithms employ a series if/then statements that sequentially split the data into groups. For example, in the Chicago data, the day of the week is a strong predictor and a tree-based model would likely include a model component such as

if day in {Sun, Sat} then ridership = 4.4K
 else ridership = 17.3K

As another example, a naive Bayes model can create a cross-tabulation between a categorical predictor and the outcome class and this frequency distribution is factored into the model’s probability calculations. In this situation, the categories can be processed in their natural format. The final section of this chapter investigates how categorical predictors interact with tree-based models.

Tree-based and naive Bayes models are exceptions; most models require that the predictors take numeric form. This chapter focuses primarily on methods that encode categorical data to numeric values.

Many of the analyses in this chapter use the OkCupid data that were introduced in Section 3.1 and discussed in the previous chapter. In this chapter, we will explore specific variables in more detail. One issue in doing this is related to overfitting. Compared to other data sets discussed in this book, the OkCupid data set contains a large number of potential categorical predictors, many of which have a low prevalence. It would be problematic to take the entire training set, examine specific trends and variables, then add features to the model based on these analyses. Even though we cross-validate the models on these data, we may overfit and find predictive relationships that are not generalizable to other data. The data set is not small; the training set consists of 38,809 data points. However, for the analyses here, a subsample of 5,000 STEM profiles and 5,000 non-STEM profiles were selected from the training set for the purpose of discovering new predictors for the models.

5.1 Creating Dummy Variables for Unordered Categories

The most basic approach to representing categorical values as numeric data is to create dummy variables. These are artificial numeric variables that capture some aspect of one (or more) of the categorical values. There are many methods for doing this and, to illustrate, let’s first consider a simple example for the day of the week. If we take the seven possible values and convert them into binary dummy variables, the mathematical function required to make the translation is often referred to as a contrast or parameterization function. An example of a contrast function is called the “reference cell” or “treatment” contrast where one of the values of the predictor is left unaccounted for in the resulting dummy variables. Using Sunday as the reference cell, the contrast function would create six dummy variables:

Original Value
Dummy Variables
Mon Tues Wed Thurs Fri Sat
Sun 0 0 0 0 0 0
Mon 1 0 0 0 0 0
Tues 0 1 0 0 0 0
Wed 0 0 1 0 0 0
Thurs 0 0 0 1 0 0
Fri 0 0 0 0 1 0
Sat 0 0 0 0 0 1

These six numeric predictors would take the place of the original categorical variable.

Why only six? There are two related reasons. First, if you know the values of the six dummy variables, you can directly infer the seventh. The second reason is more technical. When fitting linear models, the design matrix \(X\) is created. When the model has an intercept, an additional initial column of ones for all rows is included. Estimating the parameters for a linear model (as well as other similar models) involve inverting the matrix \((X'X)\). If the model includes an intercept and contains dummy variables for all seven days, then the seven day columns would add up (row-wise) to the intercept and this linear combination would prevent the matrix inverse from being computed (as it is singular). When this occurs, the design matrix said to be less than full rank or overdetermined. When there are \(C\) possible values of the predictor and only \(C-1\) dummy variables are used, the matrix inverse can be computed and the contrast method is said to be a full rank parameterization (Haase 2011).

How would we interpret these dummy variables? That depends on what type of model is being used but consider a linear model for the Chicago transit data that only uses the day of the week in the model with the reference cell parameterization above. Using the training set to fit the model, the intercept value estimates the mean of the reference cell, which is the average number of Sunday riders in the training set, and was estimated to be 3.84K people. The second model parameter, for Monday, is estimated to be 12.61K. In the reference cell model, the dummy variables represent the mean value above and beyond the reference cell mean. In this case, estimate indicates that there were 12.61K more riders on Monday than Sunday. The overall estimate of Monday ridership adds the estimates from the intercept and dummy variable (16.45K rides).

When there are more than one categorical predictors, the reference cell becomes multidimensional. Suppose there were a predictor for the weather that has only a few values: “clear”, “cloudy”, “rain”, and “snow”. Let’s consider “clear” to be the reference cell. If this variable were included in the model, the intercept would correspond to the mean of the Sundays with a clear sky. The interpretation of each set of dummy variables does not change. The average ridership for a cloudy Monday would augment the average clear Sunday ridership with the average incremental effect of cloudy and the average incremental effect of Monday.

There are other contrast functions for dummy variables. The “cell means” parameterization (Timm and Carlson 1975) would create a dummy variable for each day of the week and would not include an intercept to avoid the problem of singularity of the design matrix. In this case, the estimates for each dummy variable would correspond to the average value for that category. For the Chicago data, the parameter estimate for the Monday dummy variable would simply be 16.45K.

There are several other contrast methods for constructing dummy variables. Another, based on polynomial functions, is discussed below for ordinal data.

5.2 Encoding Predictors with Many Categories

If there are \(C\) categories, what happens when \(C\) becomes very large? For example, ZIP code in the United States may be an important predictor for outcomes that are affected by a geographic component. There are more than 40K possible ZIP codes and, depending on how the data are collected, this might produce an overabundance of dummy variables (relative to the number of data points). As mentioned in the previous section, this can cause the data matrix to be overdetermined and restrict the use of certain models. Also, ZIP codes in highly populated areas may have a higher rate of occurrence in the data, leading to a “long tail” of locations that are infrequently observed.

One potential issue is that resampling might exclude some of the more rare categories from the analysis set. This would lead to dummy variable columns in the data that contain all zeros and, for many models, this would become a numerical issue that will cause an error. Moreover the model will not be able to provide a relevant prediction for new samples that contain this predictor. When a predictor contains a single value, we call this a zero-variance predictor because there truly is no variation displayed by the predictor.

The first way to handle this issue is to create the full set of dummy variables and simply remove the zero-variance predictors. This is a simple and effective approach but it may be difficult to know a priori what terms will be in the model. In other words, during resampling, there may be a different number of model parameters across resamples. This can be a good side-effect since it captures the variance caused by omitting rarely occurring values and propagates this noise into the resampling estimates of performance.

Prior to producing dummy variables, one might consider determining which (if any) of these variables are near-zero variance predictors or have the potential to have near zero variance during the resampling process. These are predictors that have few unique values (such as two values for binary dummy variables) and occur infrequently in the data. For the training set, we would consider the ratio of the frequencies of the most commonly occurring value to the second most commonly occurring. For dummy variables, this is simply the ratio of the numbers of ones and zeros. Suppose that a dummy variable has 990 values of zero and 10 ones. The ratio of these frequencies, 99, indicates that it could easily be converted into all zeros during resampling. We suggest a rough cutoff of 19 to declare such a variable “too rare” although the context may raise or lower this bar for different problems.

Although near-zero variance predictors likely contain little valuable predictive information, we may not desire to filter these out. One way to avoid filtering these predictors is to redefine the predictor’s categories prior to creating dummy variables. Instead we can create an “other” category that pools the rarely occurring categories, assuming that such a pooling is sensible. A cutoff based on the training set frequency can be specified that indicates which categories should be combined. Again, this should occur during resampling so that the final estimates reflect the variability in which predictor values are included in the “other” group.

Another way to combine categories is to use a hashing function (or hashes). Hashes are used to map one set of values to another set of values and are typically used in databases and cryptography (Preneel 2010). The original values are called the keys and are typically mapped to a smaller set of artificial hash values. In our context, a potentially large number of predictor categories are the keys and we would like to represent them using a smaller number of categories (i.e. the hashes). The number of possible hashes is set by the user and, for numerical purposes, is a power of 2. Some computationally interesting aspects to hash functions are

  1. The only data required is the value being hashed and the resulting number of hashes. This is different from the previously described contrast function using a reference cell.
  2. The translation process is completely deterministic.
  3. In traditional applications, it is important that the number of keys that are mapped to hashes is relatively uniform. This would be important if the hashes are used in cryptography since it increases the difficulty of guessing the result. As discussed below, this might not be a good characteristic for data analysis.
  4. Hash functions are unidirectional; once the hash values are created, there is no way of knowing the original values. If there are a known and finite set of original values, a table can be created to do the translation but, otherwise, the keys are indeterminable when only the hash value is known.
  5. There is no free lunch when using this procedure; some of the original categories will be mapped to the same hash value (called a “collision”). The number of collisions will be largely determined by the number of features that are produced.

When using hashes to create dummy variables, the procedure is called “feature hashing” or the “hash trick” (Weinberger et al. 2009). Since there are many different hashing functions, there are different methods for producing the table that maps the original set to the reduced set of hashes.

As a simple example, a sample of 20 locations from the OkCupid data were hashed39. Five of these cities are used to illustrate the details:

Hash Feature Number
Hash Integer Value 16 cols 256 cols
mountain view -1267876914 15 207
martinez -157684639 2 98
berkeley 1166288024 9 153
south san francisco -373608504 9 201
san mateo 986716290 3 131
san leandro 1219729949 14 30

The hash value is an integer value derived solely from each city’s text value. In order to convert this to a new feature in the data set, modular arithmetic is used. Suppose that 16 features are desired. To create a mapping from the integer to one of the 16 feature columns, integer division is used on the hash value using the formula \(column = (integer\mod 16) + 1\). For example, for the string “mountain view”, we have \((-1267876914 \mod 16) + 1 = 15\), so this city is mapped to the fifteenth dummy variable feature. The last column above shows the assignments for the cities when 256 feature are requested; \(\mod 256\) has more possible remainders than \(\mod 16\) so more features are possible. With 16 features, the dummy variable assignments for all of the cities in the data set are:

Original Value
Hash Variables
1 2 3 4 5 6 9 12 13 14 15 16
alameda 1 0 0 0 1 0 0 0 0 0 0 0
belmont 1 0 0 0 0 0 0 0 0 0 1 0
benicia 1 0 0 1 0 0 0 0 0 0 0 0
berkeley 1 0 0 0 0 0 1 0 0 0 0 0
castro valley 1 0 0 0 0 0 0 1 0 0 0 0
daly city 1 0 0 0 0 0 0 0 0 0 1 0
emeryville 1 0 0 0 0 0 0 0 1 0 0 0
fairfax 1 1 0 0 0 0 0 0 0 0 0 0
martinez 1 1 0 0 0 0 0 0 0 0 0 0
menlo park 1 0 0 0 0 0 0 0 0 1 0 0
mountain view 1 0 0 0 0 0 0 0 0 0 1 0
oakland 1 0 0 0 0 0 0 1 0 0 0 0
other 1 0 0 0 0 0 1 0 0 0 0 0
palo alto 1 1 0 0 0 0 0 0 0 0 0 0
san francisco 1 0 0 0 0 1 0 0 0 0 0 0
san leandro 1 0 0 0 0 0 0 0 0 1 0 0
san mateo 1 0 1 0 0 0 0 0 0 0 0 0
san rafael 1 0 0 0 0 0 0 0 0 0 0 1
south san francisco 1 0 0 0 0 0 1 0 0 0 0 0
walnut creek 1 1 0 0 0 0 0 0 0 0 0 0

There are several characteristics of this table to notice. First there are 4 hashes (columns) that are not shown in this table since they contain all zeros. This occurs because none of the hash integer values had those specific mod 16 remainders. This could be a problem for predictors that have the chance to take different string values than the training data and would be placed in one of these columns. In this case, it would be wise to create an ‘other’ category to ensure that new strings have a hashed representation. Next notice that there were 6 hashes with no collisions (i.e. have a single 1 in their column). But several columns exhibit collisions. An example of a collision is hash number 14 which encodes for both Menlo Park and San Leandro. In statistical terms, these two categories are said to be aliased or confounded, meaning that a parameter estimated from this hash cannot isolate the effect of either city. Alias structures have long been studied in the field of statistical experimental design although usually in the context of aliasing between variables rather than within variables (Box, Hunter, and Hunter 2005). Regardless, from a statistical perspective, minimizing the amount and degree of aliasing is an important aspect of a contrast scheme.

Some hash functions can also be signed, meaning that instead of producing a binary indicator for the new feature, possible values could be -1, 0, or +1. A zero would indicate that the category is not associated with that specific feature and the \(\pm 1\) values can indicate different values of the original data. Suppose that two categories map to the same hash value. A binary hash would result in a collision but a signed hash could avoid this by encoding one category as +1 and the other as -1. Here are the same data as above with a signed hash of the same size:

Original Value
Hash Variables
1 2 3 4 5 6 9 12 13 14 15 16
alameda 1 0 0 0 1 0 0 0 0 0 0 0
belmont 1 0 0 0 0 0 0 0 0 0 1 0
benicia 1 0 0 1 0 0 0 0 0 0 0 0
berkeley 1 0 0 0 0 0 -1 0 0 0 0 0
castro valley 1 0 0 0 0 0 0 -1 0 0 0 0
daly city 1 0 0 0 0 0 0 0 0 0 1 0
emeryville 1 0 0 0 0 0 0 0 1 0 0 0
fairfax 1 -1 0 0 0 0 0 0 0 0 0 0
martinez 1 1 0 0 0 0 0 0 0 0 0 0
menlo park 1 0 0 0 0 0 0 0 0 1 0 0
mountain view 1 0 0 0 0 0 0 0 0 0 -1 0
oakland 1 0 0 0 0 0 0 1 0 0 0 0
other 1 0 0 0 0 0 1 0 0 0 0 0
palo alto 1 1 0 0 0 0 0 0 0 0 0 0
san francisco 1 0 0 0 0 -1 0 0 0 0 0 0
san leandro 1 0 0 0 0 0 0 0 0 -1 0 0
san mateo 1 0 1 0 0 0 0 0 0 0 0 0
san rafael 1 0 0 0 0 0 0 0 0 0 0 -1
south san francisco 1 0 0 0 0 0 -1 0 0 0 0 0
walnut creek 1 1 0 0 0 0 0 0 0 0 0 0

Notice that the collision in hash 14 is now gone since a value of 1 corresponds to Menlo Park and San Leandro is encoded as -1. For a linear regression model, the coefficient for the 14th hash now represents the positive or negative shift in the response due to Menlo Park or San Leandro, respectively.

What level of aliasing should we expect? To investigate, data were simulated so that there were \(10^4\) predictor categories and each category consisted of 20 unique, randomly generated character values. Binary and signed hashes of various sizes were generated and the collision rate was determined. This simulation was done 10 different times and the average percent collision was calculated. Figure 5.1 shows the results where the y-axis is the rate of any level of collision in the features. As expected, the signed values are associated with less collisions than the binary encoding. The green line corresponds to the full set of dummy variables (without using hashes). For these data, using signed features would give the best aliasing results but the percentage of collisions is fairly high.

The average rate of collisions in a simulation study on a predictor with 10K categories.

Fig. 5.1: The average rate of collisions in a simulation study on a predictor with 10K categories.

Because of how the new features are created, feature hashing is oblivious to the concept of predictor aliasing. Uniformity of the hash function is an important characteristic when hashes are used in their typical applications. However, it may be a liability in data analysis. The reduction of aliasing is an important concept in statistics; we would like to have the most specific representations of categories as often as possible for a few different reasons:

  • Less aliasing allows for better interpretation of the results. If a predictor has a significant impact on the model, it is probably a good idea understand why. When multiple categories are aliased due to a collision, untangling the true effect may be very difficult. Although this may help us narrow down which categories are affecting the response.
  • Categories involved in collisions are not related in any meaningful way. For example, San Leandro and Menlo Park were not aliased because they share a particular similarity or geographic closeness. Because of the arbitrary nature of the collisions, it is possible to have different categories whose true underlying effect are counter to one another. This might have the effect of negating the impact of the hashed feature.
  • Hashing functions have no notion of the probability that each key will occur. As such, it is conceivable that a category that occurs with great frequency is aliased with one that is rare. In this case, the more abundant value will have a much larger influence on the effect of that hashing feature.

Although we are not aware of any statistically-conscious competitor to feature hashing, there are some characteristics that a contrast function should have when used with a predictor containing a large number of categories40. For example, if there is the capacity to have some categories unaliased with any others, the most frequently occurring values should be chosen for these roles. The reason for this suggestion is that it would generate a high degree of specificity for the categories that are most likely to be exposed to the contrast function. One side effect of this strategy would be to alias less frequently occurring values to one another. This might be beneficial since it reduces the sparsity of those dummy variables. For example, if two categories that each account for 5% of the training set values are combined, the dummy variable has a two-fold increase in the number of ones. This would reduce the chances of being filtered out during resampling and might also lessen the potential effect that such a near-zero variance predictor might have on the analysis.

5.3 Approaches for Novel Categories

Suppose that we have built a model to predict the probability that an individual works in a STEM profession and that this model depends on geographic location (e.g. city). The model will be able to predict the probability of STEM profession if a new individual lives in one of the 20 cities. But what happens to the model prediction when a new individual lives in a city that is not represented in the original data? if the models are solely based on dummy variables, then the models will not have seen this information and will not be able to generate a prediction.

If there is a possibility of encountering a new category in the future, one strategy would be to use the previously mentioned “other” category to capture new values. While this approach may not be the most effective at extracting predictive information relative to the response for this specific category, it does enable the original model to be used without completely refitting it. This approach can also be used with feature hashing since the new categories are converted into a unique “other” category that is hashed in the same manner as the rest of the predictor values; however, we do need to ensure that the “other” category is present in the training/testing data. Alternatively we could ensure that the “other” category collides with another hashed category so that the model can be used to predict a new sample.

Note that the concept of a predictor with novel categories is not an issue in the training/testing phase of a model. In this phase the categories for all predictors in the existing data are known at the time of modeling. If a specific predictor category is present in only the training set (or vice-versa), a dummy variable can still be created for both data sets although it will be an zero-variance predictor in one of them.

5.4 Encodings for Ordered Data

As we saw from the previous sections an unordered predictor with \(C\) categories can be represented by \(C-1\) binary dummy variables or a hashed version of binary dummy variables. These methods effectively present the categorical information to the models. But now suppose that the \(C\) categories have a relative ordering. For example, consider a predictor that has the categories of “low”, “medium”, and “high.” We could convert this information to 2 binary predictors: \(X_{low} = 1\) for samples categorized as “low” and = 0 otherwise, and \(X_{medium} = 1\) for samples categorized as “medium” and = 0 otherwise. These new predictors would accurately identify low, medium, and high samples, but the two new predictors would miss the information contained in the relative ordering which could be very important relative to the response.

Ordered categorical predictors would need a different way to be presented to a model to uncover the relationship of the order to the response. Ordered categories may have a linear relationship with the response. For instance we may see an increase of approximately 5 units in the response as we move from low to medium and an increase of approximately 5 units in the response as we move from medium to high. To allow a model to uncover this relationship we must present the model with a numeric encoding of the ordered categories that represents a linear ordering. In the field of Statistics this type of encoding is referred to as a polynomial contrast. A contrast has the characteristic that it is a single comparison (i.e. one degree of freedom) and its coefficients sum to zero. For the “low”, “medium”, “high” example above the contrast to uncover a linear trend would be -0.71, 0, 0.71, where low samples encode to -0.71, medium samples to 0, and high samples to 0.71. Polynomial contrasts can extend to nonlinear shapes, too. If the relationship between the predictor and the response was best described by a quadratic trend, then the contrast would be 0.41, -0.82, and 0.41 (Table 5.1). Conveniently, these types of contrasts can be generated for predictors with any number of ordered factors, but the complexity of the contrast is constrained to one less than the number of categories in the original predictor. For instance, we could not explore a cubic relationship between a predictor and the response with a predictor that has only 3 categories.

By employing polynomial contrasts, we can investigate multiple relationships (linear, quadratic, etc.) simultaneously by including these in the same model. When we do this, the form of the numeric representation is important. Specifically we desire the new predictors to contain unique information. To do this the numeric representations are required to be orthogonal. This means that the dot product of the contrast vectors is 0.

Table 5.1: An example of linear and quadratic polynomial contrasts for an ordered categorical predictor with three levels.
Original Value
Dummy Variables
Linear Quadratic
low -0.71 0.41
medium 0.00 -0.82
high 0.71 0.41

It is important to recognize that patterns described by polynomial contrasts may not effectively relate a predictor to the response. For example, in some cases, one might expect a trend where “low” and “middle” samples have a roughly equivalent response but “high” samples have a much different response. In this case, polynomial contrasts are unlikely to be effective at modeling this trend. Another downside to polynomial contrasts for ordered categories occurs when there are a moderate to high number of categories. If an ordered predictor has \(C\) levels, the encoding into dummy variables uses polynomials up to degree \(C-1\). It is very unlikely that these higher level polynomials are modeling important trends (e.g. octic patterns) and it might make sense to place a limit on the polynomial degree. In practice, we rarely explore the effectiveness of anything more than a quadratic polynomial.

As an alternative to polynomial contrasts, one could:

  • Treat the predictors as unordered factors. This would allow for patterns that are not covered by the polynomial feature set. Of course, if the true underlying pattern is linear or quadratic, unordered dummy variables will make the model’s work more difficult.

  • Translate the ordered categories into a single set of numeric scores based on context-specific information. For example, when discussing failure modes of a piece of computer hardware, experts would be able to rank the severity of a type of failure on an integer scale. A minor failure might be scored as a “1” while a catastrophic failure mode could be given a score of “10” and so on.

Simple visualizations and context-specific expertise can be used to understand whether either of these approaches are good ideas.

5.5 Creating Features from Text Data

Often, data contain textual fields that are gathered from questionnaires, articles, reviews, tweets, and other sources. For example, the OkCupid data contains the responses to nine open text questions, such as “my self-summary” and “six things I could never do without”. The open responses are likely to contain important information relative to the outcome. Individuals in a STEM field are likely to have a different vocabulary than, say, someone in the humanities. Therefore the words or phrases used in these open text fields could be very important to predictng the outcome. This kind of data is qualitative and requires more effort to put in a form that models can consume. How then can these data be explored and represented for a model?

For example, some profile text answers contained links to external websites. A reasonable hypothesis is that the presence of a hyperlink could be related to the person’s profession. Table 5.2 shows the results for the random subset of profiles. The rate of hyperlinks in the STEM profiles was 0.21% while 11.6% of the non-STEM profiles contained at least one link. One way to evaluate a difference in two proportions is called the odds-ratio (Agresti 2012). First, the odds of an event that occurs with rate \(p\) is defined as \(p/(1-p)\). For the STEM profiles, the odds of containing a hyperlink are relatively small with a value of \(0.21/0.79 = 0.266\). For the non-STEM profiles, it is even smaller (0.132). However, the ratio of these two quantities indicates how much more likely the presence of a hyperlink would be between the two professions. In this case, the odds-ratio indicates that is is nearly 2.0-times more likely for the STEM profiles to contain a link. Basic statistics can be used to assign a lower 95% confidence interval on this quantity. For this data the lower bound is 1.8, which indicates that this increase in the link rate is unlikely to be due to random noise since it does not include a value of 1.0. Given these results, the indicator of a hyperlink will likely benefit a model and should be included.

Table 5.2: A cross-tabulation between the existence of at least one hyperlink in the OkCupid essay text and the profession.
stem other
Link 1051 582
No Link 3949 4418

Are there words or phrases that would make good predictors of the outcome? To determine this, the text data must first be processed and cleaned. Common methods for preprocessing text data are to:

  • remove commonly used stop words, such as “is”, “the”, “and”, etc.
  • stem the words so that similar words, such as the singular and plural versions, are represented as a single entity.

Christopher, Prabhakar, and Hinrich (2008) is a good introduction to the analysis of text data while Silge and Robinson (2017) is an excellent reference on the computational aspects of analyzing such data.

In these data, there are an abundance of HTML markup tags in the text, such as <br /> and <a href= \"link\">. These were removed from the data, as well as punctuation, line breaks, and other symbols. Additionally, there were some words that were nonsensical repeated characters, such as *** or aaaaaaaaaa, that were removed. At this point, there were 63,629 distinct words in the subsample of 10,000 profiles. However, commonly used words still remained. The SMART lexicon of stop words (Lewis et al. 2004) was filtered out of the current set of words, leaving 63,117 unique results. The words were then “stemmed” using the Porter algorithm (Willett 2006) to convert similar words to a common root. For example, these 7 words are fairly similar: "teach", "teacher", "teachers", "teaches", "teachable", "teaching", "teachings". Stemming would reduce these to 3 unique values: "teach", "teacher", "teachabl" Once these values were stemmed, there were 45,703 unique words remaining.

Given this set of words and their associated outcome classes, odds-ratios can be computed. First, the words were again filtered so that terms with at least 50 occurrences in the 10,000 profiles would be analyzed. This cutoff was determined so that there was a sufficient frequency for modeling. With this constraint the potential number of keywords was reduced to 3,832 terms. For each of these, the odds-ratio and associated p-value were computed. The p-value tests the hypothesis that the odds of the keyword occurring in either professional group are equal (i.e. 1.0). However, the p-values can easily provide misleading results for two reasons:

  • In isolation, the p-value only relates the the question “Is there a difference in the odds between the two groups?” The more appropriate question is “How much of a difference is there between the groups?” The p-value does not measure the magnitude of the differences.

  • When testing a single (predefined) hypothesis, the false positive rate for a hypothesis test using \(\alpha = 0.05\) is 5%. However, we are conducting 3,832 tests on the same data set which raises the false positive rate exponentially. The false-discovery rate (FDR) p-value correction (Efron and Hastie 2016) was developed for just this type of scenario. This procedure uses the entire distribution of p-values and attenuates the false positive rate. If a particular keyword has an FDR value of 0.30, this implies that the collection of keywords with FDR values less than 0.30 have collective 30% false discovery rate. For this reason, the focus here will be on the FDR values generated using the Benjamini-Hochberg correction (Benjamini and Hochberg 1995).

To characterize these results, a volcano plot is shown in Figure 5.2 where the estimated odds-ratio is shown on the x-axis and the minus log of the FDR values are shown on the y-axis (where larger values indicate higher statistical significance). The size of the points is associated with the number of events found in the sampled data set. Keywords falling in the upper left- and right-rand side would indicate a strong difference between the classes and that is unlikely to be random result. The plot shows far more keywords that have a higher likelihood to be found in the STEM profiles than in the non-STEM professions as more points fall on the upper right-hand side of one on the x-axis. Several of these have extremely high levels of statistical significance with FDR values that are vanishingly small.

Fig. 5.2: A volcano plot of the keyword analysis. Each dot represents a word from the OkCupid essays, and the size of the dot represents the frequency of the occurrence. Words in the upper right are strongly associated with STEM profiles.

As a rough criteria for “importance”“, keywords with odds-ratios of at least 2 (in either direction) and an FDR value less than \(10^{-5}\) will be considered for modeling. This results in 46 keywords:

biotech code commut compani comput
data electron ender engin feynman
firefli geek geeki googl hack
im internet lab law logic
lol math matrix mechan neal
nerd nerdi nurs program programm
robot scienc scientist silicon softwar
solv startup stephenson student teacher
tech techi technic technologi therapist

Of these keywords, only 7 were enriched in the non-STEM profiles: im, lol, student, teacher, law, therapist, nurs. Of the STEM-enriched keywords, many of these make sense (and play to stereotypes). The majority are related to occupation (e.g. engin, startup, and scienc), while others are clearly related to popular geek culture, such as firefli41, stephenson42, and scifi. There are several keywords where we might expect significant overlap, such as tech/technologi or geek/geeki. However, when generating the features for these data, a filter can be used to remove highly correlated predictor values.

Do these keywords have an impact on a model? To compare, the same logistic regression model discussed in Section 3.2.1 will be used here and will include the new set of predictors. The resampling method for comparing models described in Section 3.7 will be used to estimate the change in the resampled area under the ROC curve by adding additional predictors. Recall that the basic logistic regression model used in Section 3.2.2 had a resampled estimate of the ROC AUC of 0.789. To evaluate the keywords above, a set of binary predictors for these values were determined where a value of 1 was assigned if the keyword occurred at all in the text. The addition of 46 model terms increased the AUC to 0.842. Using the methodology discussed in Section 3.7, the 95% confidence interval on this difference was computed to be (0.051, 0.057) and this suggests that the difference is most likely real.

Alternatively, the term frequency of the number of times that the keyword appears in a profile might be more informative. However, in these counts, only 0.8% of the predictors had more than 1 instance of the keyword. The results of a model including the count predictors had the same performance as the model containing the binary variables with an AUC of 0.842.

The strategy shown here for computing and evaluating features from text is fairly simplistic and is not the only approach that could be taken. One other method for determining relevant features is the term-frequency/inverse document frequency (tf-idf) method (Amati and Van Rijsbergen 2002). Here the goal is to find words or terms that are unusual in the data at hand. For example, if the words in this book were processed, there are some that would have high frequency (e.g. predictor, encode, or resample) but are otherwise unusual in most other contexts.

For a word W in document D, the the term frequency tf is the number of times that W is contained in D. However, this should be normalized by how often the word occurs in the current collection of documents. As an example, suppose the term frequency of the word feynman in a specific profile was 2. In the sample of profiles used to derive the features, this word occurs 55 times out of 10,000. The inverse document frequency is usually represented as the log of the ratio, or \(\log_2(10000/55)\) or 7.5. The tf-idf value for feynman in this profile would be \(2\times\log_2(10000/55)\) or 15. However, suppose that in the same profile, the word internet also occurs twice. This word is more prevalent across profiles and is contained at least once 1529 times. The tf-idf value here is 5.4; in this way the same raw count is down-weighted according to its abundance in the overall data set.

One potential issue with td-idf in predictive models is the notion of the “current collection of documents”. This can be computed for the training set but what should be done when a single new sample is being predicted (i.e. there is no collection)? One approach is to use the overall idf values from the training set to weight the term frequency found in new samples.

Another method to measure text is to consider the sentiment of the words. There are curated collections of words with an assigned sentiment values. For example, “horrible” has a fairly negative connotation while “wonderful” is associated with positivity. Words can be assigned qualitative assessments (e.g. “positive”, “neutral”, etc.) or numeric scores where neutrality is given a value of zero. In some problems, sentiment might be a good predictor of the outcome.

Additionally, all of the previous approaches have considered a single word at a time. Sequences of consecutive words can also be considered and might provide additional information. n-grams are terms that are sequences of n consecutive words. One might imagine that the sequence “I hate computers” would be scored differently than the simple term “computer” when predicting whether a profile corresponds to a STEM profession.

5.6 Factors versus Dummy Variables in Tree-Based Models

As previously mentioned, certain types of models have the ability to use categorical data in its natural form (i.e. without conversions to dummy variables). A simple regression tree (Breiman et al. 1984) using the day of the week predictor for the Chicago data resulted in this simple split for prediction:

if day in {Sun, Sat} then ridership = 4.4K
 else ridership = 17.3K

Suppose the day of the week had been converted to dummy variables. What would have occurred? In this case, the model is slightly more complex since it can only create rules as a function of a single dummy variable at a time:

if day = Sun then ridership = 3.84K
  else if day = Sat then ridership = 4.96K
    else ridership = 17.30K

The non-dummy variable model could have resulted in the same structure if it had found that the results for Saturday and Sunday were sufficiently different to merit an extra set of splits. This leads to the question related to using categorical predictors in tree-based models: does it matter how the predictions are encoded?

To answer this question, a series of experiments were conducted. Several classification data sets were used to make the comparison between the different encodings. These are summarized in Table 5.3. Two data sets contained both ordered and unordered factors. As described below, the ordered factors were treated in different ways.

Table 5.3: A summary of the data sets used to evaluate encodings of categorical predictors in tree- and rule-based models.
Attrition Cars Churn German Credit HPC
n 1470 1728 5000 1000 4331
p 30 6 19 20 7
Classes 2 4 2 2 4
Numeric Predictors 16 0 15 7 5
Factor Predictors 14 6 4 13 2
Ordered Factors 7 0 0 1 0
Factor with 3+ Levels 12 7 2 11 3
Factor with 5+ Levels 3 0 1 5 2

For each iteration of the simulation, 75% of the data was used for the training set and 10-fold cross-validation was used to tune the models. When the outcome variable had two levels, the area under the ROC curve was maximized. For the other datasets, the multinomial log-likelihood was optimized. The same number of tuning parameter values were evaluated although, in some cases, the values of these parameters were different due to the number of predictors in the data before and after dummy variable generation. The same resamples and random numbers were used for the models with and without dummy variables. When the data contained a significant class imbalance, the data were downsampled to compensate.

For each data set, models were fit with the data in its original form as well as (unordered) dummy variables. For the two data sets with ordinal data, an additional model was created using ordinal dummy variables (i.e. polynomial contrasts).

Several models were fit to the data sets. Unless otherwise stated, the default software parameters were used for each model.

  • Single CART trees (Breiman et al. 1984) The complexity parameter was chosen using the “one-standard error rule” that is internal to the recursive partitioning algorithm.
  • Bagged CART trees (Breiman 1996). Each model contained 50 constituent trees.
  • Single C5.0 trees and single C5.0 rulesets (Kuhn and Johnson 2013)
  • Single conditional inference trees (Hothorn, Hornik, and Zeileis 2006). A grid of 10 values for the p-value threshold for splitting were evaluated.
  • Boosted CART trees (a.k.a. stochastic gradient boosting, Friedman (2002) ). The models were optimized over the number of trees, the learning rate, the tree depth, and the number of samples required to make additional splits. Twenty-five parameter combinations were evaluate using random search.
  • Boosted C5.0 trees. These were tuned for the number of iterations.
  • Boosted C5.0 rules. These were also tuned for the number of iterations.
  • Random forests using CART trees (Breiman 2001). Each forest contained 1,500 trees and the number of variables selected for a split was tuned over a grid of 10 values.
  • Random forests using conditional (unbiased) inference trees (Strobl et al. 2007). Each forest contained 100 trees and the number of variables selected for a split was tuned over a grid of 10 values.

For each model a variety of different performance metrics were estimated using resampling as well as the total time to train and tune the model. The programs used in the simulation are contained in the GitHub repository topepo/dummies-vs-factors.

For the three data sets with two classes, Figure 5.3 shows a summary of the results for the area under the ROC curve. In the plot, the percent difference in performance is calculated using

\[ \%Difference = \frac{Factor - Dummy}{Factor}\times 100 \]

In this way, positive values indicate that the factor encodings have better performance. The image shows the median in addition to the lower and upper 5% percentiles of the distribution. The gray dotted line indicates no difference between the encodings.

A comparison of encodings for the area under the ROC curve.

Fig. 5.3: A comparison of encodings for the area under the ROC curve.

The results of these simulations shows that, for these data sets, there is no real difference in the area under the ROC curve between the encoding methods. This also appears to be true when simple factor encodings are compared to dummy variables generated from polynomial contrasts for ordered predictors43. Of the ten models, there are only two cases out of 40 scenarios where the mainstream of the distributions did not cover zero. Stochastic gradient boosting and bagged CART trees are both ensemble methods and these showed a 2%-4% drop in the ROC curve when using factors instead of dummy variables for a single data set.

Another metric, overall accuracy, can also be assessed. These results are shown in Figure 5.4 where all of the models can be taken into account. In this case, the results are mixed. When comparing factors to unordered dummy variables, two of the models show differences in encodings. The churn data shows similar results to the ROC curve metrics. The car evaluation data demonstrates a nearly uniform effect where factor encodings do better than dummy variables. Recall that in the car data, which has four classes, all of the predictors are categorical. For this reason, it is likely to show the most effect of all of the data sets.

In the case of dummy variables generated using polynomial contrasts, neither of the data sets show a difference between the two encodings. However, the car evaluation data shows a pattern where the factor encodings had no difference compared to polynomial contrasts but when compared to unordered dummy variables, the factor encoding is superior. This indicates that the underlying trend in the data follows a polynomial pattern.

In terms of performance, it appears that differences between the two encodings are rare (but can occur). One might infer that, since the car data contains all categorical variables, that this situation would be a good indicator for when to use factors instead of dummy variables. However, two of the data sets (Attrition and German Credit), have a high proportion of categorical predictors and show no difference. In some data sets, the effect of the encoding would depend on whether the categorical predictor(s) are important to the outcome and in what way.

A comparison of encodings for the accuracy.

Fig. 5.4: A comparison of encodings for the accuracy.

In summary, while few differences were seen, it is very difficult to predict when a difference will occur.

However, one other statistic was computed for each of the simulations: the time to train the models. Figure 5.5 shows the speed-up of using factors above and beyond dummy variables (i.e. a value of 2.5 indicates that dummy variable models are two and a half times slower than factor encoding models). Here, there is very strong trend that factor-based models are more efficiently trained than their dummy variable counterparts. The reason for this is likely to be that the expanded number of predictors (caused by generating dummy variables) requires more computational time than the method for determining the optimal split of factor levels. The exceptions to this trend are the models using conditional inference trees.

A comparison of encodings for time to train the model. Large values indicate that the factor encoding took less time to train than the model with dummy variables.

Fig. 5.5: A comparison of encodings for time to train the model. Large values indicate that the factor encoding took less time to train than the model with dummy variables.

For a guideline, we suggest using the predictors without converting to dummy variables and, if the model appears promising, to also try refitting using dummy variables.

5.7 Computing

R programs for reproducing these analysis can be found at https://github.com/topepo/TBD.

Feature Selection

Haase, Richard F. 2011. Multivariate General Linear Models. Sage.

Timm, Neil H, and James E Carlson. 1975. “Analysis of Variance Through Full Rank Models.” Multivariate Behavioral Research Monographs. Society of Multivariate Experimental Psychology.

Preneel, Bart. 2010. “Cryptographic Hash Functions: Theory and Practice.” In ICICS, 1–3.

Weinberger, Kilian, Anirban Dasgupta, John Langford, Alex Smola, and Josh Attenberg. 2009. “Feature Hashing for Large Scale Multitask Learning.” In Proceedings of the 26th Annual International Conference on Machine Learning, 1113–20. ACM.

Box, George EP, William Gordon Hunter, and J Stuart Hunter. 2005. Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building. Wiley.

Agresti, Alan. 2012. Categorical Data Analysis. Wiley-Interscience.

Christopher, D Manning, Raghavan Prabhakar, and Schutze Hinrich. 2008. Introduction to Information Retrieval. Cambridge University Press.

Silge, Julia, and David Robinson. 2017. Text Mining with R: A Tidy Approach. O’Reilly.

Lewis, David D, Yiming Yang, Tony G Rose, and Fan Li. 2004. “Rcv1: A New Benchmark Collection for Text Categorization Research.” Journal of Machine Learning Research 5:361–97.

Willett, Peter. 2006. “The Porter Stemming Algorithm: Then and Now.” Program 40 (3):219–23.

Efron, Bradley, and Trevor Hastie. 2016. Computer Age Statistical Inference. Cambridge University Press.

Benjamini, Yoav, and Yosef Hochberg. 1995. “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” Journal of the Royal Statistical Society. Series B (Methodological). JSTOR, 289–300.

Amati, Gianni, and Cornelis Joost Van Rijsbergen. 2002. “Probabilistic Models of Information Retrieval Based on Measuring the Divergence from Randomness.” ACM Transactions on Information Systems 20 (4):357–89.

Breiman, L., J. Friedman, R. Olshen, and C. Stone. 1984. Classification and Regression Trees. New York: Chapman; Hall.

Breiman, Leo. 1996. “Bagging Predictors.” Machine Learning 24 (2). Springer:123–40.

Kuhn, Max, and Kjell Johnson. 2013. Applied Predictive Modeling. Vol. 26. Springer.

Hothorn, Torsten, Kurt Hornik, and Achim Zeileis. 2006. “Unbiased Recursive Partitioning: A Conditional Inference Framework.” Journal of Computational and Graphical Statistics 15 (3). Taylor & Francis:651–74.

Friedman, Jerome H. 2002. “Stochastic Gradient Boosting.” Computational Statistics & Data Analysis 38 (4). Elsevier:367–78.

Breiman, Leo. 2001. “Random Forests.” Machine Learning 45 (1). Springer:5–32.

Strobl, Carolin, Anne-Laure Boulesteix, Achim Zeileis, and Torsten Hothorn. 2007. “Bias in Random Forest Variable Importance Measures: Illustrations, Sources and a Solution.” BMC Bioinformatics 8 (1):25.

  1. These computations use the “MurmurHash3” hash found at https://github.com/aappleby/smhasher and implemented in the FeatureHashing R package.

  2. Traditional analysis of alias structures could be applied to this problem. However, more research would be needed since those methods are usually applied to balanced designs where the values of the predictors have the same prevalence. Also, as previously mentioned, they tend to focus on aliasing between multiple variables. In any case, there is much room for improvement on this front.

  3. https://en.wikipedia.org/wiki/Firefly_(TV_series)

  4. https://en.wikipedia.org/wiki/Neal_Stephenson

  5. Note that many of the non-ensemble methods, such as C5.0 trees/rules, CART, and conditional inference trees, show a significant amount of variation. This is due to the fact that they are unstable models with high variance.