5 Encoding Categorical Predictors
Categorical or nominal predictors are those that contain qualitative data. For the OkCupid data, examples include education (e.g., high school, two-year college, college, etc.) and diet (e.g., anything, vegan, vegetarian, etc.). In the Ames data, the type of house and neighborhood are predictors that have no numeric scale. The ZIP code, while numeric, also qualifies as a qualitative predictor because its numeric values have no continuous meaning. Simply put, there is no reason that a ZIP code of 21212 is 14,827 "more" than a ZIP code of 06385.
Categorical predictors can also be derived from unstructured or open text. The OkCupid data contain sets of optional essays that describe individuals' interests and other personal information. Predictive models that depend on numeric inputs cannot directly handle open text fields. Instead, these information-rich data need to be processed prior to presenting the information to a model. Text information can be processed in different ways. For example, if keywords are extracted from text to use as predictors, should these include single words or strings of words? Should words be stemmed so that a root word is used (e.g., "comput" being short for "computer", "computers", "computing", "computed", etc.), or should a regular expression pattern-matching approach be used (e.g., ^comput)?
Simple categorical variables can also be classified as ordered or unordered. A variable with values “Bad”, “Good”, and “Better” shows a clear progression of values. While the difference between these categories may not be precisely numerically quantifiable, there is a meaningful ordering. To contrast, consider another variable that takes values of “French”, “Indian”, or “Peruvian”. These categories have no meaningful ordering. Ordered and unordered factors might require different approaches for including the embedded information in a model.
As with other preprocessing steps, the approach to encoding the predictors depends on the type of model. A large majority of models require that all predictors be numeric. There are, however, some exceptions. Algorithms for tree-based models can naturally handle splits on either numeric or categorical predictors. These algorithms employ a series of if/then statements that sequentially split the data into groups. For example, in the Chicago data, the day of the week is a strong predictor, and a tree-based model would likely include a model component such as
if day in {Sun, Sat} then ridership = 4.4K
else ridership = 17.3K
As another example, a naive Bayes model can create a cross-tabulation between a categorical predictor and the outcome class, and this frequency distribution is factored into the model's probability calculations. In this situation, the categories can be processed in their natural format. The final section of this chapter investigates how categorical predictors interact with tree-based models.
Tree-based and naive Bayes models are exceptions; most models require that the predictors take numeric form. This chapter focuses primarily on methods that encode categorical data as numeric values.
Many of the analyses in this chapter use the OkCupid data that were introduced in Section 3.1 and discussed in the previous chapter. In this chapter, we will explore specific variables in more detail. One issue in doing this is related to overfitting. Compared to other data sets discussed in this book, the OkCupid data set contains a large number of potential categorical predictors, many of which have a low prevalence. It would be problematic to take the entire training set, examine specific trends and variables, and then add features to the model based on these analyses. Even though we cross-validate the models on these data, we may overfit and find predictive relationships that are not generalizable to other data. The data set is not small; the training set consists of 38,809 data points. However, for the analyses here, a subsample of 5,000 STEM profiles and 5,000 non-STEM profiles was selected from the training set for the purpose of discovering new predictors for the models.
5.1 Creating Dummy Variables for Unordered Categories
The most basic approach to representing categorical values as numeric data is to create dummy variables. These are artificial numeric variables that capture some aspect of one (or more) of the categorical values. There are many methods for doing this and, to illustrate, let’s first consider a simple example for the day of the week. If we take the seven possible values and convert them into binary dummy variables, the mathematical function required to make the translation is often referred to as a contrast or parameterization function. An example of a contrast function is called the “reference cell” or “treatment” contrast where one of the values of the predictor is left unaccounted for in the resulting dummy variables. Using Sunday as the reference cell, the contrast function would create six dummy variables:
        Mon  Tues  Wed  Thurs  Fri  Sat
Sun      0    0     0     0     0    0
Mon      1    0     0     0     0    0
Tues     0    1     0     0     0    0
Wed      0    0     1     0     0    0
Thurs    0    0     0     1     0    0
Fri      0    0     0     0     1    0
Sat      0    0     0     0     0    1
These six numeric predictors would take the place of the original categorical variable.
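To make the contrast function concrete, here is a minimal Python sketch of the reference-cell encoding; the function name and use of plain lists are our own illustration, not a specific library's API:

```python
DAYS = ["Sun", "Mon", "Tues", "Wed", "Thurs", "Fri", "Sat"]

def treatment_contrast(value, levels, reference=None):
    """Reference-cell ("treatment") contrast: C-1 binary dummy variables,
    with the reference level encoded as all zeros."""
    if reference is None:
        reference = levels[0]
    non_reference = [lvl for lvl in levels if lvl != reference]
    return [int(value == lvl) for lvl in non_reference]

print(treatment_contrast("Sun", DAYS))  # reference cell: [0, 0, 0, 0, 0, 0]
print(treatment_contrast("Mon", DAYS))  # [1, 0, 0, 0, 0, 0]
```

In practice, modeling software generates these columns automatically; the sketch simply shows that the reference level is recoverable as the all-zeros row.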
Why only six? There are two related reasons. First, if the values of the six dummy variables are known, then the seventh can be directly inferred. The second reason is more technical. When fitting linear models, a design matrix \(X\) is created. When the model has an intercept, an additional initial column of ones for all rows is included. Estimating the parameters for a linear model (as well as other similar models) involves inverting the matrix \((X'X)\). If the model includes an intercept and contains dummy variables for all seven days, then the seven day columns would add up (row-wise) to the intercept column, and this linear combination would prevent the matrix inverse from being computed (since the matrix is singular). When this occurs, the design matrix is said to be less than full rank or overdetermined. When there are \(C\) possible values of the predictor and only \(C-1\) dummy variables are used, the matrix inverse can be computed and the contrast method is said to be a full-rank parameterization (Haase 2011).
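The linear dependence described above is easy to verify directly; a small sketch (the day labels follow the Chicago example):

```python
# With an intercept plus all seven day-of-week dummy variables, the dummy
# columns sum row-wise to the intercept column, so X is rank deficient.
days = ["Sun", "Mon", "Tues", "Wed", "Thurs", "Fri", "Sat"]
X = [[1] + [int(day == d) for d in days] for day in days]

for row in X:
    intercept, dummies = row[0], row[1:]
    assert sum(dummies) == intercept  # exact linear combination -> singular X'X

print("all", len(X), "rows satisfy the dependency")
```

Dropping any one dummy column breaks this dependency, which is why the full-rank parameterization uses \(C-1\) columns.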
How would we interpret these dummy variables? That depends on the type of model being used, but consider a linear model for the Chicago transit data that uses only the day of the week with the reference cell parameterization above. Using the training set to fit the model, the intercept estimates the mean of the reference cell, which is the average number of Sunday riders in the training set, estimated to be 3.84K people. The second model parameter, for Monday, is estimated to be 12.61K. In the reference cell model, the dummy variables represent the mean value above and beyond the reference cell mean. In this case, the estimate indicates that there were 12.61K more riders on Monday than on Sunday. The overall estimate of Monday ridership adds the estimates from the intercept and the dummy variable (16.45K rides).
When there is more than one categorical predictor, the reference cell becomes multidimensional. Suppose there were a weather predictor with only a few values: "clear", "cloudy", "rain", and "snow", with "clear" as the reference cell. If this variable were included in the model, the intercept would correspond to the mean of the Sundays with a clear sky. The interpretation of each set of dummy variables does not change. The average ridership for a cloudy Monday would augment the average clear-Sunday ridership with the average incremental effect of cloudy and the average incremental effect of Monday.
There are other contrast functions for dummy variables. The “cell means” parameterization (Timm and Carlson 1975) would create a dummy variable for each day of the week and would not include an intercept to avoid the problem of singularity of the design matrix. In this case, the estimates for each dummy variable would correspond to the average value for that category. For the Chicago data, the parameter estimate for the Monday dummy variable would simply be 16.45K.
There are several other contrast methods for constructing dummy variables. Another, based on polynomial functions, is discussed below for ordinal data.
5.2 Encoding Predictors with Many Categories
If there are \(C\) categories, what happens when \(C\) becomes very large? For example, ZIP code in the United States may be an important predictor for outcomes that are affected by a geographic component. There are more than 40K possible ZIP codes and, depending on how the data are collected, this might produce an overabundance of dummy variables (relative to the number of data points). As mentioned in the previous section, this can cause the data matrix to be overdetermined and restrict the use of certain models. Also, ZIP codes in highly populated areas may have a higher rate of occurrence in the data, leading to a “long tail” of locations that are infrequently observed.
One potential issue is that resampling might exclude some of the rarer categories from the analysis set. This would lead to dummy variable columns in the data that contain all zeros and, for many models, this would become a numerical issue that causes an error. Moreover, the model will not be able to provide a relevant prediction for new samples that contain this predictor value. When a predictor contains a single value, we call this a zero-variance predictor because there truly is no variation displayed by the predictor.
The first way to handle this issue is to create the full set of dummy variables and simply remove the zero-variance predictors. This is a simple and effective approach, but it may be difficult to know a priori which terms will be in the model. In other words, during resampling, there may be a different number of model parameters across resamples. This can be a good side effect since it captures the variance caused by omitting rarely occurring values and propagates this noise into the resampling estimates of performance.
Prior to producing dummy variables, one might consider determining which (if any) of these variables are near-zero variance predictors, or have the potential to have near-zero variance during the resampling process. These are predictors that have few unique values (such as the two values of binary dummy variables) and occur infrequently in the data. For the training set, we would consider the ratio of the frequency of the most commonly occurring value to that of the second most commonly occurring value. For dummy variables, this is simply the ratio of the numbers of ones and zeros. Suppose that a dummy variable has 990 values of zero and 10 ones. The ratio of these frequencies, 99, indicates that it could easily be converted into all zeros during resampling. We suggest a rough cutoff of 19 to declare such a variable "too rare", although the context may raise or lower this bar for different problems.
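The frequency-ratio check is straightforward to compute; a sketch (the cutoff of 19 comes from the text, while the function name is ours):

```python
from collections import Counter

def freq_ratio(values):
    """Ratio of the most common value's count to the second most common."""
    top = Counter(values).most_common(2)
    if len(top) < 2:
        return float("inf")  # a single unique value: a zero-variance predictor
    return top[0][1] / top[1][1]

dummy = [0] * 990 + [1] * 10
print(freq_ratio(dummy))  # 99.0 -- well above a cutoff of 19, so "too rare"
```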
Although near-zero variance predictors likely contain little valuable predictive information, we may not desire to filter these out. One way to avoid filtering is to redefine the predictor's categories prior to creating dummy variables. Instead, we can create an "other" category that pools the rarely occurring categories, assuming that such a pooling is sensible. A cutoff based on the training set frequency can be specified to indicate which categories should be combined. Again, this should occur during resampling so that the final estimates reflect the variability in which predictor values are included in the "other" group.
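Pooling rare categories can be sketched as follows; the 5% cutoff and the city counts are made up purely for illustration:

```python
from collections import Counter

def pool_rare(values, min_freq=0.05, other="other"):
    """Relabel categories whose training-set frequency falls below a cutoff."""
    counts = Counter(values)
    n = len(values)
    keep = {k for k, c in counts.items() if c / n >= min_freq}
    return [v if v in keep else other for v in values]

cities = ["berkeley"] * 60 + ["san mateo"] * 35 + ["martinez"] * 4 + ["benicia"]
print(sorted(set(pool_rare(cities))))  # ['berkeley', 'other', 'san mateo']
```

Note that the cutoff would be estimated from the analysis set inside each resample, not fixed once on the full training set.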
Another way to combine categories is to use a hashing function (or hashes). Hashes are used to map one set of values to another set of values and are typically used in databases and cryptography (Preneel 2010). The original values are called the keys and are typically mapped to a smaller set of artificial hash values. In our context, a potentially large number of predictor categories are the keys, and we would like to represent them using a smaller number of categories (i.e., the hashes). The number of possible hashes is set by the user and, for numerical purposes, is a power of 2. Some computationally interesting aspects of hash functions are:

- The only data required are the value being hashed and the resulting number of hashes. This is different from the previously described contrast function using a reference cell.
- The translation process is completely deterministic.
- In traditional applications, it is important that the number of keys mapped to each hash is relatively uniform. This is important when hashes are used in cryptography since it increases the difficulty of guessing the result. As discussed below, this might not be a good characteristic for data analysis.
- Hash functions are unidirectional; once the hash values are created, there is no way of knowing the original values. If there is a known and finite set of original values, a table can be created to do the translation but, otherwise, the keys are indeterminable when only the hash value is known.
- There is no free lunch when using this procedure; some of the original categories will be mapped to the same hash value (called a "collision"). The number of collisions will be largely determined by the number of features that are produced.
When using hashes to create dummy variables, the procedure is called “feature hashing” or the “hash trick” (Weinberger et al. 2009). Since there are many different hashing functions, there are different methods for producing the table that maps the original set to the reduced set of hashes.
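To make the mapping concrete, here is a minimal Python sketch of the hash trick. MD5 is used here only as a convenient, stable integer hash; actual feature-hashing implementations typically use faster functions such as MurmurHash:

```python
import hashlib

def hash_feature(key, num_features=16):
    """Map a category string to one of `num_features` dummy variable columns,
    using the (integer mod num_features) + 1 convention from the text."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    integer = int.from_bytes(digest[:4], "little", signed=True)
    return (integer % num_features) + 1  # Python's % is non-negative here

for city in ["berkeley", "martinez", "mountain view"]:
    print(city, "-> column", hash_feature(city))
```

Because the mapping depends only on the key and the number of features, any string, including one never seen in training, lands in a valid column; collisions occur whenever two keys share a remainder.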
As a simple example, the locations from the OkCupid data were hashed^{40}. Table 5.1 shows 7 of these cities to illustrate the details.
Location             Hash Integer Value  16 cols  256 cols
belvedere tiburon             582753783        8       248
berkeley                     1166288024        9       153
martinez                     -157684639        2        98
mountain view               -1267876914       15       207
san leandro                  1219729949       14        30
san mateo                     986716290        3       131
south san francisco          -373608504        9       201
The hash value is an integer derived solely from each city's text value. In order to convert this to a new feature in the data set, modular arithmetic is used. Suppose that 16 features are desired. To create a mapping from the integer to one of the 16 feature columns, the remainder after integer division is used via the formula \(column = (integer \bmod 16) + 1\), with the remainder taken to be non-negative. For example, the string "mountain view" hashes to \(-1267876914\), and \((-1267876914 \bmod 16) + 1 = 15\), so this city is mapped to the fifteenth dummy variable feature. The last column above shows the assignments for the cities when 256 features are requested; \(\bmod\ 256\) has more possible remainders than \(\bmod\ 16\), so more features are possible. With 16 features, the dummy variable assignments for a selection of locations in the data are:
                      1   2   3   4   5   6   9  12  13  14  15  16
alameda               1   0   0   0   1   0   0   0   0   0   0   0
belmont               1   0   0   0   0   0   0   0   0   0   1   0
benicia               1   0   0   1   0   0   0   0   0   0   0   0
berkeley              1   0   0   0   0   0   1   0   0   0   0   0
castro valley         1   0   0   0   0   0   0   1   0   0   0   0
daly city             1   0   0   0   0   0   0   0   0   0   1   0
emeryville            1   0   0   0   0   0   0   0   1   0   0   0
fairfax               1   1   0   0   0   0   0   0   0   0   0   0
martinez              1   1   0   0   0   0   0   0   0   0   0   0
menlo park            1   0   0   0   0   0   0   0   0   1   0   0
mountain view         1   0   0   0   0   0   0   0   0   0   1   0
oakland               1   0   0   0   0   0   0   1   0   0   0   0
other                 1   0   0   0   0   0   1   0   0   0   0   0
palo alto             1   1   0   0   0   0   0   0   0   0   0   0
san francisco         1   0   0   0   0   1   0   0   0   0   0   0
san leandro           1   0   0   0   0   0   0   0   0   1   0   0
san mateo             1   0   1   0   0   0   0   0   0   0   0   0
san rafael            1   0   0   0   0   0   0   0   0   0   0   1
south san francisco   1   0   0   0   0   0   1   0   0   0   0   0
walnut creek          1   1   0   0   0   0   0   0   0   0   0   0
There are several characteristics of this table to notice. First, there are 4 hashes (columns) that are not shown in this table since they contain all zeros. This occurs because none of the hash integer values had those specific mod 16 remainders. This could be a problem for predictors that might take string values not seen in the training data, since such values could be assigned to one of these empty columns. In this case, it would be wise to create an "other" category to ensure that new strings have a hashed representation. Next, notice that there were six hashes with no collisions (i.e., a single 1 in their column). But several columns exhibit collisions. An example of a collision is hash number 14, which encodes both Menlo Park and San Leandro. In statistical terms, these two categories are said to be aliased or confounded, meaning that a parameter estimated from this hash cannot isolate the effect of either city. Alias structures have long been studied in the field of statistical experimental design, although usually in the context of aliasing between variables rather than within variables (Box, Hunter, and Hunter 2005). Regardless, from a statistical perspective, minimizing the amount and degree of aliasing is an important aspect of a contrast scheme.
Some hash functions can also be signed, meaning that instead of producing a binary indicator for the new feature, the possible values are \(-1\), 0, or \(+1\). A zero would indicate that the category is not associated with that specific feature and the \(\pm 1\) values can indicate different values of the original data. Suppose that two categories map to the same hash value. A binary hash would result in a collision, but a signed hash could avoid this by encoding one category as \(+1\) and the other as \(-1\). Here are the same data as above with a signed hash of the same size:
                      1   2   3   4   5   6   9  12  13  14  15  16
alameda               1   0   0   0   1   0   0   0   0   0   0   0
belmont               1   0   0   0   0   0   0   0   0   0   1   0
benicia               1   0   0   1   0   0   0   0   0   0   0   0
berkeley              1   0   0   0   0   0   1   0   0   0   0   0
castro valley         1   0   0   0   0   0   0   1   0   0   0   0
daly city             1   0   0   0   0   0   0   0   0   0   1   0
emeryville            1   0   0   0   0   0   0   0   1   0   0   0
fairfax               1   1   0   0   0   0   0   0   0   0   0   0
martinez              1   1   0   0   0   0   0   0   0   0   0   0
menlo park            1   0   0   0   0   0   0   0   0   1   0   0
mountain view         1   0   0   0   0   0   0   0   0   0   1   0
oakland               1   0   0   0   0   0   0   1   0   0   0   0
other                 1   0   0   0   0   0   1   0   0   0   0   0
palo alto             1   1   0   0   0   0   0   0   0   0   0   0
san francisco         1   0   0   0   0   1   0   0   0   0   0   0
san leandro           1   0   0   0   0   0   0   0   0  -1   0   0
san mateo             1   0   1   0   0   0   0   0   0   0   0   0
san rafael            1   0   0   0   0   0   0   0   0   0   0   1
south san francisco   1   0   0   0   0   0   1   0   0   0   0   0
walnut creek          1   1   0   0   0   0   0   0   0   0   0   0
Notice that the collision in hash 14 is now gone since a value of \(+1\) corresponds to Menlo Park and San Leandro is encoded as \(-1\). For a linear regression model, the coefficient for the 14th hash now represents the positive or negative shift in the response due to Menlo Park or San Leandro, respectively.
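A signed hash can be sketched by drawing the sign from a second, independent piece of the digest; the scheme below is illustrative and not any particular library's exact implementation:

```python
import hashlib

def signed_hash_feature(key, num_features=16):
    """Return (column, sign): one part of the digest picks the column,
    another picks +1 or -1, so two keys colliding on a column may still
    be distinguished by opposite signs."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    column = (int.from_bytes(digest[:4], "little") % num_features) + 1
    sign = 1 if digest[4] % 2 == 0 else -1
    return column, sign

col, sign = signed_hash_feature("san leandro")
print("san leandro -> column", col, "sign", sign)
```

Because the sign is itself hashed, two colliding keys receive opposite signs only about half the time; the benefit is probabilistic rather than guaranteed.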
What level of aliasing should we expect? To investigate, data were simulated so that there were \(10^4\) predictor categories, each consisting of 20 unique, randomly generated character values. Binary and signed hashes of various sizes were generated and the collision rate was determined. This simulation was repeated 10 times and the average percent collision was calculated. Figure 5.1 shows the results, where the y-axis is the rate of any level of collision in the features. As expected, the signed values are associated with fewer collisions than the binary encoding. The green line corresponds to the full set of dummy variables (without using hashes). For these data, using signed features would give the best aliasing results, but the percentage of collisions is still fairly high.
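The simulation can be sketched as follows; Python's built-in string hash stands in for a real hashing function, and the parameters are smaller than the text's for speed:

```python
import random
import string

def collision_rate(num_categories, num_features, seed=0):
    """Fraction of keys that share a feature column with at least one other key."""
    rng = random.Random(seed)
    keys = set()
    while len(keys) < num_categories:
        keys.add("".join(rng.choices(string.ascii_lowercase, k=20)))
    columns = {}
    for key in keys:
        columns.setdefault(hash(key) % num_features, []).append(key)
    collided = sum(len(v) for v in columns.values() if len(v) > 1)
    return collided / num_categories

print(round(collision_rate(1000, 256), 2))      # nearly all keys collide
print(round(collision_rate(1000, 2 ** 20), 2))  # far fewer collisions
```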
Because of how the new features are created, feature hashing is oblivious to the concept of predictor aliasing. Uniformity of the hash function is an important characteristic when hashes are used in their typical applications. However, it may be a liability in data analysis. The reduction of aliasing is an important concept in statistics; we would like to have the most specific representations of categories as often as possible for a few different reasons:
- Less aliasing allows for better interpretation of the results. If a predictor has a significant impact on the model, it is probably a good idea to understand why. When multiple categories are aliased due to a collision, untangling the true effect may be very difficult, although the collision structure may at least help narrow down which categories are affecting the response.
- Categories involved in collisions are not related in any meaningful way. For example, San Leandro and Menlo Park were not aliased because they share a particular similarity or geographic closeness. Because of the arbitrary nature of the collisions, it is possible to have different categories whose true underlying effects are counter to one another. This might have the effect of negating the impact of the hashed feature.
- Hashing functions have no notion of the probability that each key will occur. As such, it is conceivable that a category that occurs with great frequency is aliased with one that is rare. In this case, the more abundant value will have a much larger influence on the effect of that hashing feature.
Although we are not aware of any statistically conscious competitor to feature hashing, there are some characteristics that a contrast function should have when used with a predictor containing a large number of categories^{41}. For example, if there is the capacity to leave some categories unaliased with any others, the most frequently occurring values should be chosen for these roles. The reason for this suggestion is that it would generate a high degree of specificity for the categories that are most likely to be exposed to the contrast function. One side effect of this strategy would be to alias less frequently occurring values to one another. This might be beneficial since it reduces the sparsity of those dummy variables. For example, if two categories that each account for 5% of the training set values are combined, the resulting dummy variable has a twofold increase in the number of ones. This would reduce the chances of being filtered out during resampling and might also lessen the potential effect that such a near-zero variance predictor might have on the analysis.
5.3 Approaches for Novel Categories
Suppose that we have built a model to predict the probability that an individual works in a STEM profession and that this model depends on geographic location (e.g., city). The model will be able to predict the probability of a STEM profession if a new individual lives in one of the 20 cities. But what happens to the model prediction when a new individual lives in a city that is not represented in the original data? If the model is solely based on dummy variables, then it will not have seen this information and will not be able to generate a prediction.
If there is a possibility of encountering a new category in the future, one strategy would be to use the previously mentioned “other” category to capture new values. While this approach may not be the most effective at extracting predictive information relative to the response for this specific category, it does enable the original model to be used without completely refitting it. This approach can also be used with feature hashing since the new categories are converted into a unique “other” category that is hashed in the same manner as the rest of the predictor values; however, we do need to ensure that the “other” category is present in the training/testing data. Alternatively we could ensure that the “other” category collides with another hashed category so that the model can be used to predict a new sample. More approaches to novel categories are given in the next section.
Note that the concept of a predictor with novel categories is not an issue in the training/testing phase of a model. In this phase the categories for all predictors in the existing data are known at the time of modeling. If a specific predictor category is present in only the training set (or vice versa), a dummy variable can still be created for both data sets, although it will be a zero-variance predictor in one of them.
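One concrete way to guard against novel values is to fold them into the "other" level at prediction time; the names below are illustrative:

```python
def encode_with_other(value, known_levels):
    """Dummy-encode a value against the training-set levels plus an explicit
    'other' level, so unseen categories still get a valid representation."""
    level = value if value in known_levels else "other"
    return {lvl: int(lvl == level) for lvl in list(known_levels) + ["other"]}

known = ["berkeley", "martinez", "san mateo"]
print(encode_with_other("berkeley", known)["berkeley"])  # 1
print(encode_with_other("tokyo", known)["other"])        # 1: novel city
```

For this to work, at least some training samples must be mapped to "other" so the fitted model has an estimate for that column.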
5.4 Supervised Encoding Methods
There are several methods of encoding categorical predictors to numeric columns using the outcome data as a guide (so that they are supervised methods). These techniques are well suited to cases where the predictor has many possible values or when new levels appear after model training.
The first method is a simple translation that is sometimes called effect or likelihood encoding (Micci-Barreca 2001; Zumel and Mount 2016). In essence, the effect of the factor level on the outcome is measured and this effect is used as the numeric encoding. For example, for the Ames housing data, we might calculate the mean or median sale price of a house for each neighborhood from the training data and use this statistic to represent the factor level in the model. There are a variety of ways to estimate the effects, and these will be discussed below.
For classification problems, a simple logistic regression model can be used to measure the effect between the categorical outcome and the categorical predictor. If the outcome event occurs with rate \(p\), the odds of that event are defined as \(p/(1-p)\). As an example, with the OkC data, the rate of STEM profiles in Mountain View, California is 0.51, so the odds would be about 1.03. Logistic regression models the log-odds of the outcome as a function of the predictors. If we only include a single categorical predictor in the model, we can estimate the log-odds for each predictor value and use this as the encoding. Using the same locations previously shown in Table 5.1, the simple effect encodings for the OkC data are shown in Table 5.2 under the heading of "Raw".
Location             Rate      n     Raw  Shrunk  Feat 1  Feat 2  Feat 3
belvedere tiburon    0.081    37  -2.428  -2.107   0.007   0.050   0.000
berkeley             0.164  2729  -1.630  -1.635   0.122   0.051   0.052
martinez             0.074   204  -2.534  -2.393   0.010   0.092   0.049
mountain view        0.508   264   0.030  -0.065   0.080   0.325   0.002
san leandro          0.131   427  -1.891  -1.885   0.056   0.048   0.014
san mateo            0.292   861  -0.888  -0.907   0.143   0.285   0.015
south san francisco  0.181   254  -1.509  -1.536   0.006   0.036   0.034
<new location>                           -1.820    0.050   0.039   0.018
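The "Raw" column is just the within-level log-odds; a sketch follows, where the 0.5 continuity correction is our own simple guard against infinite values rather than the estimation method used for the table:

```python
import math

def raw_log_odds(categories, outcomes, eps=0.5):
    """Effect (likelihood) encoding: log-odds of the event for each level."""
    stats = {}
    for cat, y in zip(categories, outcomes):
        events, total = stats.get(cat, (0, 0))
        stats[cat] = (events + y, total + 1)
    return {cat: math.log((events + eps) / (total - events + eps))
            for cat, (events, total) in stats.items()}

# 134 STEM profiles out of 264 for one location, as an illustration
cats = ["mountain view"] * 264
ys = [1] * 134 + [0] * 130
print(round(raw_log_odds(cats, ys)["mountain view"], 2))  # about 0.03
```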
As previously mentioned, there are different methods for estimating the effects. A single generalized linear model (e.g., linear or logistic regression) can be used. While very fast, this approach has drawbacks. For example, what happens when a factor level contains only a single outcome class? In theory, the log-odds should be infinite in the appropriate direction but, numerically, it is usually capped at a large (and inaccurate) value.
One way around this issue is to use some type of shrinkage method. For example, the overall log-odds can be determined and, if the quality of the data within a factor level is poor, that level's effect estimate can be biased towards an overall estimate that disregards the levels of the predictor. "Poor quality" could be due to a small sample size or, for numeric outcomes, a large variance within the data for that level. Shrinkage methods can also move extreme estimates towards the middle of the distribution. For example, a well-replicated factor level might still have an extreme mean or log-odds (as will be seen for the Mountain View data below).
A common method for shrinking parameter estimates is Bayesian analysis (McElreath 2015). In this case, before seeing the data, we can use expert judgment to specify a prior distribution for the estimates (e.g., the means or log-odds). The prior is a theoretical distribution that represents the overall distribution of effects. Almost any distribution can be used. If there is not a strong belief as to what this distribution should be, it may be enough to focus on its shape. If a bell-shaped distribution seems reasonable but little more is known, a very diffuse or wide prior can be used so that it is not overly opinionated. Bayesian methods take the observed data and blend them with the prior distribution to produce a posterior distribution that is a combination of the two. For categorical predictor values with poor data, the posterior estimate is shrunken closer to the center of the prior distribution. This can also occur when the raw estimates are relatively extreme^{42}.
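A full Bayesian fit is beyond a short example, but the qualitative behavior can be sketched with a simple precision-weighted blend toward an overall estimate; the weighting constant below is arbitrary and purely illustrative, a stand-in for the posterior computation:

```python
import math

def shrunken_log_odds(events, total, overall_log_odds, prior_weight=50):
    """Blend a level's raw log-odds with an overall estimate; levels with
    little data are pulled harder toward the overall value."""
    raw = math.log((events + 0.5) / (total - events + 0.5))
    w = total / (total + prior_weight)
    return w * raw + (1 - w) * overall_log_odds

overall = -1.8  # a rough overall log-odds, for illustration
small = shrunken_log_odds(3, 37, overall)      # sparse level: strong pull
large = shrunken_log_odds(447, 2729, overall)  # well-replicated: little pull
print(round(small, 2), round(large, 2))
```

The sparse level moves noticeably toward the overall value, while the well-replicated level barely changes, mirroring the behavior of the posterior estimates in Table 5.2.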
For the OkC data, a normal prior distribution with a large standard deviation (\(\sigma = 10\)) was used for the log-odds.
Table 5.2 shows the results. For most locations, the raw and shrunken estimates were very similar. For Belvedere/Tiburon, there was a relatively small sample size (and a relatively small STEM rate), and the estimate was shrunken towards the center of the prior. Note that there is a row for "<new location>". When a new location is given to the Bayesian model, the procedure used here estimates its effect as the most likely value using the mean of the posterior distribution (a value of \(-1.82\) for these data). A plot of the raw and shrunken effect estimates is also shown in Figure 5.2(a). Panel (b) shows the magnitude of the estimates (x-axis) versus the difference between the two. As the raw effect estimates dip below \(-2\), the shrinkage pulled the values towards the center of the estimates. For example, the location with the largest log-odds value (Mountain View) was reduced from 0.030 to \(-0.065\).
Empirical Bayes methods can also be used, in the form of linear (and generalized linear) mixed models (West and Galecki 2014). These are non-Bayesian estimation methods that also incorporate shrinkage in their estimation procedures. While they tend to be faster than the types of Bayesian models used here, they offer less flexibility than a fully Bayesian approach.
One issue with effect encoding, independent of the estimation method, is that it increases the possibility of overfitting. We are taking the estimated effects from one model and putting them in another model (as variables). If these two models are based on the same data, it is somewhat of a self-fulfilling prophecy. If these encodings are not consistent with future data, overfitting will result that the model cannot detect, since it is not exposed to any other data that might contradict the findings. Also, the use of summary statistics as predictors can drastically underestimate the variation in the data and might give a falsely optimistic opinion of the utility of the new encoding column^{43}. As such, it is strongly recommended that either different data sets be used to estimate the encodings and the predictive model, or that their derivation be conducted inside resampling so that the assessment set can measure the overfitting (if it exists).
Another supervised approach comes from the deep learning literature on the analysis of textual data. In their case, large amounts of text can be cut up into individual words. Rather than making each of these words into its own indicator variable, word embedding or entity embedding approaches have been developed. Similar to the dimension reduction methods described in the next chapter, the idea is to estimate a smaller set of numeric features that can be used to adequately represent the categorical predictors. Guo and Berkhahn (2016) and Chollet and Allaire (2018) describe this technique in more detail. In addition to the dimension reduction, there is the possibility that these methods can estimate semantic relationships between words so that words with similar themes (e.g. “dog”, “pet”, etc.) have similar values in the new encodings. This technique is not limited to text data and can be used to encode any type of qualitative variable.
Once the number of new features is specified, the model takes the traditional indicator variables and randomly assigns them to one of the new features. The model then tries to optimize both the allocation of the indicators to features as well as the parameter coefficients for the features themselves. The outcome in the model can be the same as the predictive model (e.g., sale price or the probability of a STEM profile). Any type of loss function can be optimized, but it is common to use the root mean squared error for numeric outcomes and cross-entropy for categorical outcomes (which are the loss functions for linear and logistic regression, respectively). Once the model is fit, the values of the embedding features are saved for each observed value of the qualitative factor. These values serve as a look-up table that is used for prediction. Additionally, an extra level can be allocated to the original predictor to serve as a placeholder for any new values for the predictor that are encountered after model training.
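The end product of the embedding fit is essentially a look-up table; a sketch of that data structure follows. The vectors here are randomly initialized purely to show the shape, whereas a real model learns them jointly with the outcome:

```python
import random

def init_embedding_table(levels, dim=3, seed=42):
    """Map each level, plus a placeholder for novel values, to a small
    dense vector. After training, these vectors replace the indicators."""
    rng = random.Random(seed)
    all_levels = list(levels) + ["<new location>"]
    return {lvl: [rng.gauss(0.0, 0.1) for _ in range(dim)] for lvl in all_levels}

table = init_embedding_table(["berkeley", "martinez", "san mateo"])

def embed(city):
    # Unseen cities fall back to the placeholder row of the look-up table.
    return table.get(city, table["<new location>"])

print(len(embed("berkeley")), len(embed("tokyo")))  # 3 3
```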
A more typical neural network structure can be used that places one or more sets of nonlinear variables in between the predictors and the outcome (i.e. the hidden layers). This allows the model to make more complex representations of the underlying patterns. While a complete description of neural network models is outside the scope of this book, Chollet and Allaire (2018) is a very accessible guide to fitting these models, and specialized software exists to create the encodings.
For the OkCupid data, there are 52 locations in the training set. We can try to represent these locations, plus a slot for potential new locations, using a set of embedding features. A schematic for the model is:
Here, each green node represents the part of the network that is dedicated to reducing the dimensionality of the location data to a smaller set. In this diagram, three embedding features are used. The model can also include a set of other predictors unrelated to the embedding components. In this diagram, an additional set of indicator variables, derived later in Section 5.6, is shown in orange. This allows the embeddings to be estimated in the presence of other potentially important predictors so that they can be adjusted accordingly. Each line represents a slope coefficient for the model. The connections between the indicator variable layer on the left and the embedding layer, represented by the green circular nodes, are the values that will be used to represent the original locations. The activation function connecting the embeddings and other predictors to the hidden layer used ReLU connections (introduced in Section 6.2.1). This allows the model additional complexity via nonlinear relationships. Here, ten hidden nodes were used during model estimation. In total, the network model estimated 678 parameters in the course of deriving the embedding features. Cross-entropy was used to fit the model and it was trained for 30 iterations, where it was found to converge. These data are not especially complex; there are few unique values, and each data point carries very little textual information since it belongs to a single location (unlike the case where text is cut into individual words).
Table 5.2 shows the results for several locations and Figure 5.3 contains the relationship between each of the resulting encodings (i.e. features) and the raw log-odds. Each of the features has a relationship with the outcome, with rank correlations of 0.46, 0.76, and 0.18. Some of the new features are correlated with the others; absolute correlation values range between 0 and 0.76. This raises the possibility that, for these data, fewer features might be required. Finally, as a reminder, while an entire supervised neural network was used here, the embedding features can be used in other models for the purpose of preprocessing the location values.
5.5 Encodings for Ordered Data
As we saw in the previous sections, an unordered predictor with \(C\) categories can be represented by \(C-1\) binary dummy variables or a hashed version of binary dummy variables. These methods effectively present the categorical information to the models. But now suppose that the \(C\) categories have a relative ordering. For example, consider a predictor that has the categories "low", "medium", and "high." We could convert this information to two binary predictors: \(X_{low} = 1\) for samples categorized as "low" (and 0 otherwise), and \(X_{medium} = 1\) for samples categorized as "medium" (and 0 otherwise). These new predictors would accurately identify low, medium, and high samples, but they would miss the information contained in the relative ordering, which could be very important relative to the response.
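The two binary predictors just described can be sketched in a few lines; this is a plain Python illustration (the book's code is in R), with "high" as the reference level encoded as all zeros.

```python
# Expand an ordered factor into the two indicators X_low and X_medium;
# "high" is the implied reference level (all indicators zero).
def dummy_encode(values, non_reference_levels):
    return [[int(v == lvl) for lvl in non_reference_levels] for v in values]

samples = ["low", "high", "medium", "low"]
print(dummy_encode(samples, ["low", "medium"]))
```

Note that this encoding treats the three categories as unordered; the ordering information is exactly what the polynomial contrasts below are designed to recover.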
Ordered categorical predictors need a different way to be presented to a model to uncover the relationship of the order to the response. Ordered categories may have a linear relationship with the response. For instance, we may see an increase of approximately 5 units in the response as we move from low to medium, and an increase of approximately 5 units as we move from medium to high. To allow a model to uncover this relationship, we must present the model with a numeric encoding of the ordered categories that represents a linear ordering. In statistics, this type of encoding is referred to as a polynomial contrast. A contrast has the characteristic that it is a single comparison (i.e. one degree of freedom) and its coefficients sum to zero. For the "low", "medium", "high" example above, the contrast to uncover a linear trend would be -0.71, 0, 0.71, where low samples encode to -0.71, medium samples to 0, and high samples to 0.71. Polynomial contrasts can extend to nonlinear shapes, too. If the relationship between the predictor and the response were best described by a quadratic trend, then the contrast would be 0.41, -0.82, and 0.41 (Table 5.3). Conveniently, these types of contrasts can be generated for predictors with any number of ordered levels, but the complexity of the contrast is constrained to one less than the number of categories in the original predictor. For instance, we could not explore a cubic relationship between a predictor and the response with a predictor that has only 3 categories.
By employing polynomial contrasts, we can investigate multiple relationships (linear, quadratic, etc.) simultaneously by including them in the same model. When we do this, the form of the numeric representation is important. Specifically, we desire the new predictors to contain unique information. To achieve this, the numeric representations are required to be orthogonal, meaning that the dot product of any two contrast vectors is 0.
|        | Linear | Quadratic |
|--------|--------|-----------|
| low    | -0.71  | 0.41      |
| medium | 0.00   | -0.82     |
| high   | 0.71   | 0.41      |
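The contrasts in Table 5.3 can be derived by orthogonalizing the powers of the level index; the sketch below does this in Python with a QR decomposition (mimicking R's `contr.poly()`; column signs may differ from the table, which does not affect a model fit).

```python
import numpy as np

# Derive orthogonal polynomial contrasts for a 3-level ordered factor.
levels = np.arange(1, 4)                        # low=1, medium=2, high=3
X = np.vander(levels, N=3, increasing=True)     # columns: 1, x, x^2
Q, _ = np.linalg.qr(X)                          # orthonormalize columns
contrasts = Q[:, 1:]                            # drop the intercept column

# Each column sums to zero and the two columns are orthogonal.
print(np.round(contrasts, 2))
```

The first column recovers the linear contrast (±0.71, 0, ∓0.71) and the second the quadratic contrast (0.41, -0.82, 0.41), matching Table 5.3 up to sign.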
It is important to recognize that patterns described by polynomial contrasts may not effectively relate a predictor to the response. For example, in some cases one might expect a trend where "low" and "medium" samples have a roughly equivalent response but "high" samples have a much different response. In this case, polynomial contrasts are unlikely to be effective at modeling this trend. Another downside to polynomial contrasts for ordered categories occurs when there is a moderate to high number of categories. If an ordered predictor has \(C\) levels, the encoding into dummy variables uses polynomials up to degree \(C-1\). It is very unlikely that these higher-degree polynomials are modeling important trends (e.g. octic patterns), and it might make sense to place a limit on the polynomial degree. In practice, we rarely explore the effectiveness of anything more than a quadratic polynomial.
As an alternative to polynomial contrasts, one could:
Treat the predictors as unordered factors. This would allow for patterns that are not covered by the polynomial feature set. Of course, if the true underlying pattern is linear or quadratic, unordered dummy variables will make the model’s work more difficult.
Translate the ordered categories into a single set of numeric scores based on context-specific information. For example, when discussing failure modes of a piece of computer hardware, experts would be able to rank the severity of a type of failure on an integer scale. A minor failure might be scored as a "1" while a catastrophic failure mode could be given a score of "10", and so on.
Simple visualizations and context-specific expertise can be used to understand whether either of these approaches is a good idea.
5.6 Creating Features from Text Data
Often, data contain textual fields that are gathered from questionnaires, articles, reviews, tweets, and other sources. For example, the OkCupid data contains the responses to nine open text questions, such as "my self-summary" and "six things I could never do without". The open responses are likely to contain important information relative to the outcome. Individuals in a STEM field are likely to have a different vocabulary than, say, someone in the humanities. Therefore, the words or phrases used in these open text fields could be very important to predicting the outcome. This kind of data is qualitative and requires more effort to put into a form that models can consume. How then can these data be explored and represented for a model?
For example, some profile text answers contained links to external websites. A reasonable hypothesis is that the presence of a hyperlink could be related to the person's profession. Table 5.4 shows the results for the random subset of profiles. The rate of hyperlinks in the STEM profiles was 21% while 11.6% of the non-STEM profiles contained at least one link. One way to evaluate a difference in two proportions is the odds-ratio (Agresti 2012). First, the odds of an event that occurs with rate \(p\) is defined as \(p/(1-p)\). For the STEM profiles, the odds of containing a hyperlink are relatively small, with a value of \(0.21/0.79 = 0.266\). For the non-STEM profiles, it is even smaller (0.132). The ratio of these two quantities can be used to understand the effect of having a hyperlink between the two professions. In this case, the odds of a profile being STEM are nearly 2.0-fold higher when the profile contains a link. Basic statistics can be used to assign a lower 95% confidence bound on this quantity. For these data the lower bound is 1.8, which indicates that this increase in the odds is unlikely to be due to random noise since the interval does not include a value of 1.0. Given these results, the indicator of a hyperlink will likely benefit a model and should be included.
|         | stem | other |
|---------|------|-------|
| Link    | 1051 | 582   |
| No Link | 3949 | 4418  |
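The odds-ratio and its lower confidence bound can be computed directly from the counts in Table 5.4; the Python sketch below uses the standard large-sample normal approximation on the log scale (a one-sided 95% bound, hence the 1.645 multiplier).

```python
import math

# Counts from Table 5.4. Rows: link / no link; columns: STEM / other.
stem_link, other_link = 1051, 582
stem_none, other_none = 3949, 4418

odds_stem = stem_link / stem_none      # odds of a link in STEM profiles
odds_other = other_link / other_none   # odds of a link in other profiles
odds_ratio = odds_stem / odds_other

# Lower one-sided 95% confidence bound via the log odds-ratio and its
# large-sample standard error.
se_log = math.sqrt(1 / stem_link + 1 / other_link + 1 / stem_none + 1 / other_none)
lower_95 = math.exp(math.log(odds_ratio) - 1.645 * se_log)
print(round(odds_ratio, 2), round(lower_95, 2))
```

This reproduces the values quoted in the text: an odds-ratio of about 2.0 with a lower bound of about 1.8.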
Are there words or phrases that would make good predictors of the outcome? To determine this, the text data must first be processed and cleaned. At this point, there were 63,630 distinct words in the subsample of 10,000 profiles. A variety of features were computed on the data, such as the number of commas, hashtags, mentions, exclamation points, and so on. This resulted in a set of 14 new "text related" features. In these data, there is an abundance of HTML markup tags in the text, such as `<br />` and `<a href="link">`. These were removed from the data, as well as punctuation, line breaks, and other symbols. Additionally, some words were nonsensical repeated characters, such as `***` or `aaaaaaaaaa`, and these were removed as well. Additional types of preprocessing of text are described below; for more information, Christopher, Prabhakar, and Hinrich (2008) is a good introduction to the analysis of text data, while Silge and Robinson (2017) is an excellent reference on the computational aspects of analyzing such data.
Given this set of 63,630 words and their associated outcome classes, odds-ratios can be computed. First, the words were filtered so that only terms with at least 50 occurrences in the 10,000 profiles would be analyzed. This cutoff was determined so that there was a sufficient frequency for modeling. With this constraint, the potential number of keywords was reduced to 4,940 terms. For each of these, the odds-ratio and associated p-value were computed. The p-value tests the hypothesis that the odds of the keyword occurring in either professional group are equal (i.e. an odds-ratio of 1.0). However, the p-values can easily provide misleading results for two reasons:
In isolation, the p-value only relates to the question "Is there a difference in the odds between the two groups?" The more appropriate question is "How much of a difference is there between the groups?" The p-value does not measure the magnitude of the differences.
When testing a single (predefined) hypothesis, the false positive rate for a hypothesis test using \(\alpha = 0.05\) is 5%. However, we are conducting 4,940 tests on the same data set, which drastically raises the collective false positive rate. The false discovery rate (FDR) p-value correction (Efron and Hastie 2016) was developed for just this type of scenario. This procedure uses the entire distribution of p-values and attenuates the false positive rate. If a particular keyword has an FDR value of 0.30, this implies that the collection of keywords with FDR values less than 0.30 has a collective 30% false discovery rate. For this reason, the focus here will be on the FDR values generated using the Benjamini-Hochberg correction (Benjamini and Hochberg 1995).
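The Benjamini-Hochberg adjustment itself is short enough to sketch; the Python version below implements the standard step-up procedure on an illustrative vector of p-values (the values are invented, not keyword results from these data).

```python
import numpy as np

# Benjamini-Hochberg FDR adjustment: sort p-values, scale the i-th
# smallest by n/i, then enforce monotonicity from the largest downward.
def bh_adjust(pvals):
    p = np.asarray(pvals, dtype=float)
    n = len(p)
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)     # p_(i) * n / i
    adjusted = np.minimum.accumulate(ranked[::-1])[::-1]
    out = np.empty(n)
    out[order] = np.clip(adjusted, 0, 1)            # restore input order
    return out

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(bh_adjust(pvals))
```

In practice one would use an existing implementation (e.g. `p.adjust(..., method = "BH")` in R), but the sketch shows why nearby p-values can share the same FDR value after the monotonicity step.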
To characterize these results, a volcano plot is shown in Figure 5.4, where the estimated odds-ratio is shown on the x-axis and the minus log of the FDR values is shown on the y-axis (where larger values indicate higher statistical significance). The size of the points is associated with the number of events found in the sampled data set. Keywords falling in the upper left- and right-hand sides would indicate a strong difference between the classes that is unlikely to be a random result. The plot shows far more keywords that have a higher likelihood of being found in the STEM profiles than in the non-STEM profiles, as more points fall on the upper right-hand side of one on the x-axis. Several of these have extremely high levels of statistical significance, with FDR values that are vanishingly small.
As a rough criterion for "importance", keywords with odds-ratios of at least 2 (in either direction) and an FDR value less than \(10^{-5}\) will be considered for modeling. This results in 50 keywords:
`biotech`, `climbing`, `code`, `coding`, `company`, `computer`, `computers`, `data`, `developer`, `ender`, `engineer`, `engineering`, `feynman`, `firefly`, `fixing`, `geek`, `geeky`, `im`, `internet`, `lab`, `law`, `lol`, `marketing`, `math`, `matrix`, `mechanical`, `neal`, `nerd`, `problems`, `programmer`, `programming`, `robots`, `science`, `scientist`, `silicon`, `software`, `solve`, `solving`, `startup`, `stephenson`, `student`, `systems`, `teacher`, `tech`, `techie`, `technical`, `technology`, `therapist`, `web`, `websites`
Of these keywords, only 7 were enriched in the non-STEM profiles: `im`, `lol`, `law`, `teacher`, `student`, `therapist`, and `marketing`. Of the STEM-enriched keywords, many make sense (and play to stereotypes). The majority are related to occupation (e.g. `engin`, `startup`, and `scienc`), while others are clearly related to popular geek culture, such as `firefli`^{44}, `neal` and `stephenson`^{45}, and `scifi`.
One final set of 9 features was computed related to the sentiment and language of the essays. There are curated collections of words with assigned sentiment values. For example, "horrible" has a fairly negative connotation while "wonderful" is associated with positivity. Words can be assigned qualitative assessments (e.g. "positive", "neutral", etc.) or numeric scores where neutrality is given a value of zero. In some problems, sentiment might be a good predictor of the outcome. This feature set included sentiment-related measures, as well as measures of the point of view (i.e. first-, second-, or third-person text) and other language elements.
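A lexicon-based sentiment score is simply a sum of per-word values; the Python toy below illustrates the idea with an invented four-word lexicon (real analyses use curated lexicons such as AFINN or bing, which were not used to build this sketch).

```python
# Toy lexicon-based sentiment scoring; the word values are invented
# for illustration only.
lexicon = {"horrible": -3, "wonderful": 3, "good": 2, "bad": -2}

def sentiment_score(text):
    # Words absent from the lexicon contribute zero (treated as neutral).
    return sum(lexicon.get(word, 0) for word in text.lower().split())

print(sentiment_score("a wonderful day with horrible traffic"))
```

Here the positive and negative words cancel out, which also hints at a limitation of bag-of-words sentiment: context and negation are ignored.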
Do these features have an impact on a model? A series of logistic regression models were computed using different feature sets:
A basic set of profile characteristics unrelated to the essays (e.g. age, religion, etc.) consisting of 168 predictors (after generating dummy variables). This resulted in a baseline area under the ROC curve of 0.786. The individual resamples for this model are shown in Figure 5.5.
The addition of the simple text features increased the number of predictors to 182. Performance was not appreciably affected, as the AUC value was 0.79. These features were discarded from further use.
The keywords were added to the basic profile features. In this case, performance jumped to 0.847. Recall that the keywords were derived from a smaller portion of these data, so this estimate may be slightly optimistic. However, the entire training set was cross-validated and, if overfitting were severe, the resampled performance estimates would not be very good.
The sentiment and language features were added to this model, producing an AUC of 0.848. This indicates that these aspects of the essays did not provide any predictive value above and beyond what was already captured by the previous features.
Overall, the keyword and basic feature model would be the best version found here^{46}.
The strategy shown here for computing and evaluating features from text is fairly simplistic and is not the only approach that could be taken. Other methods for preprocessing text data include:
 removing commonly used stop words, such as “is”, “the”, “and”, etc.
 stemming the words so that similar words, such as the singular and plural versions, are represented as a single entity.
The SMART lexicon of stop words (Lewis et al. 2004) was filtered out of the current set of words, leaving 63,118 unique results. The words were then "stemmed" using the Porter algorithm (Willett 2006) to convert similar words to a common root. For example, these 7 words are fairly similar: "teach", "teacher", "teachers", "teaches", "teachable", "teaching", and "teachings". Stemming would reduce these to 3 unique values: "teach", "teacher", and "teachabl".
Once these values were stemmed, there were 45,704 unique words remaining. While this does reduce the potential number of terms to analyze, there are potential drawbacks related to a reduced specificity of the text. Schofield and Mimno (2016) and Schofield, Magnusson, and Mimno (2017) demonstrate that there can be harm done using these preprocessors.
One other method for determining relevant features uses the term frequency-inverse document frequency (tf-idf) statistic (Amati and Van R 2002). Here the goal is to find words or terms that are important to individual documents in the collection of documents at hand. For example, if the words in this book were processed, there are some that would have a high frequency (e.g. "predictor", "encode", or "resample") but are unusual in most other contexts.
For a word W in document D, the term frequency, tf, is the number of times that W is contained in D (usually adjusted for the length of D). The inverse document frequency, idf, is a weight that will normalize by how often the word occurs in the current collection of documents. As an example, suppose the term frequency of the word "feynman" in a specific profile was 2. In the sample of profiles used to derive the features, this word occurs in 55 profiles out of 10,000. The inverse document frequency is usually represented as the log of the ratio, or \(\log_2(10000/55)\), which is 7.5. The tf-idf value for "feynman" in this profile would be \(2\times\log_2(10000/55)\) or 15. However, suppose that in the same profile the word "internet" also occurs twice. This word is more prevalent across profiles; it is contained at least once in 1,529 profiles. Its tf-idf value is 5.4; in this way the same raw count is downweighted according to the word's abundance in the overall data set. The word "feynman" occurs the same number of times as "internet" in this hypothetical profile, but "feynman" is more distinctive or important (as measured by tf-idf) because it is rarer overall, and thus possibly more effective as a feature for prediction.
One potential issue with tf-idf in predictive models is the notion of the "current collection of documents." This collection can be defined for the training set, but what should be done when a single new sample is being predicted (i.e. there is no collection)? One approach is to use the overall idf values from the training set to weight the term frequencies found in new samples.
Additionally, all of the previous approaches have considered a single word at a time. Sequences of consecutive words can also be considered and might provide additional information. n-grams are terms built from sequences of n consecutive words. One might imagine that the sequence "I hate computers" would be scored differently than the simple term "computer" when predicting whether a profile corresponds to a STEM profession.
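Extracting n-grams from tokenized text is a one-line sliding window; the Python sketch below shows the bigrams produced from the example phrase in the text.

```python
# Slide a window of length n across the token list and join each window.
def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "i hate computers".split()
print(ngrams(tokens, 2))
```

A feature set could then count occurrences of each bigram per profile, exactly as was done for single keywords above, at the cost of a much larger vocabulary.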
5.7 Factors versus Dummy Variables in Tree-Based Models
As previously mentioned, certain types of models have the ability to use categorical data in its natural form (i.e. without conversions to dummy variables). A simple regression tree (L. Breiman et al. 1984) using the day of the week predictor for the Chicago data resulted in this simple split for prediction:
if day in {Sun, Sat} then ridership = 4.4K
else ridership = 17.3K
Suppose the day of the week had been converted to dummy variables. What would have occurred? In this case, the model is slightly more complex since it can only create rules as a function of a single dummy variable at a time:
if day = Sun then ridership = 3.84K
else if day = Sat then ridership = 4.96K
else ridership = 17.30K
The non-dummy variable model could have resulted in the same structure if it had found that the results for Saturday and Sunday were sufficiently different to merit an extra set of splits. This leads to a question about using categorical predictors in tree-based models: does it matter how the predictors are encoded?
To answer this question, a series of experiments were conducted. Several classification data sets were used to make the comparison between the different encodings. These are summarized in Table 5.5. Two data sets contained both ordered and unordered factors. As described below, the ordered factors were treated in different ways.
|                        | Attrition | Cars | Churn | German Credit | HPC  |
|------------------------|-----------|------|-------|---------------|------|
| n                      | 1470      | 1728 | 5000  | 1000          | 4331 |
| p                      | 30        | 6    | 19    | 20            | 7    |
| Classes                | 2         | 4    | 2     | 2             | 4    |
| Numeric Predictors     | 16        | 0    | 15    | 7             | 5    |
| Factor Predictors      | 14        | 6    | 4     | 13            | 2    |
| Ordered Factors        | 7         | 0    | 0     | 1             | 0    |
| Factors with 3+ Levels | 12        | 7    | 2     | 11            | 3    |
| Factors with 5+ Levels | 3         | 0    | 1     | 5             | 2    |
For each iteration of the simulation, 75% of the data was used for the training set and 10-fold cross-validation was used to tune the models. When the outcome variable had two levels, the area under the ROC curve was maximized. For the other data sets, the multinomial log-likelihood was optimized. The same number of tuning parameter values was evaluated although, in some cases, the values of these parameters were different due to the number of predictors in the data before and after dummy variable generation. The same resamples and random numbers were used for the models with and without dummy variables. When the data contained a significant class imbalance, the data were downsampled to compensate.
For each data set, models were fit with the data in its original form as well as (unordered) dummy variables. For the two data sets with ordinal data, an additional model was created using ordinal dummy variables (i.e. polynomial contrasts).
Several models were fit to the data sets. Unless otherwise stated, the default software parameters were used for each model.
 Single CART trees (L. Breiman et al. 1984). The complexity parameter was chosen using the "one-standard-error rule" that is internal to the recursive partitioning algorithm.
 Bagged CART trees (Breiman 1996). Each model contained 50 constituent trees.
 Single C5.0 trees and single C5.0 rulesets (Kuhn and Johnson 2013)
 Single conditional inference trees (Hothorn, Hornik, and Zeileis 2006). A grid of 10 values for the p-value threshold for splitting was evaluated.
 Boosted CART trees (a.k.a. stochastic gradient boosting; Friedman (2002)). The models were optimized over the number of trees, the learning rate, the tree depth, and the number of samples required to make additional splits. Twenty-five parameter combinations were evaluated using random search.
 Boosted C5.0 trees. These were tuned for the number of iterations.
 Boosted C5.0 rules. These were also tuned for the number of iterations.
 Random forests using CART trees (Breiman 2001). Each forest contained 1,500 trees and the number of variables selected for a split was tuned over a grid of 10 values.
 Random forests using conditional (unbiased) inference trees (Strobl et al. 2007). Each forest contained 100 trees and the number of variables selected for a split was tuned over a grid of 10 values.
For each model, a variety of performance metrics were estimated using resampling, as well as the total time to train and tune the model. The programs used in the simulation are contained in the GitHub repository topepo/dummiesvsfactors.
For the three data sets with two classes, Figure 5.6 shows a summary of the results for the area under the ROC curve. In the plot, the percent difference in performance is calculated using
\[ \%Difference = \frac{Factor - Dummy}{Factor}\times 100 \]
In this way, positive values indicate that the factor encodings have better performance. The image shows the median in addition to the lower and upper 5% percentiles of the distribution. The gray dotted line indicates no difference between the encodings.
The results of these simulations show that, for these data sets, there is no real difference in the area under the ROC curve between the encoding methods. This also appears to be true when simple factor encodings are compared to dummy variables generated from polynomial contrasts for ordered predictors^{47}. Of the ten models, there are only two cases out of 40 scenarios where the mainstream of the distribution did not cover zero. Stochastic gradient boosting and bagged CART trees are both ensemble methods, and these showed a 2%-4% drop in the area under the ROC curve when using factors instead of dummy variables for a single data set.
Another metric, overall accuracy, can also be assessed. These results are shown in Figure 5.7 where all of the models can be taken into account. In this case, the results are mixed. When comparing factors to unordered dummy variables, two of the models show differences in encodings. The churn data shows similar results to the ROC curve metrics. The car evaluation data demonstrates a nearly uniform effect where factor encodings do better than dummy variables. Recall that in the car data, which has four classes, all of the predictors are categorical. For this reason, it is likely to show the most effect of all of the data sets.
In the case of dummy variables generated using polynomial contrasts, neither of the data sets shows a difference between the two encodings. However, the car evaluation data shows a pattern where the factor encodings had no difference compared to polynomial contrasts, but, when compared to unordered dummy variables, the factor encoding is superior. This indicates that the underlying trend in the data follows a polynomial pattern.
In terms of performance, it appears that differences between the two encodings are rare (but can occur). One might infer that, since the car data contains all categorical variables, this situation would be a good indicator for when to use factors instead of dummy variables. However, two of the data sets (Attrition and German Credit) have a high proportion of categorical predictors and show no difference. In some data sets, the effect of the encoding will depend on whether the categorical predictor(s) are important to the outcome and in what way.
In summary, while few differences were seen, it is very difficult to predict when a difference will occur.
However, one other statistic was computed for each of the simulations: the time to train the models. Figure 5.8 shows the speed-up gained by using factors instead of dummy variables (i.e. a value of 2.5 indicates that dummy variable models are two and a half times slower than factor encoding models). Here, there is a very strong trend that factor-based models are more efficiently trained than their dummy variable counterparts. The reason is likely that the expanded number of predictors (caused by generating dummy variables) requires more computational time than the method for determining the optimal split of factor levels. The exceptions to this trend are the models using conditional inference trees.
For a guideline, we suggest using the predictors without converting to dummy variables and, if the model appears promising, to also try refitting using dummy variables.
5.8 Computing
R programs for reproducing these analyses can be found at https://github.com/topepo/TBD.
References
Haase, R. 2011. Multivariate General Linear Models. Sage.
Timm, N, and J Carlson. 1975. “Analysis of Variance Through Full Rank Models.” Multivariate Behavioral Research Monographs. Society of Multivariate Experimental Psychology.
Preneel, B. 2010. “Cryptographic Hash Functions: Theory and Practice.” In ICICS, 1–3.
Weinberger, K, A Dasgupta, J Langford, A Smola, and J Attenberg. 2009. “Feature Hashing for Large Scale Multitask Learning.” In Proceedings of the 26th Annual International Conference on Machine Learning, 1113–20. ACM.
Box, GEP, W Hunter, and J Hunter. 2005. Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building. Wiley.
MicciBarreca, D. 2001. “A Preprocessing Scheme for HighCardinality Categorical Attributes in Classification and Prediction Problems.” ACM SIGKDD Explorations Newsletter 3 (1):27–32.
Zumel, N., and J. Mount. 2016. "vtreat: A data.frame Processor for Predictive Modeling." arXiv.org.
McElreath, R. 2015. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Chapman; Hall/CRC.
West, B, K Welch, and A Galecki. 2014. Linear Mixed Models: A Practical Guide Using Statistical Software. CRC Press.
Guo, C, and F Berkhahn. 2016. “Entity embeddings of categorical variables.” arXiv.org.
Chollet, F, and JJ Allaire. 2018. Deep Learning with R. Manning.
Agresti, A. 2012. Categorical Data Analysis. WileyInterscience.
Christopher, D, R Prabhakar, and S Hinrich. 2008. Introduction to Information Retrieval. Cambridge University Press.
Silge, J, and D Robinson. 2017. Text Mining with R: A Tidy Approach. O’Reilly.
Efron, B, and T Hastie. 2016. Computer Age Statistical Inference. Cambridge University Press.
Benjamini, Y, and Y Hochberg. 1995. “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” Journal of the Royal Statistical Society. Series B (Methodological). JSTOR, 289–300.
Lewis, D, Y Yang, T Rose, and F Li. 2004. “Rcv1: A New Benchmark Collection for Text Categorization Research.” Journal of Machine Learning Research 5:361–97.
Willett, P. 2006. “The Porter Stemming Algorithm: Then and Now.” Program 40 (3):219–23.
Schofield, A, and D Mimno. 2016. “Comparing Apples to Apple: The Effects of Stemmers on Topic Models.” Transactions of the Association for Computational Linguistics 4:287–300.
Schofield, A, M Magnusson, and D Mimno. 2017. “Understanding Text PreProcessing for Latent Dirichlet Allocation.” In Proceedings of the 15th Conference of the European chapter of the Association for Computational Linguistics, 2:432–36.
Amati, G, and Cornelis J Van R. 2002. “Probabilistic Models of Information Retrieval Based on Measuring the Divergence from Randomness.” ACM Transactions on Information Systems 20 (4):357–89.
Breiman, L., J. Friedman, R. Olshen, and C. Stone. 1984. Classification and Regression Trees. New York: Chapman; Hall.
Breiman, L. 1996. “Bagging Predictors.” Machine Learning 24 (2). Springer:123–40.
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Vol. 26. Springer.
Hothorn, T, K Hornik, and A Zeileis. 2006. “Unbiased Recursive Partitioning: A Conditional Inference Framework.” Journal of Computational and Graphical Statistics 15 (3). Taylor & Francis:651–74.
Friedman, J. 2002. “Stochastic Gradient Boosting.” Computational Statistics & Data Analysis 38 (4). Elsevier:367–78.
Breiman, L. 2001. “Random Forests.” Machine Learning 45 (1). Springer:5–32.
Strobl, C, AL Boulesteix, A Zeileis, and T Hothorn. 2007. “Bias in Random Forest Variable Importance Measures: Illustrations, Sources and a Solution.” BMC Bioinformatics 8 (1):25.
These computations use the "MurmurHash3" hash found at https://github.com/aappleby/smhasher and implemented in the FeatureHashing R package.↩
Traditional analysis of alias structures could be applied to this problem. However, more research would be needed since those methods are usually applied to balanced designs where the values of the predictors have the same prevalence. Also, as previously mentioned, they tend to focus on aliasing between multiple variables. In any case, there is much room for improvement on this front.↩
In this situation, the shrinkage of the values towards the mean is called partial pooling of the estimates.↩
This was the feature set that was used in Chapter 3 when resampling and model comparisons were discussed.↩
Note that many of the nonensemble methods, such as C5.0 trees/rules, CART, and conditional inference trees, show a significant amount of variation. This is due to the fact that they are unstable models with high variance.↩