5.2 Encoding Predictors with Many Categories
If there are \(C\) categories, what happens when \(C\) becomes very large? For example, ZIP code in the United States may be an important predictor for outcomes that are affected by a geographic component. There are more than 40K possible ZIP codes and, depending on how the data are collected, this might produce an overabundance of dummy variables (relative to the number of data points). As mentioned in the previous section, this can cause the data matrix to be overdetermined and restrict the use of certain models. Also, ZIP codes in highly populated areas may have a higher rate of occurrence in the data, leading to a “long tail” of locations that are infrequently observed.
One potential issue is that resampling might exclude some of the more rare categories from the analysis set. This would lead to dummy variable columns in the data that contain all zeros and, for many models, this would become a numerical issue that will cause an error. Moreover the model will not be able to provide a relevant prediction for new samples that contain this predictor. When a predictor contains a single value, we call this a zero-variance predictor because there truly is no variation displayed by the predictor.
The first way to handle this issue is to create the full set of dummy variables and simply remove the zero-variance predictors. This is a simple and effective approach but it may be difficult to know a priori what terms will be in the model. In other words, during resampling, there may be a different number of model parameters across resamples. This can be a good side-effect since it captures the variance caused by omitting rarely occurring values and propagates this noise into the resampling estimates of performance.
Prior to producing dummy variables, one might consider determining which (if any) of these variables are near-zero variance predictors or have the potential to have near zero variance during the resampling process. These are predictors that have few unique values (such as two values for binary dummy variables) and occur infrequently in the data (Kuhn and Johnson 2013). For the training set, we would consider the ratio of the frequencies of the most commonly occurring value to the second most commonly occurring. For dummy variables, this is simply the ratio of the numbers of ones and zeros. Suppose that a dummy variable has 990 values of zero and 10 ones. The ratio of these frequencies, 99, indicates that it could easily be converted into all zeros during resampling. We suggest a rough cutoff of 19 to declare such a variable “too rare” although the context may raise or lower this bar for different problems.
Although near-zero variance predictors likely contain little valuable predictive information, we may not desire to filter these out. One way to avoid filtering these predictors is to redefine the predictor’s categories prior to creating dummy variables. Instead an “other” category can be created that pools the rarely occurring categories, assuming that such a pooling is sensible. A cutoff based on the training set frequency can be specified that indicates which categories should be combined. Again, this should occur during resampling so that the final estimates reflect the variability in which predictor values are included in the “other” group.
Another way to combine categories is to use a hashing function (or hashes). Hashes are used to map one set of values to another set of values and are typically used in databases and cryptography (Preneel 2010). The original values are called the keys and are typically mapped to a smaller set of artificial hash values. In our context, a potentially large number of predictor categories are the keys and we would like to represent them using a smaller number of categories (i.e. the hashes). The number of possible hashes is set by the user and, for numerical purposes, is a power of 2. Some computationally interesting aspects to hash functions are
- The only data required is the value being hashed and the resulting number of hashes. This is different from the previously described contrast function using a reference cell.
- The translation process is completely deterministic.
- In traditional applications, it is important that the number of keys that are mapped to hashes is relatively uniform. This would be important if the hashes are used in cryptography since it increases the difficulty of guessing the result. As discussed below, this might not be a good characteristic for data analysis.
- Hash functions are unidirectional; once the hash values are created, there is no way of knowing the original values. If there are a known and finite set of original values, a table can be created to do the translation but, otherwise, the keys are indeterminable when only the hash value is known.
- There is no free lunch when using this procedure; some of the original categories will be mapped to the same hash value (called a “collision”). The number of collisions will be largely determined by the number of features that are produced.
When using hashes to create dummy variables, the procedure is called “feature hashing” or the “hash trick” (Weinberger et al. 2009). Since there are many different hashing functions, there are different methods for producing the table that maps the original set to the reduced set of hashes.
|Hash Integer Value||16 cols||256 cols|
|south san francisco||-373608504||9||201|
The hash value is an integer value derived solely from each city’s text value. In order to convert this to a new feature in the data set, modular arithmetic is used. Suppose that 16 features are desired. To create a mapping from the integer to one of the 16 feature columns, integer division is used on the hash value using the formula \(column = (integer\mod 16) + 1\). For example, for the string “mountain view”, we have \((-1267876914 \mod 16) + 1 = 15\), so this city is mapped to the fifteenth dummy variable feature. The last column above shows the assignments for the cities when 256 feature are requested; \(\mod 256\) has more possible remainders than \(\mod 16\) so more features are possible. With 16 features, the dummy variable assignments for a selection of locations in the data are:
|south san francisco||1||0||0||0||0||0||1||0||0||0||0||0|
There are several characteristics of this table to notice. First there are 4 hashes (columns) that are not shown in this table since they contain all zeros. This occurs because none of the hash integer values had those specific mod 16 remainders. This could be a problem for predictors that have the chance to take different string values than the training data and would be placed in one of these columns. In this case, it would be wise to create an ‘other’ category to ensure that new strings have a hashed representation. Next notice that there were 6 hashes with no collisions (i.e. have a single 1 in their column). But several columns exhibit collisions. An example of a collision is hash number 14 which encodes for both Menlo Park and San Leandro. In statistical terms, these two categories are said to be aliased or confounded, meaning that a parameter estimated from this hash cannot isolate the effect of either city. Alias structures have long been studied in the field of statistical experimental design although usually in the context of aliasing between variables rather than within variables (Box, Hunter, and Hunter 2005). Regardless, from a statistical perspective, minimizing the amount and degree of aliasing is an important aspect of a contrast scheme.
Some hash functions can also be signed, meaning that instead of producing a binary indicator for the new feature, possible values could be -1, 0, or +1. A zero would indicate that the category is not associated with that specific feature and the \(\pm 1\) values can indicate different values of the original data. Suppose that two categories map to the same hash value. A binary hash would result in a collision but a signed hash could avoid this by encoding one category as +1 and the other as -1. Here are the same data as above with a signed hash of the same size:
|south san francisco||1||0||0||0||0||0||-1||0||0||0||0||0|
Notice that the collision in hash 14 is now gone since a value of 1 corresponds to Menlo Park and San Leandro is encoded as -1. For a linear regression model, the coefficient for the 14th hash now represents the positive or negative shift in the response due to Menlo Park or San Leandro, respectively.
What level of aliasing should we expect? To investigate, data were simulated so that there were \(10^4\) predictor categories and each category consisted of 20 unique, randomly generated character values. Binary and signed hashes of various sizes were generated and the collision rate was determined. This simulation was done 10 different times and the average percent collision was calculated. Figure 5.1 shows the results where the y-axis is the rate of any level of collision in the features. As expected, the signed values are associated with less collisions than the binary encoding. The green line corresponds to the full set of dummy variables (without using hashes). For these data, using signed features would give the best aliasing results but the percentage of collisions is fairly high.
Because of how the new features are created, feature hashing is oblivious to the concept of predictor aliasing. Uniformity of the hash function is an important characteristic when hashes are used in their typical applications. However, it may be a liability in data analysis. The reduction of aliasing is an important concept in statistics; we would like to have the most specific representations of categories as often as possible for a few different reasons:
- Less aliasing allows for better interpretation of the results. If a predictor has a significant impact on the model, it is probably a good idea understand why. When multiple categories are aliased due to a collision, untangling the true effect may be very difficult. Although this may help narrow down which categories are affecting the response.
- Categories involved in collisions are not related in any meaningful way. For example, San Leandro and Menlo Park were not aliased because they share a particular similarity or geographic closeness. Because of the arbitrary nature of the collisions, it is possible to have different categories whose true underlying effect are counter to one another. This might have the effect of negating the impact of the hashed feature.
- Hashing functions have no notion of the probability that each key will occur. As such, it is conceivable that a category that occurs with great frequency is aliased with one that is rare. In this case, the more abundant value will have a much larger influence on the effect of that hashing feature.
Although we are not aware of any statistically-conscious competitor to feature hashing, there are some characteristics that a contrast function should have when used with a predictor containing a large number of categories44. For example, if there is the capacity to have some categories unaliased with any others, the most frequently occurring values should be chosen for these roles. The reason for this suggestion is that it would generate a high degree of specificity for the categories that are most likely to be exposed to the contrast function. One side effect of this strategy would be to alias less frequently occurring values to one another. This might be beneficial since it reduces the sparsity of those dummy variables. For example, if two categories that each account for 5% of the training set values are combined, the dummy variable has a two-fold increase in the number of ones. This would reduce the chances of being filtered out during resampling and might also lessen the potential effect that such a near-zero variance predictor might have on the analysis.
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
Preneel, B. 2010. “Cryptographic Hash Functions: Theory and Practice.” In ICICS, 1–3.
Weinberger, K, A Dasgupta, J Langford, A Smola, and J Attenberg. 2009. “Feature Hashing for Large Scale Multitask Learning.” In Proceedings of the 26th Annual International Conference on Machine Learning, 1113–20. ACM.
Box, GEP, W Hunter, and J Hunter. 2005. Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building. Wiley.
Traditional analysis of alias structures could be applied to this problem. However, more research would be needed since those methods are usually applied to balanced designs where the values of the predictors have the same prevalence. Also, as previously mentioned, they tend to focus on aliasing between multiple variables. In any case, there is much room for improvement on this front.↩