5.5 Encodings for Ordered Data

In the previous sections, we saw that an unordered predictor with \(C\) categories can be represented by \(C-1\) binary dummy variables or by a hashed version of binary dummy variables. These methods effectively present the categorical information to the models. But now suppose that the \(C\) categories have a relative ordering. For example, consider a predictor that has the categories of “low”, “medium”, and “high.” This information could be converted to 2 binary predictors: \(X_{low} = 1\) for samples categorized as “low” and 0 otherwise, and \(X_{medium} = 1\) for samples categorized as “medium” and 0 otherwise. These new predictors would accurately identify low, medium, and high samples, but they would miss the information contained in the relative ordering, which could be very important for predicting the response.

Ordered categorical predictors need to be presented to a model in a different way so that the relationship between the order and the response can be uncovered. Ordered categories may have a linear relationship with the response. For instance, we may see an increase of approximately 5 units in the response as we move from low to medium and another increase of approximately 5 units as we move from medium to high. To allow a model to uncover this relationship, the model must be presented with a numeric encoding of the ordered categories that represents a linear ordering. In statistics, this type of encoding is referred to as a polynomial contrast. A contrast has the characteristic that it is a single comparison (i.e., one degree of freedom) and its coefficients sum to zero. For the “low”, “medium”, “high” example above, the contrast to uncover a linear trend would be -0.71, 0, 0.71, where low samples encode to -0.71, medium samples to 0, and high samples to 0.71. Polynomial contrasts can extend to nonlinear shapes, too. If the relationship between the predictor and the response were best described by a quadratic trend, then the contrast would be 0.41, -0.82, and 0.41 (Table 5.3). Conveniently, these types of contrasts can be generated for predictors with any number of ordered categories, but the polynomial degree of the contrast is limited to one less than the number of categories in the original predictor. For instance, we could not explore a cubic relationship between a predictor and the response with a predictor that has only 3 categories.

By employing polynomial contrasts, we can investigate multiple relationships (linear, quadratic, etc.) simultaneously by including these in the same model. When we do this, the form of the numeric representation is important. Specifically, each new predictor should contain unique information. To achieve this, the numeric representations are required to be orthogonal, meaning that the dot product of any two contrast vectors is 0.
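
As a quick check using the rounded three-level values from Table 5.3, the linear and quadratic contrasts are orthogonal, and the coefficients of each contrast sum to zero:

\[
(-0.71)(0.41) + (0)(-0.82) + (0.71)(0.41) = 0, \qquad -0.71 + 0 + 0.71 = 0, \qquad 0.41 - 0.82 + 0.41 = 0
\]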

Table 5.3: An example of linear and quadratic polynomial contrasts for an ordered categorical predictor with three levels.
Original Value	Dummy Variables
	Linear	Quadratic
low	-0.71	0.41
medium	0.00	-0.82
high	0.71	0.41
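
The values in Table 5.3 can be reproduced with a short calculation. The following is a minimal sketch, not a reference implementation: it assumes the levels are equally spaced (scored 1, 2, ..., \(C\)) and builds the contrasts by Gram-Schmidt orthogonalization of the powers of those scores. The function name `poly_contrasts` and its `max_degree` argument are hypothetical names introduced only for this illustration.

```python
import math


def poly_contrasts(n_levels, max_degree=None):
    """Orthonormal polynomial contrasts for equally spaced ordered levels.

    Returns a list of columns; element d - 1 holds the degree-d contrast,
    with one value per level (lowest level first).
    """
    if max_degree is None:
        max_degree = n_levels - 1
    if not 1 <= max_degree <= n_levels - 1:
        raise ValueError("max_degree must be between 1 and n_levels - 1")

    scores = list(range(1, n_levels + 1))  # assumption: equal spacing

    # Start from the constant column. Making every contrast orthogonal to it
    # is what forces the contrast coefficients to sum to zero.
    basis = [[1.0 / math.sqrt(n_levels)] * n_levels]

    for degree in range(1, max_degree + 1):
        col = [float(s) ** degree for s in scores]
        # Gram-Schmidt step: remove the part explained by earlier columns
        for prev in basis:
            proj = sum(c * p for c, p in zip(col, prev))
            col = [c - proj * p for c, p in zip(col, prev)]
        norm = math.sqrt(sum(c * c for c in col))
        basis.append([c / norm for c in col])

    return basis[1:]  # drop the constant column


linear, quadratic = poly_contrasts(3)
print([round(v, 2) for v in linear])
print([round(v, 2) for v in quadratic])

# Dot product of the two contrasts; zero up to floating-point error
print(sum(a * b for a, b in zip(linear, quadratic)))
```

For three levels, the rounded output should agree with the linear and quadratic columns of Table 5.3. Requesting `max_degree=3` with only three levels raises an error, which mirrors the restriction that the degree cannot exceed one less than the number of categories.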

It is important to recognize that patterns described by polynomial contrasts may not effectively relate a predictor to the response. For example, in some cases, one might expect a trend where “low” and “medium” samples have a roughly equivalent response but “high” samples have a much different response. In this case, polynomial contrasts are unlikely to be effective at modeling this trend. Another downside to polynomial contrasts for ordered categories occurs when there is a moderate to high number of categories. If an ordered predictor has \(C\) levels, the encoding into dummy variables uses polynomials up to degree \(C-1\). It is very unlikely that these higher-degree polynomials are modeling important trends (e.g., octic patterns) and it might make sense to place a limit on the polynomial degree. In practice, we rarely explore the effectiveness of anything more than a quadratic polynomial.
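
To make the step-like case concrete, the sketch below fits a least-squares line through the linear contrast values alone. The level means (10, 10, and 20) are made-up numbers chosen only to mimic a pattern where “low” and “medium” behave alike and “high” differs; they are not taken from any data set.

```python
# Hypothetical mean responses per level (made-up numbers for illustration):
# "low" and "medium" are about the same, "high" is much larger.
level_means = {"low": 10.0, "medium": 10.0, "high": 20.0}

# Linear contrast values from Table 5.3
linear = {"low": -0.71, "medium": 0.0, "high": 0.71}

levels = ["low", "medium", "high"]
x = [linear[lv] for lv in levels]
y = [level_means[lv] for lv in levels]

# Ordinary least-squares fit of y = intercept + slope * x
x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
intercept = y_bar - slope * x_bar

fitted = {lv: intercept + slope * linear[lv] for lv in levels}
residuals = {lv: level_means[lv] - fitted[lv] for lv in levels}

for lv in levels:
    print(lv, round(fitted[lv], 2), round(residuals[lv], 2))
```

Under these assumed means, the fitted line rises by the same amount at each step, so it overshoots “medium” and undershoots both “low” and “high”. A linear term on its own cannot represent the flat-then-jump shape.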

As an alternative to polynomial contrasts, one could:

Treat the predictors as unordered factors. This would allow for patterns that are not covered by the polynomial feature set. Of course, if the true underlying pattern is linear or quadratic, unordered dummy variables may not effectively uncover this trend.
Translate the ordered categories into a single set of numeric scores based on context-specific information. For example, when discussing failure modes of a piece of computer hardware, experts would be able to rank the severity of a type of failure on an integer scale. A minor failure might be scored as a “1” while a catastrophic failure mode could be given a score of “10” and so on.
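
The two alternatives above can be sketched side by side. This is an illustration under stated assumptions: the level names, the sample values, and the score of 4 for a “moderate” failure are hypothetical. Only the scores of 1 for a minor failure and 10 for a catastrophic failure come from the text.

```python
# Hypothetical ordered levels of failure severity, least to most severe
levels = ["minor", "moderate", "catastrophic"]

# Hypothetical observed failure modes for four samples
samples = ["minor", "catastrophic", "moderate", "minor"]


def dummy_encode(value, levels):
    """Alternative 1: ignore the ordering and use C - 1 binary dummy variables.

    The first level acts as the reference and encodes to all zeros.
    """
    return [1 if value == lv else 0 for lv in levels[1:]]


# Alternative 2: expert-assigned severity scores on an integer scale.
# The value for "moderate" is an assumption made for this example.
severity_score = {"minor": 1, "moderate": 4, "catastrophic": 10}

dummies = [dummy_encode(s, levels) for s in samples]
scores = [severity_score[s] for s in samples]

print(dummies)
print(scores)
```

The dummy encoding lets each level have its own effect but discards the ordering. The score encoding keeps the ordering in a single numeric column, and its usefulness depends entirely on how well the assigned scores reflect the real differences between levels.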

Simple visualizations and context-specific expertise can be used to understand whether either of these approaches is a good idea.