6.1 1:1 Transformations

There are a variety of modifications that can be made to an individual predictor that might improve its utility in a model. The first type of transformation discussed here is one that changes the scale of the data. A good example is the transformation described in Figure 1.3 of the first chapter. In that case, the two predictors had very skewed distributions and it was shown that using the inverse of the values improved model performance. Figure 6.1(a) shows the test set distribution for one of the predictors from Figure 1.3.

Figure 6.1: The distribution of a skewed predictor before (a) and after (b) applying the Box-Cox transformation.

A Box-Cox transformation (Box and Cox 1964) was used to estimate this transformation. The Box-Cox procedure, originally intended as a transformation of a model’s outcome, uses maximum likelihood estimation to estimate a transformation parameter \(\lambda\) in the equation

\[ x^{*} = \left\{ \begin{array}{l l} \frac{x^{\lambda}-1}{\lambda\: \tilde{x}^{\lambda-1}}, & \lambda \neq 0 \\ \tilde{x} \: \log x, & \lambda = 0 \\ \end{array} \right. \]

where \(\tilde{x}\) is the geometric mean of the predictor data. In this procedure, \(\lambda\) is estimated from the data. Because the parameter of interest is in the exponent, this type of transformation is called a power transformation. Some values of \(\lambda\) map to common transformations, such as \(\lambda = 1\) (no transformation), \(\lambda = 0\) (log), \(\lambda = 0.5\) (square root), and \(\lambda = -1\) (inverse). As you can see, the Box-Cox transformation is quite flexible in its ability to address many different data distributions. For the data in Figure 6.1, the parameter was estimated from the training set data to be \(\widehat{\lambda} = -1.09\). This is effectively the inverse transformation. Figure 6.1(b) shows the results when the transformation is applied to the test set, and yields a transformed distribution that is approximately symmetric. It is important to note that the Box-Cox procedure can only be applied to data that is strictly positive. To address this problem, Yeo and Johnson (2000) devised an analogous procedure that can be used on any numeric data.
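As a rough illustration of the estimation step, the sketch below applies the formula above for a fixed \(\lambda\) and chooses \(\widehat{\lambda}\) by a grid search over the training set. It relies on the property that, for the geometric-mean-normalized form, maximizing the normal likelihood is equivalent to minimizing the variance of the transformed values. The function names, the candidate grid, and the simulated log-normal data are illustrative assumptions and are not the computations used to produce Figure 6.1.

```python
import math
import random


def box_cox(values, lam, geo_mean):
    """Apply the geometric-mean-normalized Box-Cox formula for a fixed lambda."""
    if abs(lam) < 1e-12:
        return [geo_mean * math.log(v) for v in values]
    scale = lam * geo_mean ** (lam - 1.0)
    return [(v ** lam - 1.0) / scale for v in values]


def _variance(xs):
    """Population variance using plain floating point arithmetic."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def estimate_lambda(train, grid=None):
    """Grid-search estimate of lambda from strictly positive training data.

    Returns the chosen lambda and the training set geometric mean, both of
    which are needed to transform other data sets in the same way.
    """
    if any(v <= 0 for v in train):
        raise ValueError("Box-Cox requires strictly positive data")
    if grid is None:
        # Assumed search range of -2 to 2 in steps of 0.01.
        grid = [i / 100.0 for i in range(-200, 201)]
    geo_mean = math.exp(sum(math.log(v) for v in train) / len(train))
    best = min(grid, key=lambda lam: _variance(box_cox(train, lam, geo_mean)))
    return best, geo_mean


# Simulated right-skewed data (log-normal), so a value near zero is plausible.
random.seed(0)
train = [math.exp(random.gauss(0.0, 0.8)) for _ in range(1000)]
test = [math.exp(random.gauss(0.0, 0.8)) for _ in range(200)]

lam_hat, train_geo_mean = estimate_lambda(train)
# The test set reuses the statistics estimated from the training set.
test_transformed = box_cox(test, lam_hat, train_geo_mean)
```

Both \(\widehat{\lambda}\) and the geometric mean come from the training set only. The test set is transformed with those stored values rather than being re-estimated.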

Also, note that both transformations are unsupervised since, in this application, the outcome is not used in the computations. While the transformation might improve the predictor distribution, there is no guarantee that it will improve the model. However, there are a variety of parametric models that utilize polynomial calculations on the predictor data, such as most linear models, neural networks, and support vector machines. For these models, a skewed predictor distribution can have a harmful effect since the tails of the distribution can dominate the underlying calculations.

It should be noted that the Box-Cox transformation was originally used as a supervised transformation of the outcome. A simple linear model would be fit to the data and the transformation would be estimated from the model residuals. The outcome variable would be transformed using the results of the Box-Cox method. Here, the method has been appropriated to be independently applied to each predictor and uses their data, instead of the residuals, to determine an appropriate transformation.

Another important transformation to an individual variable is for a variable that has values bounded between zero and one, such as proportions. The problem with modeling this type of outcome is that model predictions are not guaranteed to be within the same boundaries. For data between zero and one, the logit transformation could be used. If \(\pi\) is the variable, the logit transformation is

\[ \text{logit}(\pi) = \log\left(\frac{\pi}{1-\pi}\right) \]

This transformation changes the scale from values between zero and one to values between negative and positive infinity. At the extremes, when the data are exactly zero or one, a small constant can be added or subtracted to avoid taking the logarithm of zero or dividing by zero. Once model predictions are created, the inverse logit transformation can be used to place the values back on their original scale. An alternative to the logit transformation is the arcsine transformation, which is primarily applied to the square root of the proportions (e.g., \(y^* = \arcsin(\sqrt{\pi})\)).
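A minimal sketch of these transformations is given below. The function names and the default constant `eps` are assumptions made for illustration. The text does not prescribe a specific value, and clamping values into the interval from `eps` to `1 - eps` is one simple way of applying the small adjustment at the extremes.

```python
import math


def logit(p, eps=1e-6):
    """Logit of a proportion, nudging exact 0 or 1 inward by eps (assumed value)."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def inv_logit(z):
    """Inverse logit, written to avoid overflow for large negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def arcsine_sqrt(p):
    """Arcsine of the square root of a proportion."""
    return math.asin(math.sqrt(p))


proportions = [0.0, 0.25, 0.5, 0.9, 1.0]
on_logit_scale = [logit(p) for p in proportions]
# Predictions made on the logit scale are mapped back with the inverse.
back_on_original_scale = [inv_logit(z) for z in on_logit_scale]
```

Because of the adjustment, values of exactly zero or one are returned as `eps` and `1 - eps` after the round trip rather than as their original values.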

Another common technique for modifying the scale of a predictor is to standardize its value in order to have specific properties. Centering a predictor is a common technique. The predictor’s training set average is subtracted from the predictor’s individual values. When this is applied separately to each variable, the collection of variables would have a common mean value (i.e., zero). Similarly, scaling is the process of dividing a variable by the corresponding training set’s standard deviation. This ensures that the variables have a standard deviation of one. Alternatively, range scaling uses the training set minimum and maximum values to translate the data to be within an arbitrary range (usually zero and one). Again, it is emphasized that the statistics required for the transformation (e.g., the mean) are estimated from the training set and are applied to all data sets (e.g., the test set or new samples).
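The separation between estimating the statistics and applying them can be sketched as follows. The helper names and the small data sets are hypothetical. The point of the example is only that the mean, standard deviation, minimum, and maximum are computed once from the training set and then reused.

```python
import statistics


def fit_center_scale(train):
    """Estimate the centering and scaling statistics from the training set."""
    return statistics.fmean(train), statistics.stdev(train)


def center_scale(values, mean, sd):
    """Subtract the training mean and divide by the training standard deviation."""
    return [(v - mean) / sd for v in values]


def fit_range(train):
    """Estimate the range-scaling statistics from the training set."""
    return min(train), max(train)


def range_scale(values, lo, hi):
    """Map the training minimum to zero and the training maximum to one."""
    return [(v - lo) / (hi - lo) for v in values]


train = [2.0, 4.0, 6.0, 8.0]   # made-up training values
test = [2.0, 5.0, 8.0, 11.0]   # made-up test values

mean, sd = fit_center_scale(train)
train_std = center_scale(train, mean, sd)
test_std = center_scale(test, mean, sd)     # reuses the training statistics

lo, hi = fit_range(train)
test_ranged = range_scale(test, lo, hi)
```

Since the test value 11.0 exceeds the training maximum, its range-scaled value falls above one. This is expected when the statistics come only from the training set.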

These transformations are mostly innocuous and are typically needed when the model requires the predictors to be in common units. For example, when the distance or dot products between predictors are used (such as K-nearest neighbors or support vector machines) or when the variables are required to be on a common scale in order to apply a penalty (e.g., the lasso or ridge regression described in Section 7.3), a standardization procedure is essential.
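To make the distance argument concrete, the sketch below compares Euclidean distances before and after centering and scaling. The three samples, the two predictors (income and age), and the training set means and standard deviations are all made-up values chosen for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two samples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical samples: (annual income in dollars, age in years).
a = (50_000.0, 25.0)
b = (50_500.0, 65.0)   # similar income, very different age
c = (52_000.0, 26.0)   # similar age, modestly different income

# Hypothetical training set statistics for each predictor.
train_means = (55_000.0, 40.0)
train_sds = (20_000.0, 12.0)


def standardize(sample):
    """Center and scale one sample using the training set statistics."""
    return tuple((v - m) / s for v, m, s in zip(sample, train_means, train_sds))


raw_ab, raw_ac = euclidean(a, b), euclidean(a, c)
std_ab = euclidean(standardize(a), standardize(b))
std_ac = euclidean(standardize(a), standardize(c))
```

In the original units, income dominates the calculation and sample `a` appears closer to `b` than to `c`. After standardization, the 40-year age difference carries weight and `a` is closer to `c`.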

Another helpful preprocessing methodology that can be used on data containing a time or sequence effect is simple data smoothing. For example, a running mean can be used to reduce excess noise in the predictor or outcome data prior to modeling. A running 5-point mean would replace each data point with the average of itself and the two data points before and after its position⁵³. As one might expect, the size of the moving window is important; if it is too large, the smoothing effect can eliminate important trends, such as nonlinear patterns.

A short running median can also be helpful, especially if there are significant outliers. When an outlier falls into a moving window, the mean value is pulled towards the outlier. The median would be very insensitive to an aberrant value and is a better choice. It also has the effect of changing fewer of the original data points.
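The difference between the two smoothers can be seen in a small sketch. The function name and the short series are hypothetical and are not the data from Figure 6.2. Values at the beginning and end of the sequence, where a full window is not available, are left as is, which is only one of several possible choices.

```python
import statistics


def running_smoother(values, window=3, stat=statistics.median):
    """Replace each interior point with a statistic of its centered window.

    Points too close to either end for a full window are left unchanged.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    values = list(values)
    half = window // 2
    smoothed = list(values)
    for i in range(half, len(values) - half):
        smoothed[i] = stat(values[i - half:i + half + 1])
    return smoothed


# Made-up increasing series with one aberrant value (50.0).
series = [1.0, 2.0, 3.0, 50.0, 5.0, 6.0, 7.0]

by_median = running_smoother(series, window=3, stat=statistics.median)
by_mean = running_smoother(series, window=3, stat=statistics.fmean)
```

With the running median, the aberrant value is replaced by a neighboring value and several points are unchanged. With the running mean, the three points whose windows contain the outlier are all pulled towards it.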

For example, Figure 6.2 shows a sequence of data over 61 days. The raw data are in the top panel and there are two aberrant data values on days 10 and 45. The lower panel shows the results of a 3-point running median smoother. It does not appear to blunt the nonlinear trend seen in the last 15 days of data but does seem to mitigate the outliers. Also, roughly 40% of the smoothed data have the same value as the original sequence.

Figure 6.2: A sequence of outcome values over time. The raw data contain outliers on days 10 and 45. The smoothed values are the result of a 3-point running median.

Other smoothers can be used, such as smoothing splines (also described in this chapter). However, the simplicity and robustness of a short running median can be an attractive approach to this type of data. This smoothing operation can be applied to the outcome data (as in Figure 6.2) and/or to any sequential predictors. The latter case is mentioned in Section 3.4.7 in regard to information leakage. It is important to make sure that the test set predictor data are smoothed separately to avoid having the training set influence values in the test set (or new unknown samples). Smoothing predictor data to reduce noise is explored more in Section 9.4.

⁵³ There are various approaches for handling the beginning and end of the data, such as leaving the data as is.