6 Engineering Numeric Predictors

In the previous chapter we provided methods for modifying qualitative predictors. Often, other predictors have continuous, real-number values. This chapter develops tools for converting these types of predictors into a form that a model can better utilize.

Predictors that are on a continuous scale are subject to a host of potential issues that we must learn how to mitigate. Some of the problems that are prevalent with continuous predictors can be mitigated through the type of model that we choose. For example, models that construct relationships between the predictors and the response based on the rank of the predictors’ values rather than the actual values, like tree-based modeling techniques, are immune to predictor distributions that are skewed or to individual samples that have unusual values (i.e. outliers). Other models such as K-nearest neighbors and support vector machines are much more sensitive to predictors with skewed distributions or outliers. Continuous predictors that are highly correlated with each other are another regularly occurring scenario that presents a problem for some models but not for others. Partial least squares, for instance, is specifically built to directly handle highly correlated predictors. But models like multiple linear regression or neural networks are adversely affected in this situation.

If we desire to utilize and explore the predictive ability of more types of models, we need to address the issues presented by the predictors and engineer them in a useful way.

In this chapter we will provide approaches for handling continuous predictors with commonly occurring issues and illustrate how to apply them. Our data may:

  • be on vastly different scales.
  • follow a skewed distribution where a small proportion of samples are orders of magnitude larger than the majority of the data (i.e. skewness).
  • contain a small number of extreme values.
  • be censored on the low and/or high end of the range. This indirectly indicates that samples at the extremes have less accurate values than samples in the measurement range.
  • have a complex relationship with the response that is truly predictive but cannot be adequately represented with a common function or extracted by sophisticated models.
  • contain relevant and overly redundant information. That is, the information collected could be more effectively and efficiently represented with a smaller, consolidated number of new predictors while still preserving or enhancing the new predictors’ relationship with the response.

We have organized the techniques in this chapter into three general categories. The first category of engineering techniques comprises those that address problematic characteristics of individual predictors (Section 6.1). Section 6.2 illustrates methods for expanding individual predictors into many predictors in order to better represent more complex predictor-response relationships and to enable the extraction of predictive information. Last, Section 6.3 provides methodology for consolidating information across many predictors. In the end, the goal of all of these approaches is to convert the existing continuous predictors into a form that can be utilized by any model and that presents the most useful information to the model.

As with many techniques discussed in this text, the need for these can be very data- and model-dependent. For example, transformations to resolve skewness or outliers would not be needed for some models but would be critical for others to perform well. In this chapter, a guide to which models would benefit from specific preprocessing methods is given.

6.1 1:1 Transformations

There are a variety of modifications that can be made to an individual predictor that might improve its utility in a model. The first type of transformations to a single predictor discussed here are those that change the scale of the data. A good example is the transformation described in Figure 1.3 of the first chapter. In that case, the two predictors had very skewed distributions and it was shown that using the inverse of the values improved model performance. Figure 6.1(a) shows the test set distribution for one of the predictors from Figure 1.3.


Fig. 6.1: The distribution of a skewed predictor before (a) and after (b) applying the Box-Cox transformation.

A Box-Cox transformation (Box and Cox 1964) was used to estimate this transformation. The Box-Cox procedure, originally intended as a transformation of a model’s outcome, uses maximum likelihood estimation to estimate a transformation parameter \(\lambda\) in the transformation equation

\[ x^{*} = \left\{ \begin{array}{l l} \frac{x^{\lambda}-1}{\lambda\: \tilde{x}^{\lambda-1}}, & \lambda \neq 0 \\ \tilde{x} \: \log x, & \lambda = 0 \\ \end{array} \right. \]

where \(\tilde{x}\) is the geometric mean of the predictor data. In this procedure, \(\lambda\) is estimated from the data. Because the parameter of interest is in the exponent, this type of transformation is called a power transformation. Some values of \(\lambda\) map to common transformations, such as \(\lambda = 1\) (no transformation), \(\lambda = 0\) (log), \(\lambda = 0.5\) (square root), and \(\lambda = -1\) (inverse). As you can see, the Box-Cox transformation is quite flexible in its ability to address many different data distributions. For the data in Figure 6.1, the parameter was estimated from the training set data to be \(\widehat{\lambda} = -1.09\). This is effectively the inverse transformation. Figure 6.1(b) shows the results when the transformation is applied to the test set, and yields a transformed distribution that is approximately symmetric. It is important to note that the Box-Cox procedure can only be applied to data that are strictly positive; it cannot be used on problematic distributions that contain zero or negative values. To address this problem, Yeo and Johnson (2000) devised an analogous procedure that can be used on any numeric data.
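
For readers working in Python, a minimal sketch of this estimate-on-training, apply-everywhere pattern uses scikit-learn's PowerTransformer; the simulated skewed predictors below are placeholders, not the data behind Figure 6.1:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(42)
# hypothetical strictly positive, right-skewed training and test predictors
x_train = rng.lognormal(mean=2.0, sigma=0.8, size=(500, 1))
x_test = rng.lognormal(mean=2.0, sigma=0.8, size=(200, 1))

# Box-Cox requires strictly positive data
bc = PowerTransformer(method="box-cox", standardize=False).fit(x_train)
print(bc.lambdas_)                         # lambda estimated from the training set only
x_test_transformed = bc.transform(x_test)  # the same transformation applied to the test set

# Yeo-Johnson is the analogous procedure that accepts zero or negative values
yj = PowerTransformer(method="yeo-johnson", standardize=False).fit(x_train)
```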

Also, note that both transformations are unsupervised since, in this application, the outcome is not used in the computations. While the transformation might improve the predictor distribution, it has no guarantee of improving the model. However, there are a variety of parametric models that utilize polynomial calculations on the predictor data, such as most linear models, neural networks, and support vector machines. In these situations, a skewed predictor distribution can have a harmful effect on these models since the tails of the distribution can dominate the underlying calculations.

Another important transformation applies to variables whose values are bounded between zero and one, such as proportions. The problem with modeling this type of outcome is that model predictions may not be guaranteed to fall within the same boundaries. For data between zero and one, the logit transformation could be used. If \(\pi\) is the variable, the logit transformation is

\[ logit(\pi) = log\left(\frac{\pi}{1-\pi}\right) \]

This transformation changes the scale from values between zero and one to values between negative and positive infinity. At the extremes, when the data are exactly zero or one, a small constant can be added or subtracted to avoid infinite values. Once model predictions are created, the inverse logit transformation can be used to place the values back on their original scale. An alternative to the logit transformation is the arcsine transformation, which is applied to the square root of the proportions (e.g. \(y^* = \arcsin(\sqrt{\pi})\)).
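
A minimal sketch of these transformations with SciPy; the offset value used to nudge exact zeros and ones is an arbitrary choice, not one prescribed here:

```python
import numpy as np
from scipy.special import logit, expit

proportions = np.array([0.0, 0.05, 0.50, 0.95, 1.0])

# move exact zeros and ones inward by a small constant before the logit
eps = 1e-6
adjusted = np.clip(proportions, eps, 1 - eps)
logit_scale = logit(adjusted)          # unbounded values, suitable for modeling
back_transformed = expit(logit_scale)  # inverse logit restores the original scale

# alternative: arcsine applied to the square root of the proportions
arcsine_scale = np.arcsin(np.sqrt(proportions))
```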

Another common technique for modifying the scale of a predictor is to standardize its values so that they have specific properties. Centering the predictors is a common technique: the predictor’s training set average is subtracted from the predictor’s individual values. When this is applied separately to each variable, the collection of variables has a common mean value (i.e. zero). Similarly, scaling is the process of dividing a variable by the corresponding training set’s standard deviation, which ensures that the variables have a standard deviation of one. Alternatively, range scaling uses the training set minimum and maximum values to translate the data to an arbitrary range (usually zero to one). Again, it is emphasized that the statistics required for the transformation (e.g. the mean) are estimated from the training set and are applied to all data sets (e.g., the test set or new samples).
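
A sketch of this fit-on-training, apply-to-everything pattern with scikit-learn; the arrays are simulated placeholders:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=50.0, scale=10.0, size=(300, 3))  # hypothetical predictors
X_test = rng.normal(loc=50.0, scale=10.0, size=(100, 3))

# means and standard deviations are estimated from the training set only...
scaler = StandardScaler().fit(X_train)
# ...and then applied to the training set, the test set, and any new samples
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

# range scaling to [0, 1] follows the same fit/transform pattern
ranger = MinMaxScaler().fit(X_train)
X_test_01 = ranger.transform(X_test)
```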

These transformations are mostly innocuous and are typically needed when the model requires the predictors to be in common units. For example, when distances or dot products between predictors are used (such as in K-nearest neighbors or support vector machines) or when the variables are required to be on a common scale in order to apply a penalty (e.g. the lasso or ridge regression), a standardization procedure is essential.

Another helpful preprocessing methodology that can be used on data containing a time or sequence effect is simple data smoothing. For example, a running mean can be used to reduce excess noise in the predictor or outcome data prior to modeling. A running 5-point mean would replace each data point with the average of itself and the two data points before and after its position. As one might expect, the size of the moving window is important; if it is too large, the smoothing effect can eliminate important trends, such as nonlinear patterns.

A short running median can also be helpful, especially if there are significant outliers. When an outlier falls into a moving window, the mean value is pulled towards the outlier. The median would be very insensitive to an aberrant value and is a better choice. It also has the effect of changing fewer of the original data points.

For example, Figure 6.2 shows a sequence of data over 61 days. The raw data are in the top panel and there are two aberrant data values on days 10 and 45. The lower panel shows the results of a 3-point running median smoother. It does not appear to blunt the nonlinear trend seen in the last 15 days of data but does seem to mitigate the outliers. Also, roughly 40% of the smoothed data have the same value as the original sequence.
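
A running median like the one in Figure 6.2 can be computed with pandas; the series below is simulated and only stands in for the figure's actual data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
days = pd.RangeIndex(start=1, stop=62, name="day")
y = pd.Series(np.sin(np.linspace(0, 3, 61)) + rng.normal(0, 0.1, 61), index=days)
y.iloc[[9, 44]] += 3.0   # inject aberrant values on days 10 and 45

# centered 3-point running median; the two endpoints keep their original values
smoothed = y.rolling(window=3, center=True).median().fillna(y)
```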


Fig. 6.2: A sequence of outcome values over time. The raw data contain outliers on days 10 and 45. The smoothed values are the result of a 3-point running median.

Other smoothers can be used, such as smoothing splines (also described in this chapter). However, the simplicity and robustness of a short running median can be an attractive approach to this type of data. This smoothing operation can be applied to the outcome data (as in Figure 6.2) and/or to any sequential predictors. The latter case is mentioned in Section 3.4.6 in regard to information leakage. It is important to make sure that the test set predictor data are smoothed separately to avoid having the training set influence values in the test set (or new unknown samples).

6.2 1:Many Transformations

The previous chapter illustrated the process of creating multiple numeric indicator columns from a single qualitative predictor. Similar one-to-many transformations of the data can be used to improve model performance.

6.2.1 Nonlinear Features via Basis Expansions and Splines

A basis expansion of a predictor \(x\) can be achieved by deriving a set of functions \(f_i(x)\) that can be combined using a linear combination. For example, in the last chapter, polynomial contrast functions were used to encode ordered categories. For a continuous predictor \(x\), a cubic basis expansion is

\[ f(x) = \sum_{i=1}^3 \beta_i f_i(x) = \beta_1 x + \beta_2 x^2 + \beta_3 x^3 \]

To use this basis expansion in the model, the original column is augmented by two new features containing the squared and cubed versions of the original. The \(\beta\) values could be estimated using basic linear regression. If the true trend were linear, the quadratic and cubic regression parameters would presumably be near zero (at least relative to their standard errors). Also, the linear regression used to determine the coefficients for the basis expansion can be estimated in the presence of other regressors.
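
A small sketch of this cubic expansion in Python with a simulated predictor (the data are illustrative only and the true trend here is linear by construction):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(3)
x = rng.uniform(-2, 2, size=(200, 1))
y = 1.5 * x[:, 0] + rng.normal(0, 0.5, 200)   # truly linear relationship

# augment the original column with squared and cubed versions
cubic = PolynomialFeatures(degree=3, include_bias=False)
X_expanded = cubic.fit_transform(x)           # columns: x, x^2, x^3

fit = LinearRegression().fit(X_expanded, y)
print(fit.coef_)   # the higher-order coefficients should be near zero in this case
```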


Fig. 6.3: The sale price and lot area data for the training set.

This type of basis expansion, where the pattern is applied globally to the predictor, can often be insufficient. For example, take the lot size variable in the Ames housing data, shown in Figure 6.3. When plotted against the sale price, there is a linearly increasing trend in the mainstream of the data (between \(10^{3.75}\) and \(10^{4.25}\)), but on either side of the bulk of the data, the patterns are negligible.

An alternative to creating a global basis function for use in a regression model is a polynomial spline (Wood 2006; Eilers and Marx 2010). Here, the basis expansion creates different regions of the predictor space whose boundaries are called knots. The polynomial spline uses polynomial functions, usually cubic, within each of these regions to represent the data. We would like the cubic functions to be connected at the knots, so specialized functions can be created to ensure an overall continuous function. Here, the number of knots controls the number of regions and also the potential complexity of the function. If the number of knots is low (perhaps three or fewer), the basis function can represent relatively simple trends. Functions with more knots are more adaptable but also more likely to overfit the data.

The knots are typically chosen using percentiles of the data so that a relatively equal amount of data are contained within each region. For example, a spline with three regions typically places the knots at the 33.3% and 66.7% percentiles. Once the knots are determined, the basis functions can be created and used in a regression model. There are many types of polynomial splines and the approach described here is often called a natural cubic spline.

For the Ames lot area data, let’s consider a spline with 6 regions. The procedure places the knots at the following percentiles: 16.7%, 33.3%, 50%, 66.7%, 83.3%. When creating the natural spline basis functions, the first function is typically taken as the intercept. The other functions of \(x\) generated by this procedure are shown in Figure 6.4(a). Here the blue lines indicate the knots. Note that the first three features contain regions where the basis function value is zero. This implies that those features do not affect that part of the predictor space. The other three features appear to put the most weight on areas outside the mainstream of the data. Panel (b) of this figure shows the final regression form. There is a strong linear trend in the middle of the cloud of points and weak relationships outside of the mainstream. The left-hand side does not appear to fit the data especially well. This might imply that more knots are required.


Fig. 6.4: (a) Features created using a natural spline basis function for the lot area predictor. The blue lines correspond to the knots. (b) The fit after the regression model estimated the model coefficients for each feature.

How many knots should be used? This is a tuning parameter that can be determined via grid search or through visual inspection of the smoother results (e.g. Figure 6.4(b)). In a single dimension, if the model is overfitting, there will be visual evidence. Also, some spline functions use a method called generalized cross-validation (GCV) to estimate the appropriate spline complexity using a computational shortcut for linear regression (Golub, Heath, and Wahba 1979).
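
One option for generating a natural cubic spline basis in Python is the patsy package; this is only a sketch (the values below simulate the logged lot area, and the six degrees of freedom mirror the six-region example above):

```python
import numpy as np
from patsy import dmatrix, build_design_matrices

rng = np.random.default_rng(11)
log_lot_area = rng.normal(loc=4.0, scale=0.25, size=500)   # hypothetical stand-in values

# 'cr' builds a natural cubic regression spline basis; interior knots are
# placed at quantiles of the supplied data, as described in the text
basis = dmatrix("cr(x, df=6)", {"x": log_lot_area}, return_type="dataframe")

# new samples are expanded with the *same* knots learned from the training data
new_basis = build_design_matrices([basis.design_info], {"x": [3.8, 4.2]})[0]
```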

Note that the basis function was created prior to being exposed to the outcome data. This does not have to be the case. One method of determining the optimal complexity of the curve is to initially assign every training set point as a potential knot and then use regularized regression models (similar to weight decay or ridge regression) to determine which instances should be considered knots. This approach is used by the smoothing spline methodology (Yandell 1993). Additionally, there is a rich class of models called generalized additive models (GAMs). These extend generalized linear models, which include linear and logistic regression, to have nonlinear terms for individual predictors (but do not, in their basic form, model interactions). GAM models can adaptively fit separate basis functions for different variables and estimate the complexity for each. In other words, different predictors can be modeled with different levels of complexity. Additionally, there are many other types of supervised nonlinear smoothers that can be used. For example, the loess model fits a weighted moving regression line across a predictor to estimate the nonlinear pattern in the data. See Wood (2006) for more technical information about GAM models.

One other feature construction method, related to splines and used by the multivariate adaptive regression spline (MARS) model (Friedman 1991), is the single fixed-knot spline. The hinge function transformation used by that methodology is

\[h(x) = x I(x > 0)\]

where \(I\) is an indicator function that is one when \(x\) is greater than zero and zero otherwise. For example, if the log of the lot area were represented by \(x\), \(h(x - 3.75)\) generates a new feature that is zero when the log lot area is less than 3.75 and is equal to \(x - 3.75\) otherwise. The opposite feature, \(h(3.75 - x)\), is zero above a value of 3.75. The effect of the transformation can be seen in Figure 6.5(a), which illustrates how a pair of hinge functions isolate certain areas of the predictor space.

These features can be added to models to create segmented regression models that have distinct sections with different trends. As previously mentioned, the lot area data shown in Figure 6.4 have a linear trend in the central region of the data and, on each extreme, the trends are more flat. Suppose that two sets of hinge functions were added to a linear model with knots at 3.75 and 4.25. This would generate separate linear trends in the region below a value of \(10^{3.75}\), between \(10^{3.75}\) and \(10^{4.25}\), and above \(10^{4.25}\). When added to a linear regression, the left-most region’s slope would be driven by the “left-handed” hinge functions for both knots. The middle region would involve the slopes for both terms associated with a knot of \(10^{3.75}\) as well as the left-handed hinge function with the knot at \(10^{4.25}\), and so on. The model fit associated with this strategy is shown in Figure 6.5(b).
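
A minimal sketch of building hinge features and a segmented fit; the knot values follow the 3.75/4.25 example above, but the data themselves are simulated:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def hinge(u):
    """MARS-style hinge: u when u > 0, otherwise 0."""
    return np.maximum(u, 0.0)

rng = np.random.default_rng(5)
x = rng.uniform(3.3, 4.7, 400)                    # stand-in for log10 lot area
y = 2.0 * hinge(x - 3.75) - 1.5 * hinge(x - 4.25) + rng.normal(0, 0.2, 400)

# left- and right-handed hinge features for knots at 3.75 and 4.25
X = np.column_stack([hinge(x - 3.75), hinge(3.75 - x),
                     hinge(x - 4.25), hinge(4.25 - x)])
segmented = LinearRegression().fit(X, y)          # piecewise linear trends per region
```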


Fig. 6.5: (a) A pair of hinge functions with a knot at 3.75, isolating different regions of the lot area predictor. (b) The fit of a linear regression using two sets of hinge features with knots at 3.75 and 4.25.

This feature generation function is known in the neural network and deep learning fields as the rectified linear unit (ReLU) activation function. Nair and Hinton (2010) gives a summary of their use in these areas.

While basis functions can be effective at helping build better models by representing nonlinear patterns for a feature, they can also be very effective for exploratory data analysis. Visualizations, such as Figures 6.4(b) and 6.5, can be used to inform the modeler about what the potential functional form of the prediction should be (e.g. log-linear, quadratic, segmented, etc.).

6.2.2 Discretize Predictors as a Last Resort

Binning, also known as categorization or discretization, is the process of translating a quantitative variable into a set of two or more qualitative buckets. For example, a variable might be translated into quantiles; the buckets would indicate whether the values fell into the first 25% of the data, between the 25th percentile and the median, etc. In this case, there would be four distinct values of the binned version of the data.

There are a few apparent reasons for subjecting the data to such a transformation:

  • Some feel that it simplifies the analysis and/or interpretation of the results. Suppose that a person’s age was a predictor and this was binned by whether someone was above 40 years old or not. One might be able to make a statement that there is a 25% increase in the probability of the event between younger and older people. There is no discussion of per-unit increases in the response.

  • Binning may avoid the problem of having to specify the relationship between the predictor and outcome. A set of bins can be perceived as being able to model more patterns without having to visualize or think about the underlying pattern.

  • Using qualitative predictors may give the perception that it reduces the variation in the data. This is discussed at length below.

There are a number of methods for binning data. Some are unsupervised and are based on user-driven cutoffs or estimated percentiles (e.g. the median). In other cases, the placement (and number) of cut-points are optimized to improve performance. For example, if a single split were required, an ROC curve (Section 3.2.2) could be used to find an appropriate exchange of sensitivity and specificity.

There are a number of problematic issues with turning continuous data categorical. First, it is extremely unlikely that the underlying trend is consistent with the new model. Second, when a real trend exists, discretizing the data most likely makes it harder for the model to do an effective job since all of the nuance in the data has been removed. Third, there is probably no objective rationale for a specific cut-point. Fourth, when there is no relationship between the outcome and the predictor, there is a substantial increase in the probability that an erroneous trend will be “discovered”. This has been widely researched and verified. See Altman (1991), Altman et al. (1994), and the references therein.

Kenny and Montanari (2013) do an exceptional job of illustrating how this can occur and our example follows their research. Suppose a set of variables are being screened to find which have a relationship with a numeric outcome. Figure 6.6(a) shows an example of a simulated data set with a linear trend with a coefficient of 1.0 and normally distributed errors with a standard deviation of 4.0. A linear regression of these data (shown) would find an increasing trend and would have an estimated \(R^2\) of 30.3%. The data were cut into 10 equally spaced bins and the average outcome was estimated per bin. These data are shown in panel (b) along with the regression line. Since the average is being used, the estimated \(R^2\) is much larger (64.2%) since the model thinks that the data are more precise (which, in reality, they are not).


Fig. 6.6: Raw data (a) and binned (b) simulation data. Also, the results of a simulation (c) where the predictor is noninformative.

Unfortunately, this artificial reduction in variation can lead to a high false positive rate when the predictor is unrelated to the outcome. To illustrate this, the same data were used but with a slope of zero (i.e., the predictor is uninformative). This simulation was run a large number of times with and without binning the data and Figure 6.6(c) shows the distribution of the estimated \(R^2\) values across the simulations. For the raw data, the largest \(R^2\) value seen in 1000 data sets was 18.3% although most data are less than 20%. When the data were discretized, the \(R^2\) values tended to be much larger (with a maximum of 76.9%). In fact, 19% of the simulations had \(R^2 \ge 20\)%.
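A sketch of this kind of simulation (slope of zero, 10 equal-width bins); the settings mirror the description above, but the exact numbers produced will differ from those reported for the figure:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(13)

def r_squared(x, y):
    X = np.asarray(x).reshape(-1, 1)
    return LinearRegression().fit(X, y).score(X, y)

raw_r2, binned_r2 = [], []
for _ in range(1000):
    x = rng.uniform(0, 10, 50)
    y = rng.normal(0, 4.0, 50)              # slope of zero: the predictor is uninformative
    raw_r2.append(r_squared(x, y))

    # cut into 10 equally spaced bins and average the outcome within each bin
    bins = pd.cut(x, 10)
    means = pd.Series(y).groupby(bins, observed=True).mean()
    centers = np.array([interval.mid for interval in means.index])
    binned_r2.append(r_squared(centers, means.to_numpy()))

print(max(raw_r2), max(binned_r2))          # the binned maximum is typically far larger
```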

There is a possibility that a discretization procedure might improve the model; the No Free Lunch Theorem discounts the idea that a methodology will never work. If a categorization strategy is to be used, especially if it is supervised, we suggest that:

  1. The strategy should not be part of normal operating procedure. Categorizing predictors should be a method of last resort.
  2. The determination of the bins must be included inside of the resampling process. This will help diagnose when nonexistent relationships are being induced by the procedure and will mitigate the overestimation of performance when an informative predictor is used. Also, since some sort of procedure would have to be defined to be included inside of resampling, it would prevent the typical approach of “eyeballing” the data to pick cut-points. We have found that a binning strategy based solely on visual inspection of the data has a higher propensity of overfitting.

6.3 Many:Many Transformations

When creating new features from multiple predictors, there is a possibility of correcting a variety of issues such as outliers or collinearity. It can also help reduce the dimensionality of the predictor space in ways that might improve performance and reduce computational time for models.

6.3.1 Linear Projection Methods

Observational or available data sets nowadays contain many predictors, perhaps more than the number of available samples. The predictor soup ingredients are usually anything we may (or may not) consider as potentially related to the response. A conventional thought process is that the more ingredients we include, the better the soup will be. But this is not the case; including irrelevant predictors can have detrimental effects on the final model (see Chapter 11). The negative impacts include: increasing the computational time required for training a model, decreasing the predictive performance, and complicating the predictor importance calculation. By including any potential predictor, we end up with a watered down soup that is eventually tossed to the pigs.

The tension we have in working with modern data is that we often do not fully know which predictors are relevant to the response. What may in fact be true is that some of the available predictors (or combinations) have a meaningful association with the response. Under these conditions, our task is to try to identify the simple or complex combination that is relevant to the response.

In this section we will cover projection methods that have been shown to effectively identify meaningful projections of the original predictors. The specific techniques that we will focus on here are principal component analysis (PCA), kernel PCA, independent component analysis (ICA), non-negative matrix factorization (NNMF), and partial least squares (PLS). The first four techniques in this list are unsupervised techniques that are ignorant of the response. The final technique uses the response to direct the dimension reduction process so that the reduced predictor space is optimally associated with the response.

These methods are linear in the sense that they take a matrix of numeric predictor values (\(X\)) and create new components that are linear combinations of the original data. These components are usually called the scores. If there are \(n\) training set points and \(p\) predictors, the scores could be denoted as the \(n \times p\) matrix \(X^*\) and are created using

\[ X^* = X A \]

where the \(p \times p\) matrix \(A\) is often called the projection matrix. As a result, the value of the first score for data point i would be:

\[ x^*_{i1} = a_{11} x_{i1} + a_{12}x_{i2} + \ldots + a_{1p}x_{ip} \]

One possibility is to use only a portion of \(A\) to reduce the number of scores down to \(k < p\). For example, it may be that the last few columns of \(A\) show no benefit to describing the predictor set. This notion of dimension reduction should not be confused with feature selection. If we can describe the predictor space with \(k\) score variables, there are fewer columns in the data given to the model. However, each of these columns is a function of all of the original predictors.
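
As a small illustration of the notation (the projection matrix \(A\) below is random and purely a placeholder; each method described later defines its own \(A\)):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, k = 100, 8, 3
X = rng.normal(size=(n, p))        # predictor matrix
A = rng.normal(size=(p, p))        # stand-in projection matrix (method-specific in practice)

scores_full = X @ A                # all p score columns
scores_reduced = X @ A[:, :k]      # keep only the first k score columns
```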

The difference between the projection methods mentioned above is in how the projection values are determined.

As we will see below, each of the unsupervised techniques has its own objective for summarizing or projecting the original predictors into a lower dimensional subspace of size \(k\) by using only the first \(k\) columns of \(A\). By condensing the predictor information, we can reduce predictor redundancy and tease out potential signals (as opposed to the noise) within the subspace. A direct benefit of condensing information to a subspace of the predictors is a decrease in the computational time required to train a model. But we must be aware that employing an unsupervised technique does not guarantee an improvement in predictive performance. The primary underlying assumption we make when using an unsupervised projection method is that the identified subspace is predictive of the response. This may or may not be true since the objective of unsupervised techniques is completely disconnected from the response. At the conclusion of model training, we may find that unsupervised techniques are convenient tools to condense the original predictors, but do not help to effectively improve predictive performance.

The supervised technique we will illustrate, PLS, improves on unsupervised techniques by focusing on the objective of finding a subspace of the original predictors that is directly associated with the response. Generally, this approach finds a smaller and more efficient subspace of the original predictors. The drawback of using a supervised method is that we need a large enough data set to reserve samples for testing and validation in order to avoid overfitting.

6.3.1.1 Principal Component Analysis

We first encountered principal components analysis as an exploratory visualization technique in Section 4.2.7. We will now explore PCA in more detail and investigate how and when it can be effectively used to create new features. Principal component analysis was originally developed by Karl Pearson more than a century ago and, despite its age, is still one of the most widely used dimension reduction tools. A good reference for this method is Abdi and Williams (2010). Its relevance in modern data stems from the underlying objective of the technique. Specifically, the objective of PCA is to find linear combinations of the original predictors such that the combinations summarize the maximal amount of variation in the original predictor space. From a statistical perspective, variation is synonymous with information. So, by finding combinations of the original predictors that capture variation, we find the subspace of the data that contains the information relevant to the predictors. Simultaneously, the new scores are required to be orthogonal (i.e. uncorrelated) to each other. The property of orthogonality enables the predictor space variability to be neatly partitioned in a way that does not overlap. It turns out that the variability summarized across all of the principal components is exactly equal to the variability across the original predictors. In practice, we compute enough new score variables to account for some pre-specified amount of the variability in the original predictors (e.g., 95%).

An important side benefit of this technique is that the resulting PCA scores are uncorrelated. This property is very useful for modeling techniques (e.g. multiple linear regression, neural networks, support vector machines, and others) that need the predictors to be relatively uncorrelated. This is discussed again in the context of ICA below.

PCA is a particularly useful tool when the available data are composed of one or more clusters of predictors that contain redundant information (e.g. predictors that are highly correlated with one another). Predictors that are highly correlated can be thought of as existing in a lower dimensional space than the original data. As a trivial example, consider the scatter plot in Figure 6.7 (a), where two predictors are linearly related. While this data has been measured in two dimensions, we really only need one dimension (a line) to adequately summarize the position of the samples. That is, these data could be nearly equivalently represented by a combination of these two variables as displayed in part (b) of the figure. The red points in this figure are the orthogonal projections of the original data onto the first principal component score (imagine the diagonal line being rotated to horizontal). The new feature represents the original data by one numeric value which is the distance from the origin to the projected value. After creating the first PCA score, the “leftover” information is represented by the length of the blue lines in panel (b). These distances would be the final score (i.e. the second component).


Fig. 6.7: Principal component analysis of two highly correlated predictors. (a) A scatter plot of two highly correlated predictors with the first principal direction represented with a dashed line. (b) The orthogonal projection of the original data onto the first principal direction. The new feature is represented by the red points.

For this simple example, a linear regression model was built using the two correlated predictors. A linear regression model was also built between the first principal component score and the response. Figure 6.8 provides the relationship between the observed and predicted values from each model. For this example, there is virtually no difference in performance between these models.


Fig. 6.8: A linear combination of the two predictors chosen to summarize the maximum variation contains essentially the same information as both of the original predictors.

Principal component regression, which describes the process of first reducing dimension and then regressing the new component scores onto the response, is a two-step process (Massy 1965). The two steps of this process are disjoint, which means that the dimension reduction step is not directly connected with the response. Therefore, the principal component scores associated with the variability in the data may or may not be related to the response.

To provide a practical example of PCA, let’s return to the Chicago ridership data that we first explored in Chapter 4. Recall that ridership across many stations was highly correlated (Figure 4.9). Of the 125 stations, 94 had at least one pairwise correlation with another station greater than 0.9. What this means is that if we know the ridership at one station, we can estimate the ridership at many other stations with a high degree of accuracy. In other words, the information we need about ridership is more simply contained in a much lower dimension of this data.

The Chicago data can be used to demonstrate how projection methods work. Recall that the main driving force in the data is the difference between weekend and weekday ridership. The gulf between the data for these two groups completely drives the principal component analysis since it accounts for most of what is going on in the data. Rather than showing this analysis, the weekend data were analyzed here. There were 1628 days falling on either Saturday or Sunday in the training data. Using only the 14-day lag predictors, PCA was conducted on the weekend data after the columns were centered and scaled.
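
A sketch of this preprocessing pattern with scikit-learn; the lagged-ridership matrix below is simulated and only mimics the shape of the Chicago weekend data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(14)
# stand-in for the 14-day lag predictors on the 1628 weekend days
lagged_train = rng.normal(size=(1628, 125))

# center and scale the columns, then compute the principal component scores
pca_pipe = make_pipeline(StandardScaler(), PCA())
scores = pca_pipe.fit_transform(lagged_train)

pca = pca_pipe.named_steps["pca"]
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components_95 = int(np.argmax(cumulative >= 0.95)) + 1  # components covering 95% of the variation
```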


Fig. 6.9: Score values from several linear projection methods for the weekend ridership at Clark and Lake. The x-axis values are the scores for each method and the y-axis is ridership (in thousands).

The first row of Figure 6.9 shows the first four components and their relationships to the outcome. The first component, which accounts for 60.4% of the variation in the lagged predictors, shows a strong linear relationship with ridership. For this component, all of the coefficients of the projection matrix \(A\) corresponding to this linear combination are positive, and the five most influential stations were 35th/Archer, Ashland, Austin, Oak Park, and Western. The five least important stations for the first component were Washington/Wells, Linden, Garfield, O’Hare, and Central.

The second component only accounts for 5% of the variability and has less correlation with the outcome. In fact, the remaining components capture smaller and smaller amounts of information in the lagged predictors. To cover 95% of the original information, a total of 36 principal components would be required.

One important question that PCA might help answer is “Is there a line effect on the weekends?” To help answer this, the first five components are visualized in Figure 6.10 using a heatmap. In the heatmap, the color of the cells represents the sign and magnitude of the elements of \(A\). The rows have been reordered based on a clustering routine that uses the five components. The row labels show the line (or lines) for the corresponding station.

Fig. 6.10: A heatmap of the first five principal component coefficients. The rows are clustered using all five components.

The first column shows that the relative weight for each station in the first component is reliably positive and small (in comparison to the other components); there is no line effect in the main underlying source of information. The other components appear to isolate specific lines (or subsets within lines). For example, the second component is strongly driven by stations along the Pink line (the dark red section), a section of the Blue line going towards the northwest, and a portion of the Green line that runs north of the city. Conversely, the fourth component has a strong negative cluster for the same Green line stations. In general, the labels along the y-axis show that the lines are fairly well clustered together, which indicates that there is a line effect after the first component’s effects are removed.

Despite the small amount of variation that the second component captures, we’ll use its coefficients in \(A\) to accentuate the geographic trends in the weekend data. The second PCA component’s rotation values have both positive and negative values. Their sign and magnitude can help inform us as to which stations are driving this component. As seen in Figure 6.11, the stations on the Red and Purple lines that head due north along the coast have substantial positive effects on the second component. Southwestern stations along the Pink line have large negative effects (as shown by the larger red markers), as does the O’Hare station (in the extreme northwest). Also, many of the stations clustered around Clark and Lake have small or moderately negative effects on the second component. It is interesting that two stations in close proximity (Illinois Medical District and Polk) can have substantial effects on the component but with opposing signs. These two stations are on different lines (Blue and Pink, respectively), and this may be another indication that there is a line effect in the data once the main piece of information (i.e., the first component) is removed.

Fig. 6.11: Coefficients for the second principal component. The size reflects the relative magnitude of the coefficient, while red indicates negative values, blue positive values, and white values near zero.

There are a variety of methods for visualizing the PCA results. It is common to plot the first few scores against one another to look for separation between classes, outliers, or other interesting patterns. Another visualization is the biplot (Greenacre 2010), an informative graphic that overlays the scores along with the coefficients of \(A\). In our data, there are 126 highly-correlated stations involved in the PCA analysis and, in this case, the biplot is not terribly effective.

One other note about plotting the PCA results. Since each PCA component captures less and less variability, it is common for the scale of the scores to become smaller and smaller. For this reason, it is important to plot the component scores on a common axis scale that accounts for all of the data in the scores (i.e. the largest possible range). In this way, small patterns in components that capture a small percentage of the information in the data will not be over-interpreted.

6.3.1.2 Kernel Principal Component Analysis

Principal component analysis is an effective dimension reduction technique when predictors are linearly correlated and when the resulting scores are associated with the response. However, the orthogonal partitioning of the predictor space may not provide a good predictive relationship with the response, especially if the true underlying relationship between the predictors and the response is non-linear. Consider, for example, a hypothetical relationship between a response (\(y\)) and two predictors (\(x_1\) and \(x_2\)) as follows:

\[ y = x_1 + x^{2}_{1} + x_2 + x^{2}_{2} + \epsilon \] In this equation \(\epsilon\) represents random noise. Suppose also that the correlation between \(x_1\) and \(x_2\) is strong as in Figure 6.7. Applying traditional PCA to this example would summarize the relationship between \(x_1\) and \(x_2\) into one principal component. However this approach would ignore the important quadratic relationships that would be critically necessary for predicting the response. In this two-predictor example, a simple visual exploration of the predictors and the relationship between the predictors and the response would lead us to understanding that the quadratic versions of the predictors are required for predicting the relationship with the response. But the ability to visually explore for important higher-order terms diminishes as the dimension of the data increases. Therefore traditional PCA will be a much less effective dimension reduction technique when the data should be augmented with additional features.

If we knew that one or more quadratic terms were necessary for explaining the response, then we could add these terms to the predictors to be evaluated in the modeling process. Another approach would be to use a nonlinear basis expansion as described earlier in this chapter. A third possible approach is kernel PCA (Schölkopf, Smola, and Müller 1998; Shawe-Taylor and Cristianini 2004). The kernel PCA approach combines a specific mathematical view of PCA with kernel functions and the kernel ‘trick’ to enable PCA to expand the dimension of the predictor space in which dimension reduction is performed. To better understand what kernel PCA is doing, we first need to recognize that the variance summarization problem posed by PCA can be solved by finding the eigenvalues and eigenvectors of the covariance matrix of the predictors. It turns out that the eigen decomposition of the covariance matrix can be equivalently written in terms of the outer product of the samples. This is known as the ‘dual’ representation of the problem. The outer product is referred to as the linear kernel; the linear kernel between \(x_{1}\) and \(x_{2}\) is represented as:

\[ k\left(\mathbf{x}_{1},\mathbf{x}_{2} \right) = \langle \mathbf{x}_{1},\mathbf{x}_{2} \rangle = \mathbf{x}^{T}_1\mathbf{x}_2 \]

For a set of \(i = 1 \ldots n\) samples and two predictors, this corresponds to

\[ k\left(\mathbf{x}_{1},\mathbf{x}_{2} \right) = x_{11}x_{12} + \ldots + x_{n1}x_{n2} \]

Kernel PCA then uses the kernel substitution trick to exchange the linear kernel with any other valid kernel. For a concise list of kernel properties for constructing valid kernels, see Bishop (2011). In addition to the linear kernel, the basic polynomial kernel has one parameter, \(d\) and is defined as:

\[ k\left(\mathbf{x}_{1},\mathbf{x}_{2} \right) = \langle \mathbf{x}_{1},\mathbf{x}_{2} \rangle^d \] Again, with two predictors and \(d = 2\):

\[ k\left(\mathbf{x}_{1},\mathbf{x}_{2} \right) = (x_{11}x_{12} + \ldots + x_{n1}x_{n2})^2 \] Using the polynomial kernel, we can directly expand a linear combination as a polynomial of degree \(d\) (similar to what was done for basis functions in Section 6.2.1).

The radial basis function (RBF) kernel has one parameter, \(\sigma\) and is defined as:

\[ k\left(\mathbf{x}_{1},\mathbf{x}_{2} \right) = \exp\left( -\frac{\lVert \mathbf{x}_{1} - \mathbf{x}_{2} \rVert^2}{2\sigma^2} \right)\]

The RBF kernel is also known as the Gaussian kernel due to its mathematical form and its relationship to the normal distribution. The radial basis and polynomial kernels are popular starting points for kernel PCA.

Utilizing this view enables the kernel approach to expand the original dimension of the data into high dimensions that comprise potentially beneficial representations of the original predictors. What’s even better is that the kernel trick side-steps the need to go through the hassle of creating these terms in the \(X\) matrix. Simply using a kernel function and solving the optimization problem directly explores the higher dimensional space.

Advantages are that we do not need to create these predictors; instead, we get these for free by using the kernel function. This approach is also computationally efficient, and is especially efficient when the number of predictors exceeds the number of observations.


Fig. 6.12: A comparison of PCA and kernel PCA. (a) The training set data. (b) The simulated relation between \(x_1\) and the outcome. (c) The PCA results (d) kPCA using a polynomial kernel shows better results. (e) Residual distributions for the two dimension reduction methods.

To illustrate the utility of kernel PCA, we will simulate data based on the model earlier in this section. For this simulation, we generated 200 samples, with 150 selected for the training set. We will use PCA and kernel PCA separately as dimension reduction techniques followed by linear regression to compare how these methods perform. Figure 6.12(a) displays the relationship between \(x_1\) and \(x_2\) for the training data. The scatter plot shows the linear relationship between these two predictors. Panel (b) uncovers the quadratic relationship between \(x_1\) and \(y\), which could easily be found and incorporated into a model if there were a small number of predictors. PCA found one dimension to summarize \(x_1\) and \(x_2\), and the relationship between the principal component and the response is presented in Figure 6.12(c). Often in practice, we use the number of components that summarize a threshold amount of variability as the predictors. If we blindly ran principal component regression, we would get a poor fit between \(PC_1\) and \(y\) because these data are better described by a quadratic fit. The relationship between the response and the first component using a polynomial kernel of degree 2 is presented in panel (d). The expansion of the space using this kernel allows PCA to better find the underlying structure among the predictors. Clearly, this structure has a more predictive relationship with the response than the linear method. The model performance on the test set is presented in panel (e), where the residuals from the kernel PCA model are demonstrably smaller than those from the PCA model.
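
A sketch of this comparison in scikit-learn; the data are simulated from the model above, the sample sizes match the description, and the remaining settings are arbitrary choices:

```python
import numpy as np
from sklearn.decomposition import PCA, KernelPCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(8)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.2, size=n)            # strongly correlated with x1
y = x1 + x1**2 + x2 + x2**2 + rng.normal(scale=0.5, size=n)
X = np.column_stack([x1, x2])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=150, random_state=1)

# linear PCA followed by regression on the first score
pc = PCA(n_components=1).fit(X_tr)
lm_pca = LinearRegression().fit(pc.transform(X_tr), y_tr)

# kernel PCA with a degree-2 polynomial kernel, then the same regression step
kpc = KernelPCA(n_components=1, kernel="poly", degree=2).fit(X_tr)
lm_kpca = LinearRegression().fit(kpc.transform(X_tr), y_tr)

# test-set R^2 for each dimension reduction strategy
print(lm_pca.score(pc.transform(X_te), y_te),
      lm_kpca.score(kpc.transform(X_te), y_te))
```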

For the Chicago data, a radial basis function kernel was used to compute new PCA components. The kernel’s only parameter (\(\sigma\)) was estimated analytically using the method of Caputo et al. (2002); values between 0.002 and 0.018 seemed reasonable, and a value near the middle of the range (0.006) was used to compute the components. A sensitivity analysis was done to assess whether other choices within this range led to appreciably different results, which they did not. The second row of Figure 6.9 shows the results. Similar to ordinary PCA, the first kPCA component has a strong association with ridership. However, the kernel-based component has a larger dynamic range on the x-axis. The other components shown in the figure are somewhat dissimilar from regular PCA. While these scores have less association with the outcome (in isolation), it is very possible that, when the scores are entered into a regression model simultaneously, they have predictive utility.

6.3.1.3 Independent Component Analysis

PCA seeks to account for the variation in the data and produces components that are mathematically orthogonal to each other. As a result, the components are uncorrelated. However, since correlation only measures the degree of linear dependence between two variables, uncorrelated variables may not be statistically independent of one another. There are a variety of toy examples of scatter plots of uncorrelated variables that nevertheless contain a clear relationship between the variables. One exception occurs when the underlying values follow a Gaussian distribution; in this case, uncorrelated data are also statistically independent.

Independent component analysis (ICA) is similar to PCA in a number of ways (Lee 1998). It creates new components that are linear combinations of the original variables but does so in a way that the components are as statistically independent from one another as possible. This enables ICA to model a broader set of trends than PCA, which focuses on orthogonality and linear relationships. There are a number of ways for ICA to meet the constraint of statistical independence, and often the goal is to maximize the “nongaussianity” of the resulting components. There are different ways to measure nongaussianity, and the fastICA approach uses an information theoretic measure called negentropy to do so (Hyvarinen and Oja 2000). In practice, ICA should create components that are dissimilar from PCA unless the predictors demonstrate significant multivariate normality or strictly linear trends. Also, unlike PCA, there is no unique ordering of the components.

It is common to normalize and whiten the data prior to running the ICA calculations. Whitening, in this case, means converting the original values to the full set of PCA components. This may seem counterintuitive, but this preprocessing does not negatively affect the goals of ICA and reduces the computational complexity substantially (Roberts and Everson 2001). Also, the ICA computations are typically initialized with random parameter values, and the end result can be sensitive to these values. To achieve reproducible results, it is helpful to set the initialization manually so that the same results can be recreated at a later time.
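
A sketch with scikit-learn's FastICA; the scaling, internal whitening, and fixed seed follow the advice above, while the data themselves are simulated placeholders:

```python
import numpy as np
from sklearn.decomposition import FastICA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(21)
X_train = rng.normal(size=(1628, 125))        # stand-in for the lagged weekend predictors

# scale the columns; FastICA whitens the data internally via a PCA-based step
X_scaled = StandardScaler().fit_transform(X_train)

# fixing random_state makes the randomly initialized solution reproducible
ica = FastICA(n_components=4, random_state=36, max_iter=1000)
ica_scores = ica.fit_transform(X_scaled)      # note: the columns have no inherent ordering
```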

ICA was applied to the Chicago data. Prior to estimating the components, the predictors were scaled so that the pre-ICA whitening step used the centered data to compute the initial PCA data. When scaling was not applied, the ICA components were strongly driven by different individual stations. The ICA components are shown in the third row of Figure 6.9 and appear very different from the other methods. The first component has very little relationship with the outcome and contains a set of outlying values. In this component, most stations had little to no contribution to the score. Many stations that had an effect were near Clark and Lake. While there were three stations that negatively affected the component (California, Damen, and Harrison), most had positive coefficients (e.g., Western and Logan Square). The other ICA components also had very weak relationships with the outcome. For these data, the PCA approach to finding appropriate rotations of the data was better at finding predictive relationships.

6.3.1.4 Non-negative Matrix Factorization

Non-negative matrix factorization (Gillis 2017; Fogel et al. 2013) is another linear projection method that is specific to features that are positive or zero. In this case, the algorithm finds the coefficients of \(A\) such that their values are also non-negative (thus ensuring that the new features have the same property). This approach is popular for text data where predictors are word counts, for imaging, and for biological measures (e.g. the amount of RNA expressed for a specific gene). In our case, it can be used on the ridership counts, although we have been using the logged values as predictors (none of which have negative values).

The method for determining the coefficients is conceptually simple: find the best set of coefficients that make the scores as “close” as possible to the original data with the constraint of non-negativity. Closeness can be measured using mean squared error (aggregated across the predictors) or other measures such as the Kullback–Leibler divergence (Cover and Thomas 2012), an information theoretic method for measuring the distance between two probability distributions. Like ICA, the numerical solver is initialized using random numbers, the final solution can be sensitive to these values, and the order of the components is arbitrary.
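
A sketch with scikit-learn's NMF using the Kullback–Leibler option mentioned above; the non-negative matrix is simulated, and the initialization and iteration settings are arbitrary:

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(30)
X_counts = rng.poisson(lam=20.0, size=(1628, 125)).astype(float)  # non-negative stand-in data

# the multiplicative-update ('mu') solver is required for the KL divergence loss
nmf = NMF(n_components=4, beta_loss="kullback-leibler", solver="mu",
          init="nndsvda", random_state=42, max_iter=500)
scores = nmf.fit_transform(X_counts)     # non-negative score matrix
coefs = nmf.components_                  # non-negative projection coefficients
```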

The first four non-negative matrix factorization components shown in Figure 6.9 had modest relationships with ridership at the Clark and Lake station. The third component appears most predictive and appeared to be driven by stations on the Green and Pink lines as well as a few stations along the Red line near the downtown loop, as shown in Figure 6.13.

Fig. 6.13: Coefficients for the third NNMF component. The size and color reflect the relative magnitude of the coefficients.

6.3.1.5 Partial Least Squares

The techniques discussed thus far are unsupervised, meaning that the objective of the dimension reduction is based solely on the predictors. In contrast, supervised dimension reduction techniques use the response to guide the dimension reduction of the predictors such that the new predictors are optimally related to the response. Partial least squares (PLS) is a supervised version of PCA that reduces dimension in a way that is optimally related to the response (Stone and Brooks 1990). Specifically, the objective of PLS is to find linear functions (called latent variables) of the predictors that have optimal covariance with the response. This means that the response guides the dimension reduction such that the scores have the highest possible correlation with the response in the training data. The typical implementation of the optimization problem requires that the data in each of the new dimensions be uncorrelated, a constraint similar to the one imposed in the PCA optimization problem. Because PLS is a supervised technique, we have found that it takes more PCA components than PLS latent variables to achieve the same level of performance (Kuhn and Johnson 2013). Therefore, PLS can reduce dimension more efficiently, leading to both memory and computational time savings during the model building process. However, because PLS uses the response to determine the optimal number of latent variables, we must employ a rigorous validation approach to ensure that the method does not overfit to the data.
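
A sketch with scikit-learn's PLSRegression; the data are simulated and the choice of two latent variables is arbitrary:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(9)
X = rng.normal(size=(1628, 125))                       # stand-in lagged predictors
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=1628)

# PLS finds latent variables whose covariance with the response is maximal
X_scaled = StandardScaler().fit_transform(X)
pls = PLSRegression(n_components=2).fit(X_scaled, y)
latent_scores = pls.transform(X_scaled)                # supervised score columns
```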

The PLS components for the Chicago weekend data are shown in Figure 6.9. The first component is effectively the same as the first PCA component. Recall that PLS tries to balance dimension reduction in the predictors with maximizing the correlation between the components and the outcome. In this particular case, as shown in the figure, the first PCA component has a strong linear relationship with ridership; in effect, the two goals of PLS are accomplished by the unsupervised component. For the other components shown in Figure 6.9, the second component is also very similar to the corresponding PCA version (up to the sign), but the other two components are dissimilar and show marginally better association with ridership.

To quantify how well each method estimated the relationship between the outcome and the lagged predictors, linear models were fit using each method. In each case, 20 components were computed and entered into a simple linear regression model. To resample, a similar time-based scheme was used in which the first 800 weekends formed the initial analysis set and two additional sets of weekends were used as the assessment set (these were then added to the analysis set for the next iteration). This led to 25 resamples. The projection methods were recomputed within each resample; this replication should capture the uncertainty potentially created by those computations. If we had pre-computed the components for the entire training set of weekends and then resampled, it is very likely that the results would have been falsely optimistic. Figure 6.14 shows confidence intervals for the RMSE values over the resamples for each method. Although the confidence intervals overlap, there is some indication that PLS, NNMF, and kernel PCA have the potential to outperform the original predictor values. Another upside of this analysis is that the decorrelated components do not appear to lose a substantial amount of vital information. For models that are very sensitive to between-predictor correlations (e.g., neural networks), these transformation methods could provide materially better performance.


Fig. 6.14: Model results for the Chicago weekend data using different dimension reduction methods (each with 20 components).
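A time-based resampling scheme of this kind could be constructed with the rsample package, as in the sketch below. The data frame name, the assumption that the rows are in chronological order, and the assessment size are illustrative assumptions rather than the exact configuration used here.

```r
library(rsample)

# A minimal sketch (hypothetical object name): grow the analysis set over time,
# assessing on the weekends that immediately follow it.
time_splits <- rolling_origin(
  weekend_data,
  initial    = 800,   # first 800 weekend rows form the initial analysis set
  assess     = 2,     # the next rows form the assessment set (size is an assumption)
  cumulative = TRUE   # assessment rows roll into later analysis sets
)
```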

6.3.2 Autoencoders

Autoencoders are computationally complex multivariate methods for finding representations of the predictor data and are commonly used in deep learning models (Goodfellow, Bengio, and Courville 2016). The idea is to create a nonlinear mapping between the original predictor data and a set of artificial features (usually of the same dimension). These new features, which may not have any sensible interpretation, are then used as the model predictors. While this sounds very similar to the previous projection methods, autoencoders are very different in terms of how the new features are derived and also in their potential benefit.

One situation in which an autoencoder is useful is when there is an abundance of unlabeled data (i.e., the outcome is not yet known). For example, a pharmaceutical company will have access to millions of chemical compounds, but only a very small percentage of these have the outcome data needed for new drug discovery projects. One common task in chemistry is to create models that use the chemical structure of a compound to predict how good a drug it might be. The predictors can be calculated for the vast majority of the chemical database, and these data, despite the lack of laboratory results, can be used to improve a model.

As an example, suppose that a modeler was going to fit a simple linear regression model. The well-known least squares solution would have the predictor data in a matrix denoted as \(X\) and the outcome results in a vector \(y\). To estimate the regression coefficients, the formula is

\[\hat{\beta} = (X'X)^{-1}X'y\]

Note that the outcome data are only used in the right-hand part of that equation (i.e., \(X'y\)). The unlabeled data could be used to produce a very good estimate of the inverse term, and that estimate could be saved. When the outcome data are available, \(X'y\) can be computed with the relatively small amount of labeled data. The equation above can then produce better parameter estimates because \((X'X)^{-1}\) was created from a large set of data.
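A small base R sketch of this idea is shown below; `X_unlabeled`, `X_labeled`, and `y_labeled` are hypothetical matrices and vectors, and the cross-products are scaled by their sample sizes so that the two pieces are on comparable footing.

```r
# A minimal sketch (hypothetical objects). The Gram matrix is estimated from the
# plentiful unlabeled predictors and inverted once; the outcome enters only
# through the cross-product computed on the small labeled set.
gram_inv <- solve(crossprod(X_unlabeled) / nrow(X_unlabeled))  # per-sample (X'X)^{-1}
xty      <- crossprod(X_labeled, y_labeled) / nrow(X_labeled)  # per-sample X'y
beta_hat <- gram_inv %*% xty
```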

The same philosophy can be used with autoencoders. All of the available predictor data can be used to create an equation that estimates (and potentially smooths) the relationships between the predictors. Based on these patterns, a smaller data set of predictors can be more effectively represented by using their “predicted” feature values.

There are a variety of ways that autoencoders can be created. The most (relatively) straightforward method is to use a neural network structure:

In this architecture, the functions that go from the inputs to the hidden units and between hidden units are nonlinear functions. Commonly used functions are simple sigmoids as well as ReLU functions55. The connections to the output features are typically linear. One strategy is to use a relatively small number of nodes in the hidden units to coerce the autoencoder to learn the crucial patterns in the predictors. It may also be that multiple layers with many nodes optimally represent the predictor values.

Each line in the network diagram represents a parameter to be estimated, and this number can become excessively large. However, there are successful approaches, such as regularization and dropout methods, to control overfitting, as well as a number of fairly efficient deep learning libraries that can ease the computational burden. For this reason, the benefit of this approach is greatest when there is a large amount of data.

The details of fitting neural network and deep learning models are beyond the scope of this book. Good references for the technical details are Bishop (2011) and Goodfellow et al. (2016), while Chollet and Allaire (2018) is an excellent guide to fitting these types of models56. Creating effective autoencoders entails model tuning and many of the other tasks used in predictive modeling in general. To measure performance, it is common to use the root mean squared error between the observed and predicted (predictor) values, aggregated across variables.

How can autoencoders be incorporated into the modeling process? It is common to treat the relationships estimated by the autoencoder model as “known” values that are not re-estimated during resampling. In fact, it is fairly common in deep learning on image and text data to use pre-trained networks in new projects. These pre-trained sections or blocks of networks can be used at the front of a new network. If a very large amount of data went into this model, and the model fitting process converged, there may be some statistical validity to this approach.

As an example, we used the data from Karthikeyan, Glen, and Bender (2005), who used chemical descriptors to model the melting point of chemical compounds (i.e., the transition from solid to liquid state). They assembled 4401 compounds and 202 numeric predictors. To simulate a drug discovery project just starting up, a random set of 50 data points was used as a training set and another random set of 25 was allocated to a test set. The remaining 4326 data points were treated as unlabeled data that do not have melting point values.

The unlabeled predictors were processed by first eliminating 126 predictors so that no pairwise correlations exceeded 0.75. A Yeo-Johnson transformation was then used to change the scale of the remaining 76 predictors (to make them less skewed). Finally, all of the predictors were centered and scaled. The same preprocessing was applied to the training and test sets.
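These preprocessing steps could be encoded with the recipes package as sketched below; the data frame names are hypothetical, and the recipe is shown as being estimated on the unlabeled compounds and then applied to the labeled sets.

```r
library(recipes)

# A minimal sketch (hypothetical object names) of the preprocessing described above.
prep_rec <- recipe(~ ., data = unlabeled_descriptors) %>%
  step_corr(all_numeric_predictors(), threshold = 0.75) %>%  # remove predictors driving high correlations
  step_YeoJohnson(all_numeric_predictors()) %>%              # reduce skewness
  step_center(all_numeric_predictors()) %>%
  step_scale(all_numeric_predictors()) %>%
  prep()

# Apply the same trained preprocessing to all three data sets
unlabeled_proc <- bake(prep_rec, new_data = NULL)
train_proc     <- bake(prep_rec, new_data = train_descriptors)
test_proc      <- bake(prep_rec, new_data = test_descriptors)
```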

An autoencoder with two hidden layers, each with 512 units, was used to model the predictors. Rectified linear unit (ReLU) activations were used to go from the original units to the first set of hidden units, hyperbolic tangent functions were used between the two hidden layers, and a linear function connected the second set of hidden units to the 76 output features. The total number of model parameters was 341,068. The model was trained across 500 iterations, minimizing the aggregate mean squared error; a random 20% of the unlabeled data was held out to measure performance as the model was being trained. Figure 6.15(a) shows that the MSE drastically decreases over the first 50 iterations and then slowly declines over the remaining 450. The autoencoder model was then applied to the training and test sets.


Fig. 6.15: The results of fitting an autoencoder to a drug discovery data set. Panel (a) shows the holdout data used to measure the MSE of the fitting process and (b) shows the resampling profiles of K-nearest neighbors with (blue) and without (black) the application of the autoencoder.
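A sketch of how such an autoencoder might be fit with the keras R package is shown below. The layer sizes and activations follow the description above; the optimizer, holdout handling, and object names are assumptions.

```r
library(keras)

# A minimal sketch (hypothetical objects): `x_unlabeled`, `x_train`, and `x_test`
# are numeric matrices containing the 76 processed predictors.
p <- ncol(x_unlabeled)

autoencoder <- keras_model_sequential() %>%
  layer_dense(units = 512, activation = "relu", input_shape = p) %>%  # inputs -> hidden layer 1
  layer_dense(units = 512, activation = "tanh") %>%                   # hidden layer 1 -> hidden layer 2
  layer_dense(units = p, activation = "linear")                       # hidden layer 2 -> output features

autoencoder %>% compile(optimizer = "adam", loss = "mse")

history <- autoencoder %>% fit(
  x = x_unlabeled, y = x_unlabeled,  # the model reconstructs its own inputs
  epochs = 500,
  validation_split = 0.2             # hold out 20% to track MSE (keras takes the last
)                                    # rows, so shuffle beforehand for a random holdout)

# The encoded versions of the labeled data are the predicted feature values
train_encoded <- predict(autoencoder, x_train)
test_encoded  <- predict(autoencoder, x_test)
```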

A K-nearest neighbor model was fit to both the original and encoded versions of the training set. Between one and twenty neighbors were tested, and the best model was chosen using RMSE estimated from five repeats of 10-fold cross-validation. The resampling profiles are shown in Figure 6.15(b). The best settings for the encoded data (11 neighbors, with an estimated RMSE of 50.6) and the original data (7 neighbors, RMSE = 54.8) had a difference of 4.2 degrees with a 95% confidence interval of (1.3, 7.2) degrees. While this is a moderate improvement, for a drug discovery project such an improvement could have a substantial impact if the model is better able to predict the effectiveness of new compounds.
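The KNN comparison could be run with the caret package as sketched below; the data frame names and the outcome column `mp` are hypothetical, with each data frame assumed to contain the outcome alongside its version of the predictors.

```r
library(caret)

# A minimal sketch (hypothetical object and column names): tune K from 1 to 20
# with five repeats of 10-fold cross-validation on both versions of the training set.
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)

knn_raw <- train(
  mp ~ ., data = train_raw,
  method    = "knn",
  tuneGrid  = data.frame(k = 1:20),
  trControl = ctrl
)

knn_encoded <- train(
  mp ~ ., data = train_encoded,
  method    = "knn",
  tuneGrid  = data.frame(k = 1:20),
  trControl = ctrl
)
```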

The test set was predicted with both KNN models. The model associated with the raw training set predictors did very poorly, producing an RMSE of 64.1 degrees. This could be due to the particular selection of training and test set compounds or to some other cause. The KNN model with the encoded values produced results that were consistent with the resampling results, with an RMSE value of 47. The entire process was repeated using different random number seeds in order to verify that the results were not produced strictly by chance.

6.3.3 Spatial Sign

The spatial sign transformation takes a set of predictor variables and transforms them so that the new values are all the same distance from the center of the distribution (Serneels, De Nolf, and Van Espen 2006). In essence, the data are projected onto a multidimensional sphere using the following formula:

\[ x^*_{ij}=\frac{x_{ij}}{\sqrt{\sum^{p}_{j=1} x_{ij}^2}} \]

This approach is also referred to as global contrast normalization and is often used in image analysis.

The transformation requires computing the norm of the data and, for that reason, the predictors should be centered and scaled prior to the spatial sign. Also, if any predictors have highly skewed distributions, they may benefit from a transformation that induces symmetry, such as the Yeo-Johnson technique, before the spatial sign is applied.
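A sketch with the recipes package follows; the data frame `scat_data` and outcome column `Species` are hypothetical names, and the Yeo-Johnson step reflects the recommendation above.

```r
library(recipes)

# A minimal sketch (hypothetical object and column names): symmetrize, center,
# and scale the numeric predictors before projecting them onto the unit sphere.
spatial_rec <- recipe(Species ~ ., data = scat_data) %>%
  step_YeoJohnson(all_numeric_predictors()) %>%
  step_center(all_numeric_predictors()) %>%
  step_scale(all_numeric_predictors()) %>%
  step_spatialsign(all_numeric_predictors()) %>%
  prep()

scat_sphered <- bake(spatial_rec, new_data = NULL)
```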

For example, Figure 6.16(a) shows data from Reid (2015), where characteristics of animal scat (i.e., feces) were used to predict the species of the animal that donated the sample. The figure shows two of the predictor variables, with points colored by the true species. The taper index has a significant outlier on the y-axis. When the transformation is applied to these two predictors, the new values in Figure 6.16(b) are all equidistant from the center (zero). While the transformed data may appear awkward for use in a new model, this transformation can be very effective at mitigating the damage caused by extreme outliers.


Fig. 6.16: Scatterplots of a classification data set before (a) and after (b) the spatial sign transformation.

6.3.4 Distance and Depth Features

In classification, it may be beneficial to make semi-supervised features from the data. In this context, this means that the outcome classes are used in the creation of the new predictors, but not in the sense that the features are optimized for predictive accuracy. For example, the means of the predictors can be computed for each class in the training set (commonly referred to as the class centroids). For new data, the distance to each class centroid can be computed and used as a feature. This essentially embeds a quasi-nearest-neighbor technique into the model that uses these features. For example, Figure 6.17(a) shows the class centroids for the scat data using the log carbon-nitrogen ratio and \(\delta^{13}C\) features. Based on these multidimensional averages, it is straightforward to compute a new sample’s distance to each class’s center. If the new sample is close to a particular class, its distance to that centroid should be small. Figure 6.17(a) shows an example of this approach where the new sample (denoted by a *) is relatively close to the bobcat and coyote centers. Different distance measures can be used where appropriate. To use this approach in a model, a new feature would be created for every class that measures the distance of each sample to the corresponding centroid.


Fig. 6.17: Examples of class distance to centroids (a) and depth calculations (b). The asterisk denotes a new sample being predicted and the squares correspond to class centroids.
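Class-centroid distance features like those in panel (a) could be generated with the recipes package, as in the sketch below; the object and column names are hypothetical, and note that `step_classdist()` computes covariance-weighted (Mahalanobis-style) distances rather than the simple Euclidean distance pictured.

```r
library(recipes)

# A minimal sketch (hypothetical object and column names): add one distance
# feature per class, measuring how far each sample is from that class centroid.
dist_rec <- recipe(Species ~ ., data = scat_data) %>%
  step_center(all_numeric_predictors()) %>%
  step_scale(all_numeric_predictors()) %>%
  step_classdist(all_numeric_predictors(), class = "Species", pool = FALSE) %>%
  prep()

scat_with_dists <- bake(dist_rec, new_data = NULL)
```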

Alternatively, the idea of data depth can be used (Ghosh and Chaudhuri 2005; Mozharovskyi, Mosler, and Lange 2015). Data depth measures how close a data point is to the center of its distribution. There are many ways to compute depth, and there is usually an inverse relationship between class distances and data depths (i.e., large depth values are more indicative of being close to the class centroid). If we were to assume bivariate normality for the data shown previously, the probability distributions could be computed within each class. The estimated Gaussian probabilities are shown as shadings in Figure 6.17(b).

In this case, measuring depth using the probability distributions provides slightly more sensitivity than the simple Euclidean distance shown in panel (a); the new sample being predicted resides more within the probability zone for coyote than for the other species.

A variety of depth measures can be used, and many do not require specific assumptions about the distribution of the predictors; some are also robust to outliers. As with the distance features, class-specific depths can be computed and added to the model.
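Class-specific depth features could be added in a similar way using `step_depth()` from the recipes package; the object and column names are hypothetical, and the choice of the halfspace depth metric is an assumption.

```r
library(recipes)

# A minimal sketch (hypothetical object and column names): add one depth feature
# per class; larger values indicate samples deeper within that class's distribution.
depth_rec <- recipe(Species ~ ., data = scat_data) %>%
  step_depth(all_numeric_predictors(), class = "Species", metric = "halfspace") %>%
  prep()

scat_with_depths <- bake(depth_rec, new_data = NULL)
```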

These types of features can accentuate the ability of some models to predict an outcome. For example, if the true class boundary is a diagonal line, most classification trees would have difficulty emulating this pattern. Supplementing the predictor set with distance or depth features can enable such models in these situations.


Box, GEP, and D Cox. 1964. “An Analysis of Transformations.” Journal of the Royal Statistical Society. Series B (Methodological), 211–52.

Yeo, I-K, and R Johnson. 2000. “A New Family of Power Transformations to Improve Normality or Symmetry.” Biometrika 87 (4):954–59.

Wood, Simon. 2006. Generalized Additive Models: An Introduction with R. Chapman & Hall/CRC.

Eilers, P, and B Marx. 2010. “Splines, Knots, and Penalties.” Wiley Interdisciplinary Reviews: Computational Statistics 2 (6):637–53.

Golub, G, M Heath, and G Wahba. 1979. “Generalized Cross-Validation as a Method for Choosing a Good Ridge Parameter.” Technometrics 21 (2):215–23.

Yandell, B. 1993. “Smoothing Splines - a Tutorial.” The Statistician, 317–19.

Friedman, J. 1991. “Multivariate Adaptive Regression Splines.” The Annals of Statistics 19 (1):1–141.

Nair, V, and G. Hinton. 2010. “Rectified Linear Units Improve Restricted Boltzmann Machines.” In Proceedings of the 27th International Conference on Machine Learning, edited by Johannes Fürnkranz and Thorsten Joachims, 807–14. Omnipress.

Altman, D. 1991. “Categorising Continuous Variables.” British Journal of Cancer, no. 5:975.

Altman, D, B Lausen, W Sauerbrei, and M Schumacher. 1994. “Dangers of Using "Optimal" Cutpoints in the Evaluation of Prognostic Factors.” Journal of the National Cancer Institute 86 (11):829–35.

Kenny, P, and C Montanari. 2013. “Inflation of Correlation in the Pursuit of Drug-Likeness.” Journal of Computer-Aided Molecular Design 27 (1):1–13.

Abdi, H, and L Williams. 2010. “Principal Component Analysis.” Wiley Interdisciplinary Reviews: Computational Statistics 2 (4):433–59.

Massy, William F. 1965. “Principal Components Regression in Exploratory Statistical Research.” Journal of the American Statistical Association 60 (309). Taylor & Francis:234–56.

Greenacre, M. 2010. Biplots in Practice. Fundacion BBVA.

Schölkopf, Bernhard, Alexander Smola, and Klaus-Robert Müller. 1998. “Nonlinear Component Analysis as a Kernel Eigenvalue Problem.” Neural Computation 10 (5). MIT Press:1299–1319.

Shawe-Taylor, John, and Nello Cristianini. 2004. Kernel Methods for Pattern Analysis. Cambridge University Press.

Bishop, C. 2011. Pattern Recognition and Machine Learning. Springer.

Caputo, B, K Sim, F Furesjo, and A Smola. 2002. “Appearance-Based Object Recognition Using SVMs: Which Kernel Should I Use?” In Proceedings of NIPS Workshop on Statistical Methods for Computational Experiments in Visual Processing and Computer Vision. Vol. 2002.

Lee, T-W. 1998. Independent Component Analysis. Springer.

Hyvarinen, A, and E Oja. 2000. “Independent Component Analysis: Algorithms and Applications.” Neural Networks 13 (4-5). Elsevier:411–30.

Roberts, S, and R Everson. 2001. Independent Component Analysis: Principles and Practice. Cambridge University Press.

Gillis, N. 2017. “Introduction to Nonnegative Matrix Factorization.” arXiv Preprint arXiv:1703.00663.

Fogel, P, D Hawkins, C Beecher, G Luta, and S Young. 2013. “A Tale of Two Matrix Factorizations.” The American Statistician 67 (4):207–18.

Cover, T, and J Thomas. 2012. Elements of Information Theory. John Wiley & Sons.

Stone, Mervyn, and Rodney J Brooks. 1990. “Continuum Regression: Cross-Validated Sequentially Constructed Prediction Embracing Ordinary Least Squares, Partial Least Squares and Principal Components Regression.” Journal of the Royal Statistical Society. Series B (Methodological), 237–69.

Kuhn, Max, and Kjell Johnson. 2013. Applied Predictive Modeling. Vol. 26. Springer.

Goodfellow, I, Y Bengio, and A Courville. 2016. Deep Learning. MIT Press.

Chollet, F, and JJ Allaire. 2018. Deep Learning with R. Manning.

Karthikeyan, M, R Glen, and A Bender. 2005. “General Melting Point Prediction Based on a Diverse Compound Data Set and Artificial Neural Networks.” Journal of Chemical Information and Modeling 45 (3):581–90.

Serneels, S, E De Nolf, and P Van Espen. 2006. “Spatial Sign Preprocessing: A Simple Way to Impart Moderate Robustness to Multivariate Estimators.” Journal of Chemical Information and Modeling 46 (3):1402–9.

Reid, R. 2015. “A Morphometric Modeling Approach to Distinguishing Among Bobcat, Coyote and Gray Fox Scats.” Wildlife Biology 21 (5). BioOne:254–62.

Ghosh, A K, and P Chaudhuri. 2005. “On Data Depth and Distribution-Free Discriminant Analysis Using Separating Surfaces.” Bernoulli 11 (1):1–27.

Mozharovskyi, P, K Mosler, and T Lange. 2015. “Classifying Real-World Data with the DD\(\alpha\)-Procedure.” Advances in Data Analysis and Classification 9 (3). Springer:287–314.


  1. There are various approaches for the beginning and end of the data, such as leaving the data as is.

  2. Both variables are highly right skewed so a log transformation was applied to both axes.

  3. These methods also ensure that the derivatives of the function, to a certain order, are also continuous. This is one reason that cubic functions are typically used.

  4. Note that the MARS model sequentially generates a set of knots adaptively and determines which should be retained in the model.

  5. Except kernel PCA.

  6. This is a situation similar to the one described above for the Box-Cox transformation.

  7. We don’t suggest this as a realistic analysis strategy for these data, only to make the illustration more effective.

  8. In this case, “influential” means the largest coefficients.

  9. Recall Figure 4.1 for the station lines.

  10. This does not imply that PCA produces components that follow Gaussian distributions.

  11. The second, third, and fourth components are scaled in this figure since their ranges were very small in comparison to the first component. The rescaled values were not used anywhere else in these analyses.

  12. This helps prevent the model from learning a simple identity function from the data (that would provide no real value).

  13. Code for the example below can be found in the book’s GitHub repository.