## 6.2 1:Many Transformations

The previous chapter illustrated the process of creating multiple numeric indicator columns from a single qualitative predictor. Similarly, transformations can be made on a single numeric predictor to expand it to many predictors. These one-to-many transformations of the data can be used to improve model performance.

### 6.2.1 Nonlinear Features via Basis Expansions and Splines

A *basis expansion* of a predictor \(x\) can be achieved by deriving a set of functions \(f_i(x)\) that can be combined using a linear combination. For example, in the last chapter, polynomial contrast functions were used to encode ordered categories. For a continuous predictor \(x\), a cubic basis expansion is

\[ f(x) = \sum_{i=1}^3 \beta_i f_i(x) = \beta_1 x + \beta_2 x^2 + \beta_3 x^3 \]

To use this basis expansion in the model, the original column is augmented by two new features with squared and cubed versions of the original. The \(\beta\) values could be estimated using basic linear regression. If the true trend were linear, the last two regression parameters would presumably be near zero (at least relative to their standard errors). Also, the linear regression used to determine the coefficients for the basis expansion could be estimated in the presence of other regressors.
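As a minimal sketch of the cubic expansion described above, the following NumPy code builds the squared and cubed columns and estimates the \(\beta\) values by least squares. The helper names (`cubic_expansion`, `fit_linear`) and the simulated data are illustrative assumptions, not part of the original analysis.

```python
import numpy as np

def cubic_expansion(x):
    """Return the columns x, x^2, x^3 for a 1-D numeric predictor."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([x, x**2, x**3])

def fit_linear(features, y):
    """Least-squares fit with an intercept; returns (intercept, betas)."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[0], coef[1:]

# Simulated (hypothetical) data whose true trend is linear.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=200)
y = 1.0 + 0.5 * x + rng.normal(scale=0.1, size=200)

intercept, betas = fit_linear(cubic_expansion(x), y)
print(intercept, betas)  # the x^2 and x^3 coefficients should be near zero
```

Because the simulated trend is linear, the coefficients on the squared and cubed terms should be small relative to the linear one, which mirrors the argument in the text.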

This type of basis expansion, where the pattern is applied *globally* to the predictor, can often be insufficient. For example, take the lot size variable in the Ames housing data. When plotted against the sale price^{54}, there is a linearly increasing trend in the mainstream of the data (between \(10^{3.75}\) and \(10^{4.25}\)) but on either side of the bulk of the data, the patterns are negligible.

An alternative to creating a global basis function for use in a regression model is a *polynomial spline* (Wood 2006; Eilers and Marx 2010). Here, the basis expansion creates different regions of the predictor space whose boundaries are called *knots*. The polynomial spline uses polynomial functions, usually cubic, within each of these regions to represent the data. We would like the cubic functions to be connected at the knots, so specialized functions can be created to ensure an overall continuous function^{55}. Here, the number of knots controls the number of regions and also the potential complexity of the function. If the number of knots is low (perhaps three or fewer), the basis function can represent relatively simple trends. Functions with more knots are more adaptable but are also more likely to overfit the data.

The knots are typically chosen using percentiles of the data so that a relatively equal amount of data is contained within each region. For example, a spline with three regions typically places the knots at the 33.3rd and 66.7th percentiles. Once the knots are determined, the basis functions can be created and used in a regression model. There are many types of polynomial splines and the approach described here is often called a *natural cubic spline*.

For the Ames lot area data, consider a spline with 6 regions. The procedure places the knots at the following percentiles: 16.7%, 33.3%, 50%, 66.7%, 83.3%. When creating the natural spline basis functions, the first function is typically taken as the intercept. The other functions of \(x\) generated by this procedure are shown in Figure 6.4(a). Here the blue lines indicate the knots. Note that the first three features contain regions where the basis function value is zero. This implies that those features do not affect that part of the predictor space. The other three features appear to put the most weight on areas outside the mainstream of the data. Panel (b) of this figure shows the final regression form. There is a strong linear trend in the middle of the cloud of points and weak relationships outside of the mainstream. The left-hand side does not appear to fit the data especially well. This might imply that more knots are required.
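The knot placement and basis construction can be sketched as follows. This is an illustration under stated assumptions: it uses the truncated-power form of the natural cubic spline basis, which spans the same function space as the basis usually produced by statistical software but yields different-looking columns than those in Figure 6.4(a). The function names and the simulated stand-in for the log lot area are hypothetical.

```python
import numpy as np

def percentile_knots(x, n_regions=6):
    """Boundary knots at the extremes plus interior knots at equally spaced percentiles."""
    probs = np.linspace(0, 1, n_regions + 1)
    return np.quantile(np.asarray(x, dtype=float), probs)

def natural_spline_basis(x, knots):
    """Natural cubic spline basis (truncated-power form); the first column is the intercept."""
    x = np.asarray(x, dtype=float)
    knots = np.asarray(knots, dtype=float)
    last = knots[-1]

    def d(k):
        # Scaled difference of truncated cubics; keeps the fit linear beyond the last knot.
        num = np.maximum(x - knots[k], 0.0) ** 3 - np.maximum(x - last, 0.0) ** 3
        return num / (last - knots[k])

    cols = [np.ones_like(x), x]
    d_ref = d(len(knots) - 2)
    for k in range(len(knots) - 2):
        cols.append(d(k) - d_ref)
    return np.column_stack(cols)

# Simulated stand-in for the log10 lot area (not the Ames data).
rng = np.random.default_rng(0)
log_area = rng.normal(loc=4.0, scale=0.2, size=500)
knots = percentile_knots(log_area, n_regions=6)   # 5 interior + 2 boundary knots
basis = natural_spline_basis(log_area, knots)     # intercept + 6 features
print(knots[1:-1], basis.shape)
```

With six regions, the five interior knots fall at the 16.7th, 33.3rd, 50th, 66.7th, and 83.3rd percentiles, and the basis has an intercept column plus six features that could be passed to an ordinary linear regression.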

How many knots should be used? This is a tuning parameter that can be determined using grid search or through visual inspection of the smoother results (e.g. Figure 6.4(b)). In a single dimension, there will be visual evidence if the model is overfitting. Also, some spline functions use a method called *generalized cross-validation* (GCV) to estimate the appropriate spline complexity using a computational shortcut for linear regression (Golub, Heath, and Wahba 1979).

Note that the basis function was created *prior* to being exposed to the outcome data. This does not have to be the case. One method of determining the optimal complexity of the curve is to initially assign *every* training set point as a potential knot and use regularized regression models (similar to weight decay or ridge regression) to determine which instances should be considered to be knots. This approach is used by the *smoothing spline* methodology (Yandell 1993). Additionally, there is a rich class of models called *generalized additive models* (GAMs). These extend generalized linear models, which include linear and logistic regression, to have nonlinear terms for individual predictors (but cannot model interactions). GAMs can adaptively model separate basis functions for different variables and estimate the complexity for each. In other words, different predictors can be modeled with different levels of complexity. There are also many other types of supervised nonlinear smoothers that can be used. For example, the `loess` model fits a weighted moving regression line across a predictor to estimate the nonlinear pattern in the data. See Wood (2006) for more technical information about GAMs.

One other feature construction method related to splines and the multivariate adaptive regression spline (MARS) model (Friedman 1991) is the single, fixed knot spline. The *hinge function* transformation used by that methodology is

\[h(x) = x I(x > 0)\]

where \(I\) is an indicator function that is one when \(x\) is greater than zero and zero otherwise, so that \(h(x)\) equals \(x\) for positive \(x\) and zero elsewhere. For example, if the log of the lot area were represented by \(x\), \(h(x - 3.75)\) generates a new feature that is zero when the log lot area is less than 3.75 and is equal to \(x - 3.75\) otherwise. The opposite feature, \(h(3.75 - x)\), is zero above a value of 3.75. The effect of the transformation can be seen in Figure 6.5(a), which illustrates how a pair of hinge functions isolates certain areas of the predictor space.
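A small sketch of the hinge transformation and the pair of features with a knot at 3.75 is shown here. The input values are made up for illustration and the function name `hinge` is an assumption.

```python
import numpy as np

def hinge(x):
    """h(x) = x * I(x > 0): equal to x where x is positive, zero elsewhere."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, x, 0.0)

# Hypothetical log10 lot area values.
log_lot_area = np.array([3.25, 3.75, 4.00, 4.50])

right = hinge(log_lot_area - 3.75)  # zero at or below the knot
left = hinge(3.75 - log_lot_area)   # zero at or above the knot
print(right, left)
```

Each feature is nonzero on only one side of the knot, which is what lets a pair of hinge functions isolate separate areas of the predictor space.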

These features can be added to models to create *segmented regression models* that have distinct sections with different trends. As previously mentioned, the lot area data in Figure 6.4 exhibits a linear trend in the central region of the data and flat trends on either extreme. Suppose that two sets of hinge functions were added to a linear model with knots at 3.75 and 4.25 on the log scale. This would generate a separate linear trend in each of three regions: below \(10^{3.75}\), between \(10^{3.75}\) and \(10^{4.25}\), and above \(10^{4.25}\). When added to a linear regression, the left-most region’s slope would be driven by the “left-handed” hinge functions for both knots. The middle region would involve the slope for the right-handed hinge function with the knot at \(10^{3.75}\) (the left-handed function for that knot is zero there) as well as the left-handed hinge function with the knot at \(10^{4.25}\), and so on^{56}. The model fit associated with this strategy is shown in Figure 6.5(b).
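The segmented regression idea can be sketched with two pairs of hinge features and ordinary least squares. This is an illustration on simulated data with a flat, linear, flat pattern, not the Ames fit in Figure 6.5(b); the function names are hypothetical.

```python
import numpy as np

def hinge(x):
    """h(x) = x * I(x > 0)."""
    return np.maximum(np.asarray(x, dtype=float), 0.0)

def hinge_pair_features(x, knots):
    """For each knot k, add the right-handed h(x - k) and left-handed h(k - x) columns."""
    x = np.asarray(x, dtype=float)
    cols = []
    for k in knots:
        cols.append(hinge(x - k))
        cols.append(hinge(k - x))
    return np.column_stack(cols)

def fit_segmented(x, y, knots):
    """Least-squares fit of intercept + hinge pairs; returns (coefficients, fitted values)."""
    design = np.column_stack([np.ones(len(x)), hinge_pair_features(x, knots)])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef, design @ coef

# Simulated data: flat below 3.75, increasing in the middle, flat above 4.25.
rng = np.random.default_rng(0)
x = rng.uniform(3.0, 5.0, size=500)
y = np.clip(x, 3.75, 4.25) + rng.normal(scale=0.05, size=500)

coef, fitted = fit_segmented(x, y, knots=[3.75, 4.25])
print(coef)
```

One design note: with two complete hinge pairs and an intercept, the columns are linearly dependent, because \(h(x - k) - h(k - x) = x - k\) for any knot \(k\). `np.linalg.lstsq` handles this by returning a minimum-norm solution, so the fitted values are well defined even though the individual coefficients are not unique. Dropping one of the hinge columns would give a full-rank alternative.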

This feature generation function is known in the neural network and deep learning fields as the rectified linear unit (ReLU) activation function. Nair and Hinton (2010) give a summary of its use in these areas.

While basis functions can be effective at helping build better models by representing nonlinear patterns for a feature, they can also be very effective for exploratory data analysis. Visualizations, such as Figures 6.4(b) and 6.5, can be used to inform the modeler about what the potential functional form of the prediction should be (e.g., log-linear, quadratic, segmented, etc.).

### 6.2.2 Discretize Predictors as a Last Resort

Binning, also known as categorization or discretization, is the process of translating a quantitative variable into a set of two or more qualitative buckets (i.e., categories). For example, a variable might be translated into quartiles; the buckets would be for whether the numbers fell into the first 25% of the data, between the 25th percentile and the median, etc. In this case, there would be four distinct values of the binned version of the data.
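For concreteness, quartile binning might look like the sketch below. The helper name `quartile_bins` and the simulated predictor are assumptions made for illustration.

```python
import numpy as np

def quartile_bins(x):
    """Map each value to a bucket 0..3 using the 25th, 50th, and 75th percentiles."""
    x = np.asarray(x, dtype=float)
    cuts = np.quantile(x, [0.25, 0.50, 0.75])
    return np.digitize(x, cuts), cuts

rng = np.random.default_rng(1)
x = rng.normal(size=1000)          # hypothetical quantitative predictor
bins, cuts = quartile_bins(x)
print(cuts, np.bincount(bins))     # four buckets of roughly equal size
```

The binned variable keeps only the bucket membership, so all variation within a bucket is discarded.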

There are a few apparent reasons for subjecting the data to such a transformation:

- Some feel that it simplifies the analysis and/or interpretation of the results. Suppose that a person’s age was a predictor and this was binned by whether someone was above 40 years old or not. One might be able to make a statement that there is a 25% increase in the probability of the event between younger and older people. There is no discussion of per-unit increases in the response.
- Binning *may* avoid the problem of having to specify the relationship between the predictor and outcome. A set of bins can be perceived as being able to model more patterns without having to visualize or think about the underlying pattern.
- Using qualitative versions of the predictors may give the perception that it reduces the variation in the data. This is discussed at length below.

There are a number of methods for binning data. Some are *unsupervised* and are based on user-driven cutoffs or estimated percentiles (e.g. the median). In other cases, the placement (and number) of cut-points are optimized to improve performance. For example, if a single split were required, an ROC curve (Section 3.2.2) could be used to find an appropriate exchange of sensitivity and specificity.

There are a number of problems with converting continuous data to categorical data. First, it is extremely unlikely that the underlying trend is consistent with the new model. Second, when a real trend exists, discretizing the data is most likely making it *harder* for the model to do an effective job since all of the nuance in the data has been removed. Third, there is probably no objective rationale for a specific cut-point. Fourth, when there is no relationship between the outcome and the predictor, there is a substantial increase in the probability that an erroneous trend will be “discovered”. This has been widely researched and verified. See Altman (1991), Altman et al. (1994), and the references therein.

Kenny and Montanari (2013) do an exceptional job of illustrating how this can occur and our example follows their research. Suppose a set of variables are being screened to find which have a relationship with a numeric outcome. Figure 6.6(a) shows an example of a simulated data set with a linear trend with a coefficient of 1.0 and normally distributed errors with a standard deviation of 4.0. A linear regression of these data (shown) would find an increasing trend and would have an estimated \(R^2\) of 30.3%. The data were cut into 10 equally spaced bins and the average outcome was estimated per bin. These data are shown in panel (b) along with the regression line. Since the average is being used, the estimated \(R^2\) is much larger (64.2%) since the model *thinks* that the data are more precise (which, in reality, they are not).

Unfortunately, this artificial reduction in variation can lead to a high false positive rate when the *predictor is unrelated to the outcome*. To illustrate this, the same data were used but with a slope of zero (i.e., the predictor is uninformative). This simulation was run a large number of times with and without binning the data and Figure 6.6(c) shows the distribution of the estimated \(R^2\) values across the simulations. For the raw data, the largest \(R^2\) value seen in 1000 data sets was 18.3%, so every value was below 20%. When the data were discretized, the \(R^2\) values tended to be much larger (with a maximum of 76.9%). In fact, 19% of the simulations had \(R^2 \ge 20\)%.
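A simulation in the spirit of the one described above can be sketched as follows. The sample size per data set, the range of the predictor, and the random seed are assumptions (the text does not state them), so the output will not reproduce the exact figures quoted here; only the qualitative pattern is expected to match.

```python
import numpy as np

def r_squared(x, y):
    """R^2 of a simple linear regression, i.e. the squared correlation."""
    return np.corrcoef(x, y)[0, 1] ** 2

def binned_means(x, y, n_bins=10):
    """Cut x into equally spaced bins and average the outcome within each bin."""
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    idx = np.digitize(x, edges[1:-1])            # bin index 0..n_bins-1
    mids = (edges[:-1] + edges[1:]) / 2
    keep = [b for b in range(n_bins) if np.any(idx == b)]
    means = np.array([y[idx == b].mean() for b in keep])
    return mids[keep], means

def simulate(n_sims=1000, n_obs=100, slope=0.0, sd=4.0, seed=0):
    """Return arrays of R^2 values for the raw data and for the binned averages."""
    rng = np.random.default_rng(seed)
    raw, binned = [], []
    for _ in range(n_sims):
        x = rng.uniform(0, 10, size=n_obs)       # assumed predictor range
        y = slope * x + rng.normal(scale=sd, size=n_obs)
        raw.append(r_squared(x, y))
        bx, by = binned_means(x, y)
        binned.append(r_squared(bx, by))
    return np.array(raw), np.array(binned)

raw, binned = simulate()                          # slope of zero: uninformative predictor
print(raw.max(), binned.max(), np.mean(binned >= 0.20))
```

Even though the predictor is unrelated to the outcome, regressing on a handful of bin averages leaves few effective data points, so large \(R^2\) values arise by chance far more often than with the raw data.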

There is a possibility that a discretization procedure might improve the model; the No Free Lunch Theorem discounts the idea that a methodology will *never* work. If a categorization strategy is to be used, especially if it is supervised, we suggest that:

- The strategy should not be part of normal operating procedure. Categorizing predictors should be a method of last resort.
- The determination of the bins *must* be included inside of the resampling process. This will help diagnose when nonexistent relationships are being induced by the procedure and will mitigate the overestimation of performance when an informative predictor is used. Also, since some sort of procedure would have to be defined to be included inside of resampling, it would prevent the typical approach of “eyeballing” the data to pick cut-points. We have found that a binning strategy based solely on visual inspection of the data has a higher propensity of overfitting.
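One way to keep the bin determination inside the resampling process is sketched below. It is a simplified illustration, assuming quantile-based cut-points, a per-bin mean as the prediction, and plain K-fold cross-validation; the function names and simulated data are hypothetical.

```python
import numpy as np

def fit_bins(x_train, y_train, n_bins=4):
    """Estimate cut-points and per-bin outcome means from the analysis set only."""
    probs = np.linspace(0, 1, n_bins + 1)[1:-1]
    cuts = np.quantile(x_train, probs)
    idx = np.digitize(x_train, cuts)
    overall = y_train.mean()
    means = np.array([
        y_train[idx == b].mean() if np.any(idx == b) else overall
        for b in range(n_bins)
    ])
    return cuts, means

def predict_bins(x_new, cuts, means):
    """Predict with the mean of the bin that each new value falls into."""
    return means[np.digitize(x_new, cuts)]

def cv_rmse(x, y, n_folds=5, n_bins=4, seed=0):
    """K-fold RMSE where the cut-points are re-estimated within every fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), n_folds)
    scores = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(x)), held_out)
        cuts, means = fit_bins(x[train], y[train], n_bins)   # no held-out data used
        pred = predict_bins(x[held_out], cuts, means)
        scores.append(np.sqrt(np.mean((y[held_out] - pred) ** 2)))
    return np.array(scores)

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=200)
y = x + rng.normal(scale=1.0, size=200)
print(cv_rmse(x, y))
```

Because the cut-points are recomputed from each analysis set, the held-out error reflects the variability of the binning procedure itself rather than treating the cut-points as known in advance.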

### References

Wood, S. 2006. *Generalized Additive Models: An Introduction with R*. Chapman and Hall/CRC.

Eilers, P, and B Marx. 2010. “Splines, Knots, and Penalties.” *Wiley Interdisciplinary Reviews: Computational Statistics* 2 (6):637–53.

Golub, G, M Heath, and G Wahba. 1979. “Generalized Cross-Validation as a Method for Choosing a Good Ridge Parameter.” *Technometrics* 21 (2):215–23.

Yandell, B. 1993. “Smoothing Splines - a Tutorial.” *The Statistician*, 317–19.

Friedman, J. 1991. “Multivariate Adaptive Regression Splines.” *The Annals of Statistics* 19 (1):1–141.

Nair, V, and G Hinton. 2010. “Rectified Linear Units Improve Restricted Boltzmann Machines.” In *Proceedings of the 27th International Conference on Machine Learning*, edited by J Fürnkranz and T Joachims, 807–14. Omnipress.

Altman, D. 1991. “Categorising Continuous Variables.” *British Journal of Cancer* 64 (5):975.

Altman, D, B Lausen, W Sauerbrei, and M Schumacher. 1994. “Dangers of Using "Optimal" Cutpoints in the Evaluation of Prognostic Factors.” *Journal of the National Cancer Institute* 86 (11):829–35.

Kenny, P, and C Montanari. 2013. “Inflation of Correlation in the Pursuit of Drug-Likeness.” *Journal of Computer-Aided Molecular Design* 27 (1):1–13.

^{54} Both variables are highly right skewed, so a log transformation was applied to both axes.

^{55} These methods also ensure that the derivatives of the function, to a certain order, are also continuous. This is one reason that cubic functions are typically used.

^{56} Note that the MARS model sequentially generates a set of knots adaptively and determines which should be retained in the model.