6 Engineering Numeric Predictors

The previous chapter provided methods for skillfully modifying qualitative predictors. Often, other predictors have continuous, real number values. The objective of this chapter is to develop tools for converting these types of predictors into a form that a model can better utilize.

Predictors that are on a continuous scale are subject to a host of potential issues that we may have to confront. Some of the problems that are prevalent with continuous predictors can be mitigated through the type of model that we choose. For example, models that construct relationships between the predictors and the response based on the rank of the predictors' values rather than the actual values, such as trees, are immune to skewed predictor distributions or to individual samples that have unusual values (i.e., outliers). Other models, such as K-nearest neighbors and support vector machines, are much more sensitive to predictors with skewed distributions or outliers. High correlation among continuous predictors is another regularly occurring scenario that presents a problem for some models but not for others. Partial least squares, for instance, is specifically built to directly handle highly correlated predictors. But models like multiple linear regression or neural networks are adversely affected in this situation.
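The rank-based argument can be seen with a small simulation. An association measure computed on the ranks of a predictor (a Spearman-style correlation) is unchanged by any monotone transformation of that predictor, while a linear (Pearson) measure is degraded by the induced skewness. The data and helper functions below are hypothetical, constructed only for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# A predictor with a linear relationship to the response ...
x = rng.normal(size=500)
y = 2 * x + rng.normal(scale=0.5, size=500)

# ... and a heavily right-skewed version of the same predictor.
# Exponentiation is monotone, so the *ranks* of the values are
# unchanged even though the distribution is now skewed.
x_skewed = np.exp(x)

def pearson(a, b):
    """Ordinary (linear) correlation."""
    return np.corrcoef(a, b)[0, 1]

def spearman(a, b):
    """Rank-based correlation: Pearson computed on the ranks."""
    def rank(v):
        return np.argsort(np.argsort(v))
    return pearson(rank(a), rank(b))

print(pearson(x, y))          # strong linear association
print(pearson(x_skewed, y))   # weakened by the skewness
print(spearman(x, y))         # rank-based measure ...
print(spearman(x_skewed, y))  # ... identical under a monotone transform
```

Because a tree splits on the ordering of a predictor's values, it behaves like the rank-based measure here: any monotone distortion of the predictor leaves its splits, and hence its fit, unchanged.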

To utilize and explore the predictive ability of more types of models, the issues presented by the predictors need to be addressed by engineering them in a useful way.

In this chapter we will present and illustrate approaches for handling continuous predictors with commonly occurring issues. The predictors may:

  • be on vastly different scales.
  • follow a skewed distribution in which a small proportion of samples are orders of magnitude larger than the majority of the data.
  • contain a small number of extreme values.
  • be censored on the low and/or high end of the range.
  • have a complex yet truly predictive relationship with the response that cannot be adequately represented with a simple function or extracted by sophisticated models.
  • contain relevant but redundant information. That is, the information collected could be more effectively and efficiently represented with a smaller, consolidated set of new predictors while still preserving or enhancing the new predictors' relationship with the response.
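As a preview of the last point, the following sketch uses simulated data (hypothetical, for illustration only) to show how two predictors carrying redundant information can be consolidated into essentially one new predictor via a principal-component-style decomposition, the kind of methodology discussed later in the chapter:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two predictors that carry largely redundant information:
# both are noisy measurements of the same underlying signal.
signal = rng.normal(size=300)
x1 = signal + rng.normal(scale=0.1, size=300)
x2 = signal + rng.normal(scale=0.1, size=300)
X = np.column_stack([x1, x2])

# Center the columns and compute principal components via the SVD.
Xc = X - X.mean(axis=0)
_, s, _ = np.linalg.svd(Xc, full_matrices=False)

# Proportion of total variance captured by each component.
explained = s**2 / np.sum(s**2)
print(explained)  # the first component carries nearly all the variance
```

Here a single consolidated predictor (the first component) preserves almost all of the information originally spread across the two correlated columns.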

The techniques in this chapter have been organized into three general categories. The first category of engineering techniques addresses problematic characteristics of individual predictors (Section 6.1). Section 6.2 illustrates methods for expanding individual predictors into many predictors in order to better represent complex predictor-response relationships and to enable the extraction of predictive information. Last, Section 6.3 provides methodology for consolidating redundant information across many predictors. In the end, the goal of all of these approaches is to convert the existing continuous predictors into a form that can be utilized by any model and that presents the most useful information to the model.

As with many techniques discussed in this text, the need for these tools can be very data- and model-dependent. For example, transformations to resolve skewness or outliers would not be needed for some models but would be critical for others to perform well. In this chapter, we will provide a guide to which models would benefit from specific preprocessing methods.