1.1 A Simple Example

As a simple example of how feature engineering can affect models, consider Figure 1.2a, which shows a plot of two correlated predictor variables (labeled A and B). The data points are colored by their outcome, a discrete variable with two possible values (“PS” and “WS”). These data originate from an experiment by Hill et al. (2007) that includes a larger predictor set. For their task, a model would require a high degree of accuracy but would not need to be used for inference. For this illustration, only these two predictors will be considered. In this figure, there is clearly a diagonal separation between the two classes. A simple logistic regression model (Hosmer and Lemeshow 2000) will be used here to create a prediction equation from these two variables. That model uses the following equation:

\[\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 A + \beta_2 B\]

where \(p\) is the probability that a sample is the “PS” class and the \(\beta\) values are the model parameters that need to be estimated from the data.
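Solving the model equation for the probability makes the link between the linear predictor and the class probability explicit. This is a standard algebraic rearrangement of the equation above, not an additional modeling step:

\[p = \frac{1}{1 + \exp\left(-(\beta_0 + \beta_1 A + \beta_2 B)\right)}\]

A linear predictor of zero therefore corresponds to \(p = 0.5\), and larger values push the probability of the “PS” class toward one.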

Figure 1.2: (a) An example data set and (b) the ROC curve from a simple logistic regression model.

A standard procedure (maximum likelihood estimation) is used to estimate the three regression parameters from the data. The authors used 1009 data points to estimate the parameters (i.e., a training set) and reserved 1010 samples strictly for estimating performance (a test set) [5]. Using the training set, the parameters were estimated to be \(\hat{\beta}_0 = 1.73\), \(\hat{\beta}_1 = 0.003\), and \(\hat{\beta}_2 = -0.064\).
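To make the fitted equation concrete, the following minimal sketch plugs the estimates quoted above into the logistic model and converts the resulting log-odds to a probability. The function name `prob_ps` and the predictor values passed to it are hypothetical and chosen only for illustration; they are not taken from the Hill et al. data.

```python
import math

# Coefficient estimates quoted in the text (fit on the training set).
BETA_0 = 1.73
BETA_1 = 0.003
BETA_2 = -0.064


def prob_ps(a: float, b: float) -> float:
    """Predicted probability of the "PS" class for predictor values a and b."""
    log_odds = BETA_0 + BETA_1 * a + BETA_2 * b
    # Inverse of the logit: p = 1 / (1 + exp(-log_odds)).
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical predictor values, used only to show the direction of the effects.
for a, b in [(100.0, 10.0), (100.0, 50.0)]:
    print(f"A={a}, B={b}: P(PS) = {prob_ps(a, b):.3f}")
```

Because the estimate for B is negative, increasing B while holding A fixed lowers the predicted probability of the “PS” class.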

To evaluate the model, predictions are made on the test set. Logistic regression naturally produces class probabilities that give an indication of likelihood for each class. While it is common to use a 50% cutoff to make hard class predictions, the performance derived from this default might be misleading. To avoid applying a probability cutoff, a technique called the receiver operating characteristic (ROC) curve is used here [6]. The ROC curve evaluates the results on all possible cutoffs and plots the true positive rate versus the false positive rate. The curve for this example is shown in Figure 1.2b. The best possible curve is one that is shifted as close as possible to the upper left corner, while an ineffective model will stay along the dashed diagonal line. A common summary value for this technique is the area under the ROC curve, where a value of 1.0 corresponds to a perfect model while values near 0.5 are indicative of a model with no predictive ability. For the current logistic regression model, the area under the ROC curve is 0.794, which indicates moderate accuracy in classifying the response.
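The cutoff sweep behind the ROC curve can be sketched in a few lines. The code below is a minimal illustration, not the procedure used to produce Figure 1.2b: the function names `roc_curve` and `auc` are hypothetical, the scores and labels are made up, and the area is computed with the trapezoidal rule.

```python
def roc_curve(scores, labels):
    """Return (fpr, tpr) lists from sweeping the cutoff over every score.

    labels are 1 for the event class and 0 otherwise; a higher score
    means more evidence for the event class.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    i = 0
    while i < len(order):
        cutoff = scores[order[i]]
        # Samples tied at the same score cross the cutoff together.
        while i < len(order) and scores[order[i]] == cutoff:
            if labels[order[i]] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        fpr.append(fp / n_neg)
        tpr.append(tp / n_pos)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2.0
        for k in range(1, len(fpr))
    )


# Made-up class probabilities and true classes (1 = event), for illustration.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
fpr, tpr = roc_curve(scores, labels)
print(f"AUC = {auc(fpr, tpr):.4f}")  # AUC = 0.8125
```

The area equals the fraction of (event, non-event) pairs in which the event sample receives the higher score, which is why 0.5 corresponds to no predictive ability.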

Given these two predictor variables, it would make sense to try different transformations and encodings of these data in an attempt to increase the area under the ROC curve. Since the predictors are both greater than zero and appear to have right-skewed distributions, one might be inclined to take the ratio A/B and enter only this term in the model. Alternatively, we could also evaluate whether simple transformations of each predictor would be helpful. One method is the Box-Cox transformation [7], which uses a separate estimation procedure prior to the logistic regression model that can put the predictors on a new scale. Here, the Box-Cox estimation procedure recommended that both predictors be used on the inverse scale (i.e., 1/A instead of A). This representation of the data is shown in Figure 1.3a. When these transformed values were entered into the logistic regression model in lieu of the original values, the area under the ROC curve changed from 0.794 to 0.848, which is a substantial increase. Figure 1.3b shows both curves. In this case, the ROC curve corresponding to the transformed predictors is uniformly better than the original result.
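The sketch below illustrates one common way a Box-Cox parameter can be estimated: a grid search over the normal profile log-likelihood. It is an assumption-laden illustration rather than the procedure used for Figure 1.3; the function names, the grid, and the synthetic sample are all hypothetical. The sample is built so that its reciprocal is roughly symmetric, so the search would be expected to land near \(\lambda = -1\) (the inverse scale), although the exact value depends on the sample and the grid.

```python
import math
from statistics import NormalDist


def box_cox(x: float, lam: float) -> float:
    """Box-Cox transform of a single positive value x for a given lambda."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam


def profile_loglik(values, lam):
    """Normal profile log-likelihood of lambda, up to an additive constant."""
    n = len(values)
    t = [box_cox(v, lam) for v in values]
    mean = sum(t) / n
    var = sum((u - mean) ** 2 for u in t) / n
    # The second term is the log-Jacobian of the transformation.
    return -0.5 * n * math.log(var) + (lam - 1.0) * sum(math.log(v) for v in values)


def estimate_lambda(values, grid=None):
    """Pick the lambda on a coarse grid that maximizes the profile log-likelihood."""
    if grid is None:
        grid = [k / 10 for k in range(-20, 21)]  # -2.0, -1.9, ..., 2.0
    return max(grid, key=lambda lam: profile_loglik(values, lam))


# Synthetic right-skewed sample (hypothetical): reciprocals of evenly spaced
# quantiles of a Normal(5, 1) distribution.
n = 200
quantiles = [NormalDist(mu=5.0, sigma=1.0).inv_cdf((i + 0.5) / n) for i in range(n)]
sample = [1.0 / q for q in quantiles]
print("estimated lambda:", estimate_lambda(sample))
print("lambda = -1 maps 4.0 to", box_cox(4.0, -1.0))  # (1/4 - 1) / (-1) = 0.75
```

With \(\lambda = -1\) the transform is \(1 - 1/x\), a linear function of the inverse, so a model fit on it is equivalent to one fit on \(1/x\) up to a change of sign and intercept.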

Figure 1.3: (a) The transformed predictors and (b) the ROC curve from both logistic regression models.

This example demonstrates how an alteration of the predictors, in this case a simple transformation, can improve the effectiveness of the model. When comparing the data in Figures 1.2a and 1.3a, it is easier to visually discriminate the two groups of data in the latter. In transforming the data individually, we enabled the logistic regression model to do a better job of separating the classes. Depending on the nature of the predictors, using the inverse of the original data might make inferential analysis more difficult.

However, different models have different requirements of the data. If the skewness of the original predictors was the issue affecting the logistic regression model, other models exist that do not have the same sensitivity to this characteristic. For example, a neural network can also be used to fit these data without using the inverse transformation of the predictors [8]. This model was able to achieve an area under the ROC curve of 0.848, which is roughly equivalent to the improved logistic model results. One might conclude that the neural network model is inherently always better than logistic regression since the neural network was not susceptible to the distributional aspects of the predictors. But we should not jump to a blanket conclusion like this due to the “No Free Lunch” theorem (see Section 1.2.3). Additionally, the neural network model has its own drawbacks: it is completely uninterpretable and requires extensive parameter tuning to achieve good results. Depending on how the model will be utilized, one of these models might be more favorable than the other for these data.
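The footnote attached to the neural network notes that normalizing the predictors to mean zero and unit standard deviation does not change their skewness. That follows because skewness is invariant to shifting and positive rescaling, and it can be checked numerically. The sketch below uses made-up values and hypothetical helper names (`skewness`, `standardize`); it is not based on the data in this example.

```python
import statistics


def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def standardize(values):
    """Center to mean zero and scale to unit (population) standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


# Made-up right-skewed values, for illustration only.
raw = [1.0, 1.5, 2.0, 2.5, 3.0, 4.0, 6.0, 9.0, 15.0, 30.0]
scaled = standardize(raw)

print(f"skewness before scaling: {skewness(raw):.6f}")
print(f"skewness after scaling:  {skewness(scaled):.6f}")
```

The two printed values agree up to floating-point error, so any benefit the neural network shows here is not attributable to the normalization step removing skewness.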

A more comprehensive summary of how different features can influence models in different ways is given shortly in Section 1.3. Before that, the next section discusses several key concepts that will be used throughout this text.

  5. These types of data sets are discussed more in Section 3.3.

  6. This technique is discussed in more detail in Section 3.2.

  7. Discussed in Section 6.1.

  8. The predictor values were normalized to have mean zero and a standard deviation of one, as is needed for this model. However, this does not affect the skewness of the data.