1.4 Feature Selection

In the previous example, new sets of features were derived sequentially to improve the performance of the model. These sets were developed, added to the model, and then resampling was used to evaluate their utility. The new predictors were not prospectively filtered for statistical significance before being added to the model. Filtering in this way would be a supervised procedure, and care must be taken to ensure that overfitting does not occur.

In that example, it was demonstrated that some of the predictor sets contain enough underlying information to adequately predict the outcome (such as sets 1, 2, and 5). However, this collection of predictors may well contain non-informative variables, and these might degrade performance to some extent. To whittle the predictors down to a smaller set that contains only the informative ones, a supervised feature selection technique could be used. Additionally, there may be a small number of important predictors in sets 3 and 4 whose utility was not discovered because of all of the non-informative variables in those sets.

In other cases, all of the raw predictors are known and available at the beginning of the modeling process. Here, a less sequential approach might be taken: simply apply a feature selection routine to sort out the best and worst predictors.

There are a number of different strategies for supervised feature selection, and these are discussed in Chapters 10 through 12. One distinction between search methods is how the subsets are derived:

  • Wrapper methods use an external search procedure to choose different subsets of the whole predictor set to evaluate in a model. This separates the feature search process from the model fitting process. Examples include backwards or stepwise selection as well as genetic algorithms.
  • Embedded methods are models where the feature selection procedure occurs naturally in the course of the model fitting process. An example would be a simple decision tree, where variables are selected when the model uses them in a split. If a predictor is never used in a split, the prediction equation is functionally independent of that variable and it has been selected out. Both approaches are sketched below.
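
To make the distinction concrete, below is a minimal sketch of both strategies in Python. The use of scikit-learn, the synthetic data, the linear and tree models, and the subset size are all illustrative assumptions, not the approaches taken in the example above.

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.linear_model import LinearRegression
    from sklearn.tree import DecisionTreeRegressor

    # Synthetic data: 10 predictors, only 3 of which are informative
    X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                           random_state=0)

    # Wrapper: backwards selection. An external search refits the linear
    # model on candidate subsets and keeps the subset that resamples best.
    wrapper = SequentialFeatureSelector(LinearRegression(),
                                        n_features_to_select=3,
                                        direction="backward", cv=5)
    wrapper.fit(X, y)
    print("wrapper keeps columns:", np.flatnonzero(wrapper.get_support()))

    # Embedded: a decision tree picks variables as a by-product of fitting;
    # any predictor never used in a split has been selected out.
    tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
    used = np.unique(tree.tree_.feature[tree.tree_.feature >= 0])
    print("tree splits on columns:", used)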

As with model fitting, the main concern during feature selection is overfitting. This is especially true when wrapper methods are used and/or if the number of data points in the training set is small relative to the number of predictors.

Finally, unsupervised selection methods can have a very positive effect on model performance. Recall the Ames housing data. A property’s neighborhood might be a useful predictor in the model. Since most models require predictors to be represented as numbers, it is common to encode such data as dummy or indicator variables. In this case, the single neighborhood predictor, with 28 possible values, is converted to a set of 27 binary variables that have a value of one when a property is in that neighborhood and zero otherwise. While this is a well-known and common approach, here it leads to cases where two of the neighborhoods contain only one or two properties, less than 1% of the overall set. With such a low frequency, these predictors might have a detrimental effect on some models (such as linear regression), and removing them prior to building the model might be advisable.
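
As a rough illustration of such an unsupervised filter, the snippet below dummy encodes a categorical column and then drops indicators that are "on" for less than 1% of the rows. The ames data frame, its made-up counts, and the 1% threshold are assumptions for the sketch, not the actual preprocessing used for these data.

    import pandas as pd

    # Toy stand-in for the neighborhood column (counts are invented)
    ames = pd.DataFrame({"Neighborhood": ["NAmes"] * 90
                                         + ["Gilbert"] * 59
                                         + ["Landmark"] * 1})

    # With 28 levels this would yield 27 indicator columns; drop_first
    # removes one baseline level so the encoding stays full rank
    dummies = pd.get_dummies(ames["Neighborhood"], drop_first=True)

    # Unsupervised filter: no outcome is consulted, only the frequency
    # of each indicator; drop columns that are one for < 1% of rows
    keep = dummies.columns[dummies.mean() >= 0.01]
    dummies = dummies[keep]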

When conducting a search for a subset of variables, it is important to realize that there may not be a unique set of predictors that produces the best performance. There is often a compensatory effect where, when one seemingly important variable is removed, the model adjusts using the remaining variables. This is especially true when there is some degree of correlation between the explanatory variables or when a low-bias model is used. For this reason, feature selection should not be used as a formal method of determining feature significance. More traditional inferential statistical approaches are a better solution for appraising the contribution of a predictor to the underlying model or to the data set.
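
A small synthetic experiment can illustrate this compensatory effect: only x1 truly generates the outcome below, yet because x2 is nearly a copy of it, dropping x1 barely changes a linear model's cross-validated performance. All of the data and settings here are fabricated purely for illustration.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    x1 = rng.normal(size=500)
    x2 = x1 + rng.normal(scale=0.1, size=500)   # nearly a copy of x1
    y = 3 * x1 + rng.normal(size=500)           # only x1 drives the outcome

    X_full = np.column_stack([x1, x2])
    X_drop = x2.reshape(-1, 1)                  # remove the "important" x1

    for label, X in [("x1 and x2", X_full), ("x2 only", X_drop)]:
        r2 = cross_val_score(LinearRegression(), X, y, cv=5).mean()
        print(f"{label}: mean R^2 = {r2:.3f}")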