## 10.2 Classes of Feature Selection Methodologies

Feature selection methodologies fall into three general classes: intrinsic (or implicit) methods, filter methods, and wrapper methods. Intrinsic methods have feature selection naturally incorporated into the modeling process, whereas filter and wrapper methods work to marry feature selection approaches with modeling techniques. The most seamless and important of the three classes for reducing features are intrinsic methods. Some examples include:

Tree- and rule-based models. These models search for the best predictor and split point such that the outcomes are more homogeneous within each new partition (recall Section 5.7). Therefore, if a predictor is not used in any split, it is functionally independent of the prediction equation and has been excluded from the model. Ensembles of trees have the same property, although some algorithms, such as random forest, deliberately force splits on irrelevant predictors when constructing the tree. This results in the over-selection of predictors (as shown below).
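To make the intrinsic property concrete, here is a minimal sketch using scikit-learn (not the book's own code). The data are simulated: five hypothetical predictors where only one carries signal. A depth-one tree can split on only a single predictor, so the other four are excluded from the prediction equation by construction.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Five hypothetical predictors; only column 2 determines the class.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 2] > 0).astype(int)

# A depth-1 tree makes exactly one split, so only one predictor can
# enter the model; the rest are implicitly "selected out".
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
print(np.flatnonzero(stump.feature_importances_))  # → [2]
```

The nonzero entries of `feature_importances_` identify which predictors actually appear in the fitted tree.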

Multivariate adaptive regression spline (MARS) models. These models create new features of the data that involve one or two predictors at a time (Section 6.2.1). These features are then added to a linear model in sequence. Like trees, if a predictor is not involved in at least one MARS feature, it is excluded from the model.
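The features MARS constructs are hinge functions of single predictors; a minimal sketch of the feature itself (the knot value here is purely illustrative, not estimated as MARS would):

```python
import numpy as np

def hinge(x, knot):
    """A MARS-style hinge feature: h(x) = max(0, x - knot)."""
    return np.maximum(0.0, x - knot)

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(hinge(x, 0.0))  # → [0. 0. 0. 1. 2.]
```

MARS adds pairs of such features (one per direction around the knot) to a linear model, so any predictor that never appears inside a hinge drops out of the fit entirely.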

Regularization models. The regularization approach penalizes or shrinks predictor coefficients to improve the model fit. The lasso (Section 7.3) uses a type of penalty that can shrink coefficients to exactly zero. This forces predictors to be excluded from the final model.
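A small scikit-learn sketch of this behavior on simulated data (the sample size, coefficients, and penalty value are illustrative): with a sufficiently large penalty, the lasso sets the coefficients of the irrelevant predictors to exactly zero, performing selection as a side effect of fitting.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Ten hypothetical predictors; only the first two have real signal.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=300)

# The L1 penalty zeroes out the eight irrelevant coefficients exactly.
fit = Lasso(alpha=0.1).fit(X, y)
print(np.flatnonzero(fit.coef_))  # → [0 1]
```

The surviving predictors are simply those with nonzero entries in `coef_`.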

The advantage of implicit feature selection methods is that they are relatively fast since the selection process is embedded within the model fitting process; no external feature selection tool is required. Also, implicit methods provide a direct connection between selecting features and the *objective function*. The objective function is the statistic that the model attempts to optimize (e.g. the likelihood in generalized linear models or the impurity in tree-based models). The direct link between selection and modeling objective makes it easier to make informed choices between feature sparsity and predictive performance.
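The impurity objective mentioned above can be made concrete. A short sketch of the Gini impurity that classification trees commonly minimize when choosing splits:

```python
def gini_impurity(counts):
    """Gini impurity of a node: 1 - sum over classes of p_k squared."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini_impurity([50, 50]))  # → 0.5 (evenly mixed node)
print(gini_impurity([100, 0]))  # → 0.0 (pure node)
```

A split is chosen to reduce this quantity, so a predictor is only "selected" when it helps drive the objective down.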

The main downside to intrinsic feature selection is that it is model-dependent. If the data are better fit by a non-intrinsic feature selection type of model, then predictive performance may be sub-optimal when all features are used. In addition, some intrinsic models (like single tree models) use a *greedy* approach to feature selection. Greedy approaches usually identify a narrow set of features that have sub-optimal predictive performance. A more detailed discussion of this problem appears later in this chapter.

If a model does not have intrinsic feature selection, then some sort of search procedure is required to identify feature subsets that improve predictive performance. There are two general classes of techniques for this purpose: *filters* and *wrappers*.

Filter methods conduct an initial supervised analysis of the predictors to determine which are important and then provide only these to the model. Here, the search is performed just once. Filter methods often consider each predictor separately, although this is not a requirement. An example of filtering occurred in Section 5.6 where we considered a set of keywords derived from the OkCupid text fields. The relationship between the occurrence of each keyword and the outcome was assessed using an odds ratio. The rationale to keep or exclude a word (i.e., the filtering rule) was based on statistical significance as well as the magnitude of the odds ratio. The words that made it past the filter were then added to the logistic regression model. Since each keyword was considered separately, the selected words are unlikely to capture independent trends in the data. For instance, the words `programming` and `programmer` were both selected by the filtering criteria and represent nearly the same meaning with respect to the outcome.
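A sketch of this kind of keyword filter, using only the standard library. The 2x2 counts, the significance cutoff, and the two-fold odds-ratio threshold are all made up for illustration; they are not the values from the OkCupid analysis.

```python
import math

# Hypothetical 2x2 counts for one keyword:
#                  outcome=yes   outcome=no
# keyword present       60           40
# keyword absent       100          300
a, b, c, d = 60, 40, 100, 300

odds_ratio = (a * d) / (b * c)                 # sample odds ratio
se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)   # SE of the log odds ratio
z = math.log(odds_ratio) / se_log_or           # Wald z-statistic

# Illustrative rule: significant at roughly the 5% level AND at least
# a two-fold change in the odds, in either direction.
keep = abs(z) > 1.96 and (odds_ratio > 2 or odds_ratio < 0.5)
print(round(odds_ratio, 2), keep)  # → 4.5 True
```

Each keyword would be screened this way in isolation, which is exactly why redundant words like `programming` and `programmer` can both pass the filter.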

Filters are simple and tend to be fast. In addition, they can be effective at capturing the large trends (i.e., individual predictor-outcome relationships) in the data. However, as just mentioned, they are prone to over-selecting predictors. In many cases, some measure of statistical significance, such as a raw or multiplicity-adjusted p-value, is used to judge “importance”. In these cases, there may be a disconnect between the objective function for the filter method (e.g., significance) and what the model requires (predictive performance). In other words, a set of predictors that meets a filtering criterion like statistical significance may not be a set that improves predictive performance^{78}. Filters will be discussed in more detail in Chapter 11.

Wrapper methods use iterative search procedures that repeatedly supply predictor subsets to the model and then use the resulting model performance estimate to guide the selection of the *next subset* to evaluate. If successful, a wrapper method will iterate to a smaller set of predictors that has better predictive performance than the original predictor set. Wrapper methods can take either a *greedy* or non-greedy approach to feature selection. A greedy search is one that chooses the search path based on the direction that seems *best at the time* in order to achieve the best immediate benefit. While this can be an effective strategy, it may show immediate gains in predictive performance that stall out at a locally best setting. A non-greedy search method would re-evaluate previous feature combinations and would have the ability to move in a direction that is initially unfavorable if it appears to have a potential benefit *after* the current step. This allows the non-greedy approach to escape being trapped in a local optimum.

An example of a greedy wrapper method is backwards selection (otherwise known as *recursive feature elimination* or RFE). Here, the predictors are initially ranked by some measure of importance. An initial model is created using the complete predictor set. The next model is based on a smaller set of predictors where the least important have been removed. This process continues down a prescribed path (based on the ranking generated by the importances) until a very small number of predictors are in the model. Performance estimates are used to determine when too many features have been removed; hopefully a smaller subset of predictors can result in an improvement. Notice that the RFE procedure is greedy in that it considers the variable ranking as the search direction. It does not re-evaluate the search path at any point or consider subsets of mixed levels of importance. This approach to feature selection will likely fail if there are important interactions between predictors where only one of the predictors is significant in the presence of the other(s). RFE will be discussed in more detail in Chapter 11 and in a section below.
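The RFE procedure described above is available directly in scikit-learn; a minimal sketch on simulated data (the predictor count, coefficients, and subset size are illustrative):

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Six hypothetical predictors; only columns 0 and 3 drive the outcome.
rng = np.random.default_rng(2)
X = rng.normal(size=(400, 6))
y = (2 * X[:, 0] - 2 * X[:, 3] + rng.normal(size=400) > 0).astype(int)

# RFE fits the full model, ranks predictors by importance (here, the
# magnitude of the logistic regression coefficients), then greedily
# removes the least important until the requested number remains.
rfe = RFE(LogisticRegression(), n_features_to_select=2).fit(X, y)
print(np.flatnonzero(rfe.support_))  # → [0 3]
```

Note that the elimination order follows the importance ranking; subsets that mix high- and low-ranked predictors are never revisited, which is what makes the search greedy.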

Examples of non-greedy wrapper methods are genetic algorithms (GA) and simulated annealing (SA). The SA method is non-greedy since it incorporates randomness into the feature selection process. The random component of the process helps SA to find new search spaces that often lead to better results. Both of these techniques will be discussed in detail in Chapter 12.
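A bare-bones sketch of simulated annealing over feature subsets, using only the standard library. The scoring function here is a toy stand-in for a resampled performance estimate, and the "truly relevant" subset, iteration count, and cooling schedule are all assumptions made for illustration.

```python
import math
import random

def anneal_features(score, n_features, n_iter=500, temp0=1.0, seed=0):
    """Minimize score(subset) over feature subsets via simulated annealing."""
    rng = random.Random(seed)
    current = [rng.random() < 0.5 for _ in range(n_features)]
    cur_val = score(current)
    best, best_val = current[:], cur_val
    for i in range(n_iter):
        temp = temp0 * 0.99 ** i                 # cooling schedule
        cand = current[:]
        cand[rng.randrange(n_features)] ^= True  # flip one feature in/out
        cand_val = score(cand)
        delta = cand_val - cur_val
        # Non-greedy step: a worse subset may still be accepted, with
        # probability exp(-delta / temp), allowing escape from local optima.
        if delta < 0 or rng.random() < math.exp(-delta / max(temp, 1e-12)):
            current, cur_val = cand, cand_val
            if cur_val < best_val:
                best, best_val = current[:], cur_val
    return best, best_val

# Toy objective standing in for a resampled error estimate: distance
# from a hypothetical truly relevant subset {0, 2}.
truth = {0, 2}
score = lambda s: len({i for i, keep in enumerate(s) if keep} ^ truth)

best, best_val = anneal_features(score, n_features=6)
```

The acceptance rule in the middle is what distinguishes SA from a greedy search: early on, when the temperature is high, unfavorable moves are accepted fairly often; as the temperature decays, the search becomes increasingly greedy.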

Wrappers have the potential advantage of searching a wider variety of predictor subsets than simple filters or models with built-in feature selection. They have the most potential to find the globally best predictor subset (if it exists). The primary drawback is the computational time required for these methods to find the optimal or near optimal subset. The additional time can be excessive to the extent of being counter-productive. The computational time problem can be further exacerbated by the type of model with which it is coupled. For example, the models that are in most need of feature selection (e.g. SVMs and neural networks) can be very computationally taxing themselves. Another disadvantage of wrappers is that they have the most potential to overfit the predictors to the training data and require *external validation* (as discussed below).

Figure 10.1 visually contrasts greedy and non-greedy search methods using the Goldstein-Price equation:

\[\begin{align} f(x, y) &= [1+(x+y+1)^2(19-14x+3x^2-14y+6xy+3y^2)] \times \notag \\ & [30+(2x-3y)^2(18-32x+12x^2+48y-36xy+27y^2)] \notag \end{align}\]

The minimum of this function, where both \(x\) and \(y\) are between -2 and 2, is located at \(x = 0\) and \(y = -1\). The function’s outcome is relatively flat in the middle with large values along the top and lower-right edges. Panel (a) shows the results of two greedy methods that use gradient descent to find the minimum. The first starting value (in red, top right) moves along a path defined by the gradient and stalls at a local minimum. It is unable to re-evaluate the search. The second attempt, starting at \(x = -1\) and \(y = 1\), follows a different greedy path and, after some course corrections, finds a solution close to the best value. Each change in the path was to a new direction of immediate benefit. Panel (b) shows a global search method called *simulated annealing*. This is a controlled but random approach that will be discussed in greater detail later. It generates a new candidate point using random deviations from the current point. If the new point is better, the process continues. If it is worse, it may still be accepted with a certain probability. In this way, the global search does not always proceed in the current best direction and tends to explore a broader range of candidate values. In this particular search, a value near the optimum is found very quickly.
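The stated minimum can be checked numerically. A direct transcription of the equation above confirms that the function equals 3 at \(x = 0\), \(y = -1\):

```python
def goldstein_price(x, y):
    """The Goldstein-Price test function, transcribed from the equation."""
    a = 1 + (x + y + 1)**2 * (19 - 14*x + 3*x**2 - 14*y + 6*x*y + 3*y**2)
    b = 30 + (2*x - 3*y)**2 * (18 - 32*x + 12*x**2 + 48*y - 36*x*y + 27*y**2)
    return a * b

# The global minimum on the [-2, 2] x [-2, 2] region is at (0, -1).
print(goldstein_price(0.0, -1.0))  # → 3.0
```

At the minimum, the first bracketed factor collapses to 1 because \((x+y+1)^2 = 0\), and the second evaluates to 3.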

Given the advantages and disadvantages of the different types of feature selection methods, how should one utilize these techniques? A good strategy, which we use in practice, is to start the feature selection process with one or more intrinsic methods to see what they yield. Note that it is unrealistic to expect that models using intrinsic feature selection would select the same predictor subset, especially if linear and nonlinear methods are being compared. If a nonlinear intrinsic method has good predictive performance, then we could proceed to a wrapper method that is combined with a nonlinear model. Similarly, if a linear intrinsic method has good predictive performance, then we could proceed to a wrapper method combined with a linear model. If multiple approaches select *large* predictor sets, this may imply that reducing the number of features is not feasible.

Before proceeding, let’s examine the effect of irrelevant predictors on different types of models and how the effect changes over different scenarios.