7 Detecting Interaction Effects

7.1 What are Interactions?

In problems where prediction is the primary purpose, the majority of variation in the response can be explained by the cumulative effect of the important individual predictors. To this point in the book, we have focused on developing methodology for engineering categorical or numeric predictors such that the engineered versions of the predictors have better representations for uncovering and/or improving the predictive relationships with the response. For many problems, additional variation in the response can be explained by the effect of two or more predictors working in conjunction with each other. As a simple conceptual example of predictors working together, consider the effects of water and fertilizer on the yield of a field corn crop. With no water but some fertilizer, the crop of field corn will produce no yield since water is a necessary requirement for plant growth. Conversely, with a a sufficient amount of water but no fertilizer, a crop of field corn will produce some yield. However, yield is best optimized with a sufficient amount of water and a sufficient amount of fertilizer. Hence water and fertilizer, when combined in the right amounts, produce a yield that is greater than what either would produce alone. More formally, two or more predictors are said to interact if their combined effect is different (less or greater) than what we would expect if we were to add the impact of each of their effects when considered alone. Note that interactions, by definition, are always in the context of how predictors relate to the outcome. Correlations between predictors, for example, are not directly related to whether there is an interaction effect or not. Also, from a notational standpoint, the individual variables (e.g. fertilizer and water) are referred to as the main effect terms when outside of an interaction.

The predictive importance of interactions were first discussed in Chapter 2 where the focus of the problem was trying to predict an individual’s risk of ischemic stroke through the use of predictors based on the images of their carotid arteries. In this example, the predictive ability of the model was improved by including the interaction between the degree of remodeling of the arterial wall and the arterial wall’s maximum thickness. A visualization of the relationship between these predictors was presented in Figure 2.7 (a), where the contours represent equivalent multiplicative effects between the predictors. Part (b) of the same figure illustrated how the combination of theses predictors generated enhanced separation between patients who did and did not have a stroke.

In the previous example, the important interaction was between two continuous predictors. But important interactions can also occur between two categorical predictors or between a continuous and a categorical predictor which can be found in the Ames Housing data. Figure 7.1(a) illustrates the relationship between the age of a house (x-axis) and sale price of a house (y-axis), categorized by whether or not a house has air conditioning. In this example, the relationship between the age of a house and sales price is increasing for houses with air conditioning; on the other hand, there is no relationship between these predictors for homes without air conditioning. In contrast, part (b) of this figure displays two predictors that do not interact: home age and overall condition. Here the condition of the home does not affect the relationship between home and and sales price. The lack of interaction is revealed by the near parallel lines for each condition type.

Plots of Ames housing variables where the predictors interact (a) and do not interact (b).

Fig. 7.1: Plots of Ames housing variables where the predictors interact (a) and do not interact (b).

To better understand interactions, let’s consider a simple example with just two predictors. An interaction between two predictors can be mathematically represented as:

\[ y = \beta_0 + \beta_{1}x_1 + \beta_{2}x_2 + \beta_{3}x_{1}x_{2} + error\] In this equation, \(\beta_0\) represents the overall average response, \(\beta_1\) and \(\beta_2\) represent the average rate of change due to \(x_1\) and \(x_2\), respectively, and \(\beta_3\) represents the incremental rate of change due to the combined effect of \(x_1\) and \(x_2\) that goes beyond what \(x_1\) and \(x_2\) can explain alone. The error term represents the random variation in real data that cannot be explained in the deterministic part of the equation. Using data, the \(\beta\) parameters can be estimated using methods like linear regression for a continuous response or logistic regression for a categorical response. Once the parameters have been estimated, the usefulness of the interaction for explaining variation in the response can be determined. There are four possible cases:

  1. If \(\beta_3\) not significantly different from zero, then the interaction between \(x_1\) and \(x_2\) is not useful for explaining variation of the response. In this case, the relationship between \(x_1\) and \(x_2\) is called additive.

  2. If the coefficient is meaningfully negative while \(x_1\) and \(x_2\) alone also affect the response, then interaction is called antagonistic.

  3. At the other extreme, if the coefficient is positive while \(x_1\) and \(x_2\) alone also affect the response, then interaction is called synergystic.

  4. The final scenario occurs when the coefficient for the interaction is significantly different from zero, but either one or both of \(x_1\) or \(x_2\) do not affect the response. In this case, the average response of \(x_1\) across the values of \(x_2\) (or vice versa) has a rate of change that is essentially zero. However, the average value of \(x_1\) at each value of \(x_2\) (or conditioned on \(x_2\)) is different from zero. This situation occurs rarely in real data, and we need to understand the visual cues for identifying this particular case. This type of interaction will be called atypical.

To illustrate the first three types of interactions, simulated data will be generated using the formula from above with coefficients as follows: \(\beta_0 = 0\), \(\beta_1 = \beta_2 = 1\), and \(\beta_3 = -10\), 0, or 10 to illustrate antagonism, no interaction, or synergism, respectively. A random, uniform set of 200 samples were generated for \(x_1\) and \(x_2\) between the values of 0 and 1, and the error for each sample was pulled from a normal distribution. A linear regression model was used to estimate the coefficients and a contour plot of the predicted values from each model was generated (Figure 7.2). In this figure, the additive case has parallel contour lines with an increasing response from lower left to upper right. The synergistic case also has increasing response from lower left to upper right, but the contours are curved indicating that lower values of each predictor are required to elicit the same response value. The antagonistic case also displays curved contours; here the response decreases from lower left to upper right. What this figure reveals is that the response profile changes when an interaction between predictors is real and present.

Contour plots of the predicted response from a model between two predictors with a synergystic interaction, no interaction (additive), and an antagonistic interaction.

Fig. 7.2: Contour plots of the predicted response from a model between two predictors with a synergystic interaction, no interaction (additive), and an antagonistic interaction.

The atypical case is more clearly illustrated when \(x_1\) and \(x_2\) are categorical predictors. To simulate the atypical case, the two predictors will take values of ‘low’ and ‘high’, and the response will be generated by \(y = 2x_1x_2 + error\), where the low and high values of \(x_1\) and \(x_2\) will be numerically represented by -/+ 1, respectively. Figure 7.3 displays the relationship between the low and high levels of the factors. Notice that the average response for \(x_1\) at the low level and separately at the high level is approximately zero. However, when we condition the average on \(x_2\), represented by the two different colors, the responses are opposite at the two levels of \(x_1\). Here the individual effects of each predictor are not significant, but the conditional effect is strongly significant.

An example of an atypical interaction between two categorical factors.  In this case each factor alone is not significant, but the interaction between the factors is significant and is characterized by the tell-tale crossing pattern of the average values within groups.

Fig. 7.3: An example of an atypical interaction between two categorical factors. In this case each factor alone is not significant, but the interaction between the factors is significant and is characterized by the tell-tale crossing pattern of the average values within groups.

When we know which predictors interact, we can include them in a model. However, knowledge of which predictors interact may not be available. A number of recent empirical studies have shown that interactions can be uncovered by more complex modeling techniques. As Elith, Leathwick, and Hastie (2008) noted, tree-based models inherently model interactions between predictors through subsequent recursive splits in the tree. García-Magariños et al. (2009) showed that random forests were effective at identifying unknown interactions between single nucleotide polymorphisms; Lampa et al. (2014) found that a boosted tree model could uncover important interactions in epidemiology problems; and Chen et al. (2008) found that search techniques combined with a support vector machine was effective at identifying gene-gene interactions related to human disease. These findings prompt the question that if sophisticated predictive modeling techniques are indeed able to uncover the predictive information from interactions, then why should we spend any time or effort in pinpointing which interactions are important? The importance comes back to the underlying goal of feature engineering, which is to create features that improve the effectiveness of a model by containing predictively relevant information. By identifying and creating relevant interaction terms, the predictive ability of models that have better interpretability can be improved. Therefore, once we have worked to improve the form of individual predictors, the focus should then turn to searching for interactions among the predictors that could help explain additional variation in the response and improve the predictive ability of the model.

The focus of this chapter will be to explore how to search for and identify interactions between predictors that improve models’ predictive performance.

7.2 Guiding Principles in the Search for Interactions

First, expert knowledge of the system under study is critical for guiding the process of selecting interaction terms. Neter et al. (1996) recognize the importance of expert knowledge and recommend:

“It is therefore desirable to identify in advance, whenever possible, those interactions that are most likely to influence the response variable in important ways.”

Therefore, expert-selected interactions should be the first to be explored. In many situations, though, expert guidance may not be accessible or the area of research may be so new that there is no a priori knowledge to shepherd interaction term selection. When placed in a situation like this, we need sound guiding principles that will lead to a reasonable selection of interactions.

One field that has tackled this problem is statistical experimental design. Experimental design employs principles of control, randomization, and replication to construct a framework for gathering evidence to establish causal relationships between independent factors (i.e. predictors) and a dependent response. The process begins by determining the population (or units) of interest, the factors that will be systematically changed, and the specific ranges of the factors to be studied. Specific types of experimental designs, called screening designs, generate information in a structured way so that enough data is collected to identify the important factors and interactions among two or more factors. Figure 7.4 conceptually illustrates an experiment with three factors.

While data from an experiment has the ability to estimate all possible interactions among factors, it is unlikely that each of the interactions explain a significant amount of the variation in the measured response. That is, many of these terms are not predictive of the response. It is more likely, however, that only a few of the interactions, if any, are plausible or capture significant amounts of response variation. The challenge in this setting is figuring out which of the many possible interactions are truly important. Wu and Hamada (2011) provide a framework for addressing the challenge of identifying significant interactions. This framework is built on concepts of interaction hierarchy, effect sparsity, and effect heredity. These concepts have been shown to work effectively at identifying the most relevant effects for data generated by an experimental design. Furthermore, this framework can be extended to the predictive modeling framework.

The interaction hierarchy principle states that the higher degree of the interaction, the less likely the interaction will explain variation in the response. Therefore, pairwise interactions should be the first set of interactions to search for a relationship with the response, then three-way interactions, then four-way interactions, and so on. A pictorial diagram of effect hierarchy can be seen in Figure 7.4 for data that has three factors where the size of the nodes represents the likelihood that a term is predictive of the response. Neter et al. (1996) advises caution for including three-way interactions and beyond since these higher order terms rarely, if ever, capture a significant amount of response variation and are almost impossible to interpret.

The second principle, effect sparsity, contends that only a fraction of the possible effects truly explain a significant amount of response variation (Figure 7.4). Therefore, the approach could greatly pare down the possible main effects and interactions terms when searching for terms that optimize the predictive ability of a model.

A pictoral illustration of a set of first, second, and third-order terms from a full factorial experiment.  The terms are sized based on the interaction hierarchy principle.  The three-way interaction term is greyed out due to the effect sparsity principle.

Fig. 7.4: A pictoral illustration of a set of first, second, and third-order terms from a full factorial experiment. The terms are sized based on the interaction hierarchy principle. The three-way interaction term is greyed out due to the effect sparsity principle.

The effect heredity principle is based on principles of genetic heredity and asserts that interaction terms may only be considered if the ordered terms preceding the interaction are effective at explaining response variation. This principle has at least two possible implementations: strong heredity and weak heredity. The strong heredity implementation requires that for an interaction to be considered in a model, all lower-level preceding terms must explain a significant amount of response variation. That is, only the \(x_1 \times x_2\) interaction would be considered if the main effect of \(x_1\) and the main effect of \(x_2\) each explain a significant amount of response variation61. But if only one of the factors is significant, then the interaction will not be considered. Weak heredity relaxes this requirement and would consider any possible interaction with the significant factor. Figure 7.5 illustrates the difference between strong heredity and weak heredity for a three-factor experiment.

A pictoral illustration of the effect heredity principle for an example data set with three factors.  For this example, the $x_1$ main effect was found to be predictive of the response.  The weak heredity principle would explore the $x_1 x_2$ and $x_1 x_3$ interaction terms.  However, the $x_2 x_3$ interaction term would be excluded from investigation.  The strong heredity principle would not consider any second order terms since.

Fig. 7.5: A pictoral illustration of the effect heredity principle for an example data set with three factors. For this example, the \(x_1\) main effect was found to be predictive of the response. The weak heredity principle would explore the \(x_1 x_2\) and \(x_1 x_3\) interaction terms. However, the \(x_2 x_3\) interaction term would be excluded from investigation. The strong heredity principle would not consider any second order terms since.

While these principles have been shown to be effective in the search for relevant and predictive terms to be included in a model, they are not always applicable. For example, complex higher-order interactions have been shown to occur in nature. Bairey, Kelsic, and Kishony (2016) showed that high order interactions among species can impact ecosystem diversity. The human body is also a haven of complex chemical and biological interactions where multiple systems work together simultaneously to combat pathogens and to sustain life. These examples strengthen our point that the optimal approach to identifying important interaction terms will come from a combination of expert knowledge about the system we are modeling as well as carefully thought-out search routines based on sound statistical principles.

7.3 Practical Considerations

In the quest to locate predictive interactions, several practical considerations would need to be addressed, especially if there is little or no expert knowledge to suggest which terms should be screened first. A few of the questions to think through before discussing search methodology are: Is it possible with the data we have to enumerate and evaluate all possible predictive interactions? And if so, then should we? And should interaction terms be created before or after preprocessing the original predictors?

Let’s begin with the question of whether or not all of the interaction terms should be completely enumerated. To answer this question, the principles of interaction hierarchy and effect sparsity will be used to influence the approach. While high-order interactions (three-way and above) are possible for certain problems as previously mentioned, they likely occur infrequently (effect sparsity) and may only provide a small improvement on the predictive ability of the model (interaction hierarchy). Therefore, higher order interactions should be screened only if expert knowledge recommends investigating specific interaction terms. What is more likely to occur is that a fraction of the possible pairwise interaction terms contain predictively relevant information. Even though the recommendation here is to focus on searching through the pairwise interactions, it still may be practically impossible to evaluate a complete enumeration of these terms. More concretely, if we have \(p\) predictors, then there are \((p)(p-1)/2\) pairwise interaction terms. As the number of predictors increases, the number of interaction terms increases exponentially. With as few as 100 original predictors, complete enumeration requires a search of 4950 terms. A moderate increase to 500 predictors requires us to evaluate nearly 125,000 pairwise terms! More strategic search approaches will certainly be required for data that contain even a moderate numbers of predictors, and will be discussed in subsequent sections.

Another practical consideration is when interaction terms should be created relative to the preprocessing steps. Recall from Chapter 6 that preprocessing steps for individual numeric predictors can include centering, scaling, dimension expansion or dimension reduction. As shown in previous chapters, these steps help models to better uncover the predictive relationships between predictors and the response. What now must be understood is if the order of operations of preprocessing and creation of interaction terms affects the ability to find predictively important interactions. To illustrate how the order of these steps can affect the relationship of the interaction with the response, consider the interaction between the maximum remodeling ratio and the maximum stenosis by area for the stroke data (Chapter 2). For this data, the preprocessing steps were centering, scaling, and individually transformed. Figure 7.6 compares the distributions of the stroke groups when the interaction term is created before the preprocessing steps (a) and after the preprocessing steps (b). The box plots between the stroke groups make it clear that the interaction’s signal, captured by the shift between group distributions, is preserved when the interaction term is first created followed by the preprocessing steps. However, the interactive predictive signal is almost completely lost when the original predictors are preprocessed prior to creating the interaction term. This case demonstrates that we should be very thoughtful as to at what step interaction terms should be created. In general, interactions are most plausible and practically interpretable on the original scales of measurement. Therefore, the interaction terms should probably be created prior to any preprocessing steps. Also, this order of steps makes the most sense when we are considering more complex preprocessing steps of dimension expansion or dimension reduction.

An illustration of the impact of the order of operations for preprocessing predictors and creating interaction terms based on the Stroke data.  (a) Interactions terms are created then all terms are preprocessed.  (b) The predictors are first preprocessed then the unteraction terms are then created based on the preprocessed predictors.  The order of operations does matter; the interactive signal is preserved when interaction terms are created before preprocessing.

Fig. 7.6: An illustration of the impact of the order of operations for preprocessing predictors and creating interaction terms based on the Stroke data. (a) Interactions terms are created then all terms are preprocessed. (b) The predictors are first preprocessed then the unteraction terms are then created based on the preprocessed predictors. The order of operations does matter; the interactive signal is preserved when interaction terms are created before preprocessing.

7.4 The Brute-Force Approach to Identifying Predictive Interactions: Complete Enumeriation

For data sets that have a small to moderate number of predictors, all possible pairwise interactions for an association with the response can be evaluated. However, the more interaction terms that can be evaluated, the higher the probability that an interaction that is associated with the response due to random chance is found. More concretely, as more terms that are truly unrelated to the response are evaluated, the chance that at least one of them is associated with the response increases. Terms that have a statistically significant signal simply due to random chance and not due to a true relationship are called false positive findings. An entire sub-field of statistics is devoted to developing methodology for controlling the chance of false positive findings (Dickhaus 2014). False positive findings can substantially decrease a model’s predictive performance, and we need to protect against selecting these types of findings.

7.4.1 Simple Screening

The traditional approach to screening for important interaction terms is to use nested statistical models. For a linear regression model with two predictors, \(x_1\) and \(x_2\), the main effects model is:

\[ y = \beta_0 + \beta_1x_1 + \beta_2x_2 + error\]

The second model with main effects plus interactions is:

\[ y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_1x_2 + error\]

These two models are called “nested”" since the first model is a subset of the second. When models are nested, a statistical comparison can be made regarding the amount of additional information that is captured by the interaction term. For linear regression, the residual error is compared between these two models and the hypothesis test evaluates whether the improvement in error, adjusted for degrees of freedom, is sufficient to be considered real. The statistical test results in a p-value which reflects the probability that the additional information captured by the term is due to random chance. Small p-values, say less than 0.05, would indicate that there is less than a 5% chance that the additional information captured is due to randomness. Therefore, the smaller the value, the more likely that the information captured is due to a real effect of the interaction between the predictors. It should be noted that the 5% is the rate of false positive findings, and is a historical rule-of-thumb. However, if one is willing to take on more risk of false positive findings for a specific problem, then the cut-off can be set to a higher value.

For linear regression, the objective function used to compare models is the statistical likelihood (the residual error, in this case). For other models, such as logistic regression, the objective function to compare nested models would be the binomial likelihood.

An alternatively methodology for protecting against false positive findings was presented in Section 2.3. This methodology took the same nested model approach as above, but with two important differences:

  1. Resampling was used to create a number of nested model pairs for different versions of the training set. For each resample, the assessment set is used the evaluate the objective function. In the traditional approach, the same data used to create the model are re-predicted and used to evaluate the model (which is understandably problematic for some models). If the model overfits, computing the objective function from a separate data set allows a potential dissenting opinion that will reflect the overfitting.

  2. The other distinction between the resampling approach and the traditional approach is that any measure of performance can be used to compare the nested model; it need not be a statistically tractable metric. In Section 2.3, the area under the ROC curve was the objective function to make the comparison. We could have chosen to identify interactions that optimize sensitivity, specificity, accuracy, or any other appropriate metric. This provides great versatility and enables us to match the screening with an appropriate modeling technique and performance metric.

For both these reasons, but especially the second, the two approaches for evaluating nested models are not guaranteed to make consistent results. This is not problematic per se; different objective functions answer different questions and one cannot be said to be objectively better than the other without some context.

Friedman (2001) illustrates this potential disconnect between metrics. In his case, different metrics, binomial likelihood and the error rate, were used to choose the optimal the number of iterations for a boosted tree. Using the likelihood as the objective function, significantly fewer iterations was chosen when compared to using the misclassification rate. In his view:

“… the misclassification error rate continues to decrease well after the logistic likelihood has reached its optimum. This degrading the likelihood by overfitting actually improves misclassification error rate. Although perhaps counterintuitive, this is not a contradiction; likelihood and error rate measure different aspects of fit quality.”

Both the traditional approach and the resampling method from Section 2.3 generate p-values62. The more comparisons we make, the higher the chance that we find a false-positive interaction. There are a range of methods for controlling for false positive findings. At one extreme is to not control for these issues at all. We may choose to do this early in our explorations of potential interactions, keeping in mind that we will likely find terms that are false positives. At the other extreme is the Bonferroni correction (Shaffer 1995) which uses a very strict exponential penalty in order to minimize any false positive findings. Because of the severe nature of this penalty, it is not good when we need to perform many statistical tests. The false discovery rate (FDR) procedure, previously mentioned in Section 5.6, falls along this correction spectrum and was developed to minimize false positive findings when there are many tests (Benjamini and Hochberg 1995). Because complete enumeration can lead to exploring many possible pairwise interactions, the FDR can be a pragmatic compromise between no adjustment and the Bonferroni correction.

Let’s return to the stroke data example to illustrate evaluating interaction terms. Here nested models will be compared using the traditional approach with the p-values derived through resampling. Within each of these types of p-value computations, the original p-values will be compared to the FDR and Bonferroni adjusted p-values. Figure 7.7 illustrates the distribution of p-values under each calculation method and adjustment type. For this data, the traditional approach identifies more significant pairs regardless of adjustment type. Furthermore, the number of selected interactions decreases from no adjustment to the Bonferroni adjustment. The primary reason why the cross-validated approach identifies fewer terms is because of the resampling process. This process incorporates the natural variation in the data to identify pairs of predictors that have a true signal with respect to the response.

A comparison of the the distribution of p-values and adjustment method for p-values between traditional and cross-validated estimates for interactions among predictors that produce ROC values greater than 0.5.

Fig. 7.7: A comparison of the the distribution of p-values and adjustment method for p-values between traditional and cross-validated estimates for interactions among predictors that produce ROC values greater than 0.5.

7.4.2 The Least Absolute Shrinkage and Selection Operator (lasso)

Section 7.4 presented a way to search through potential interaction terms when complete enumeration is practically possible. This involved evaluating the change in model performance between the model with the individual predictors and the model with the individual predictors and the interaction between these predictors. This approach is simple to implement and can be effective at identifying interaction terms for further evaluation. However, evaluating interaction terms in a one-at-a-time fashion prevents the simplistic models from evaluating the importance of the interaction terms in the presence of the full model the individual predictors and pairwise interaction terms.

If complete enumeration is practically possible, then another approach to identifying important interactions would be to create and add all interaction terms to the data. By doing this, the data may contain more predictors (original predictors plus interaction terms) than samples. Some models such as trees (or ensembles of trees), neural networks, support vector machines, and K-nearest neighbors, are unaffected when the data contains more predictors than samples. However, other techniques like linear and logistic regression cannot be directly used under these conditions. But these models are much more interpretable and could provide good predictive performance if they could be utilized. A goal of feature engineering is to identify features that improve model performance while improving interpretability. Therefore, having a methodology for being able to utilize linear and logistic regression in the scenario where we have more predictors than samples would be useful.

A family of modeling techniques, called penalized models, has been developed to handle the case where we have more predictors than samples yet we still desire to use an interpretable model. To understand the basic premise of these tools, we need to briefly review the objective of linear regression. Suppose that our data consist of \(p\) predictors, labeled as \(x_1\), \(x_2\), through \(x_p\). Linear regression seeks to find the coefficients \(\beta_1\), \(\beta_2\), through \(\beta_p\) that minimize the sum of the squared distances (or error) from the estimated prediction of the samples to the observed values of the samples. The sum of squared errors can be written as:

\[ SSE = \sum_{i=1}^n (y_i-\widehat{y}_i)^2 \] where

\[ \widehat{y}_i = \widehat{\beta}_1x_1 + \widehat{\beta}_2x_2 + \ldots + \widehat{\beta}_px_p\] and \(\widehat{\beta}_j\) is the regression coefficient estimated from the available data that minimizes SSE. We can use calculus to solve the minimization problem. But it turns out that the solution requires inverting the covariance matrix of the predictors. If there are more predictors than samples or if one predictor can be written as a combination of one or more of the other predictors, then it is not possible to make the inversion. As predictors become more correlated with each other, the estimated regression coefficients get larger (inflate) and become unstable. Under these conditions, we need another solution. Hoerl (1970) recognized this challenge, and proposed altering the optimization problem to:

\[ SSE_{L_2} = \sum_{i=1}^n (y_i-\widehat{y}_i)^2 + \lambda \sum_{j=1}^P\beta_j^2 \]

Because the objective of the equation is to minimize the entire quantity, the \(\lambda\) term is called a penalty, and the technique is called ridge regression. As the regression coefficients grow large, the penalty must also increase to enforce the minimization. In essence, the penalty causes the resulting regression coefficients to become smaller and shrink towards zero. Moreover, by adding this simple penalty we can obtain reasonable parameter estimates. However, many of the parameter estimates may not be zero, thus making the resulting model interpretation much more difficult. Tibshirani (1996) made a simple and notable modification to the ridge optimization criteria and proposed minimizing the following equation:

\[ SSE_{L_1} = \sum_{i=1}^n (y_i-\widehat{y}_i)^2 + \lambda \sum_{j=1}^P|\beta_j| \]

This method was termed the least absolute shrinkage and selection operator (lasso). By modifying the penalization term, the lasso method forces regression coefficients to zero. In doing so, the lasso practically selects model terms down to an optimal subset of predictors and, as such, this method could be used on an entire data set of predictors and interactions to select those that yield the best model performance. Friedman, Hastie, and Tibshirani (2010) extended this technique so that it could be applied to the classification setting.

(a) The tuning parameter profile for the lasso method applied to the stroke data. (b) The distribution of the top selected interaction term, MaxMaxWallThickness x MaxStenosisByArea, by stroke outcome

Fig. 7.8: (a) The tuning parameter profile for the lasso method applied to the stroke data. (b) The distribution of the top selected interaction term, MaxMaxWallThickness x MaxStenosisByArea, by stroke outcome

Once again we will use the stroke data to illustrate the lasso method. To gauge the sensitivity of the method, we have added a random predictor which is a random permuataion of the MaxRemodelingRatio predictor. This predictor will act as a negative control and aid in defining the threshold for assessing the relevance of pairwise interactions. After filtering for highly correlated and near zero variance predictors, all pairwise interaction terms for the stroke data were created. The previously used five repeats of ten-fold cross-validation were applied to select the optimal \(\lambda\) value for the lasso. For these data, the value of the regularization parameter that maximized ROC was \(\lambda = 0.018\) (Figure 7.8 (a)). Using this value, the model select 16 terms. The distribution of the top interaction term, MaxRemodelingRatio x MaxStenosisByArea, by stroke outcome is displayed in Figure 7.8(b), and the top 16 terms and their corresponding regression coefficients are presented in Table 7.1. Clearly there was a shift in response between stroke outcome categories which the lasso is prioritizing. Interaction terms with lesser absolute coefficients than the random coefficient should be interpreted with caution.

Table 7.1: The top terms (predictors and pairwise interactions) selected by the lasso procedure.
Predictor 1 Predictor 2 Coefficient
MaxRemodelingRatio MaxStenosisByArea 1.333
MaxCALCAreaProp MaxMaxWallThickness 1.248
MaxMaxWallThickness MaxStenosisByArea 0.9117
CALCVol MaxDilationByArea -0.7348
LRNCVolProp MaxCALCAreaProp 0.6901
MaxMATXAreaProp MaxLRNCAreaProp -0.5586
MATXVolProp MaxLRNCArea 0.3231
MATXVolProp MaxMaxWallThickness 0.2901
MaxCALCArea MaxLRNCAreaProp -0.2121
MaxRemodelingRatio Random_Predictor -0.212
MaxCALCAreaProp MaxLRNCArea 0.1584
CALCVol MaxLRNCAreaProp -0.0595
MaxLRNCAreaProp Random_Predictor -0.0594
MATXVol MaxStenosisByArea 0.0424
CALCVol CALCVolProp <0.0001
CALCVol MATXVol <0.0001

For this data, the lasso appears to prefer primarily interaction terms in the model and gives a number of terms that are different from the simple screening approach from the previous section. Although the model is allowed to use individual predictors, it is not forced to. This means that the technique does not enforce the hierarchy principle although Bien, Taylor, and Tibshirani (2013) developed a methodology which does for this model.

7.5 Approaches when Complete Enumeration is Practically Impossible

As the number of predictors increases, the number of interaction terms to explore grows exponentially. Just a moderate number of predictors of several hundred, which could be considered small for today’s data, generates many more interactions than could be considered from a practical or statistical perspective. On the practical side, the computational burden likewise grows exponentially. With enough computing resources, it still could be possible to evaluate a vast number of interactions. Even if these computations are possible, this is not a statistically wise approach due to the chance of finding false positive results. Because the number of important interactions are usually small, the more insignificant interactions are screened, the greater penalty is incurred for evaluating these interactions. Therefore, truly significant interactions may be missed if all possible interactions are computed and appropriate adjustments are made.

When there are a large number of interaction terms to evaluate, other approaches are necessary to make the search more efficient, yet still effective, at finding important interactions. The remainder of this section will survey several approaches that have been developed to uncover interactions among predictors.

7.5.1 Guiding Principles and Two-stage Modeling

In a two-stage modeling approach, the ability of predictors to explain the variation in the response is evaluated using a model that does not directly account for interaction effects. Examples of such models are linear or logistic regression. Once the important predictors are identified, the residuals from this model are calculated. These residuals contain information that the individual predictors themselves cannot explain. The unexplained variation is due to random measurement error, and could also be due to predictor(s) that were not available or measured, or interactions among the observed predictors that were not present in the model. To be more concrete, suppose that the observed data was generated from the equation

\[ y = x_1 + x_2 + 10x_1x_2 + error\]

Suppose also that when the data from this system was collected only the response and \(x_1\) were observed. The best model we could estimate would be limited to

\[ y = \beta_1x_1 + error^*\] Since the true response depends on \(x_2\) and the interaction between \(x_1\) and \(x_2\), the error from the imperfect model (\(error^*\)) contains information about these important terms (that were absent from the model). If it was possible to access \(x_2\) and create the \(x_1 \times x_2\) interaction term, then it would be possible to separate out their contribution through a second model

\[ error^* = \beta_2x_2 + \beta_3x_1x_2 + error\] We cannot do anything to explain random measurement error–it is always inherent to some extent in data. And expert knowledge is necessary for determining if other predictors are relevant for explaining the response and should be included in the model. However, it is still possible to search for interaction terms that help to explain variation in the response. But the focus of this section is the setting where it is practically impossible to search all possible interaction terms.

When there are many predictors, the principles outlined in Section 7.2 can be used to narrow the search. The hierarchy principle would suggest to first look at pairwise interactions for assistance in explaining the residual variation. Next, the sparsity principle would indicate that if there are interactions that are active, then only a relatively few of them are important. And last, the heredity principle would search for these interactions among the predictors that were identified in the first stage. The type of heredity choosen at this point, weak or strong, will determine the number of interaction terms will be evaluated.

The regression coefficients from the lasso model on the individual predictors for the stroke data. The lasso model applied to the residuals selects the interaction between MaxRemodelingRatio and MaxStenosisByArea.

Fig. 7.9: The regression coefficients from the lasso model on the individual predictors for the stroke data. The lasso model applied to the residuals selects the interaction between MaxRemodelingRatio and MaxStenosisByArea.

To follow the logic of this process, consider modeling the stroke data with the vascuCAP predictors. In the first stage, the lasso selects 6 main effects. Figure 7.9 displays the estimated coefficients of these terms which identifies terms that are consistent with previous results. This model is then used to compute the residuals. If the response were numeric, then the residual for a sample would be computed as simply the observed value minus the predicted value. In the case of a classification outcome as is with the stroke data, the residuals must be computed differently. The Pearson residual will be used here and is defined as

\[ \frac{y_{i} - p_{i}}{\sqrt{p_{i}\left(1 - p_{i}\right)}} \]

For this data \(y_i\) is an indicator where 1 indicates that the \(i^{th}\) patient had a stroke, and 0 indicates the contrary. The model predicted probability that the \(i^{th}\) patient had a stroke is represented by \(p_i\). Since 6 individual predictors were selected, 15 interaction terms will be generated using the strong heredity principle. In the second stage, 1 interaction term was selected and was MaxRemodelingRatio x MaxStenosisByArea. While only 1 term was selected, it is consistent with the findings to this point.

7.5.2 Tree-based Methods

The methods discussed thus far can be effective at revealing interactions for data with a moderate number of predictors. However these approaches lose their practical effectiveness as the number of predictors grows. Tree-based methods like recursive partitioning or ensembles of recursive partitioning models can efficiently model data with a large number of predictors. Moreover, tree-based methods have been shown to uncover potential interactions between variables (L Breiman et al. 1984). In fact, using tree-based methods for identifying important interactions among predictors has been a popular area of theoretical and empirical research as we will see.

Let’s review the basic concepts of recursive partitioning to understand why this technique has the ability to uncover interactions among predictors. The objective of recursive partitioning is to find the optimal split point of a predictor that separates the samples into more homogeneous groups with respect to the response. Once a split is made, the same procedure is applied for the subset of samples at each of the subsequent nodes. This process then continues recursively until a minimum number of samples in a node is reached or until the process meets an optimization criterion. When data are split recursively, each subsequent node can be though of as representing a localized interaction between the previous node and the current node. In addition, the more frequent the occurrence of subsequent nodes of the same pair of predictors, the more likely that the interaction between the predictors is global interaction across the observed range of the predictors.

A recursive partitioning tree for a synergystic interaction between the two predictors.  The interaction is captured by several alternating splits between the two predictors at subsequent levels of the tree.

Fig. 7.10: A recursive partitioning tree for a synergystic interaction between the two predictors. The interaction is captured by several alternating splits between the two predictors at subsequent levels of the tree.

The simple two-predictor example of a synergistic relationship presented in Section 7.1 will be used to demonstrate how an interaction can be detected through recursive partitioning. The optimal tree for this data is presented in Figure 7.10. Tree-based models are usually thought of as pure interaction due to their prediction equation, which can be written as a set of multiplicative statements. For example, the path to terminal node 4 goes through three splits and can be written as a single rule:

node_4 <- I(x1 < 0.655) * I(x2 < 0.5) * I(x2 < 0.316) * 0.939

where 0.939 is the mean of the training set data falling into this node (I is a function that is zero or one based on the logic inside the function). The final prediction equation would be computed by adding similar terms from the other eight terminal nodes (which would all be zero except a single rule when predicting). Notice that in Figure 7.10 the terminal nodes have small groups of the data that are homogeneous and range from smaller response values (left) to larger response values (right) on each side of the first split. Achieving homogeneous terminal nodes requires information from both predictors throughout the tree.

But why would a tree need to be this complicated model the somewhat simple synergistic relationship between \(x_1\) and \(x_2\)? Recall from Figure 7.2 that the synergistic relationship induces curved contours for the response. Because the recursive partitioning model breaks the space into rectangular regions, the model needs more regions to adequately explain the response. In essence, the partitioning aspect of trees impedes their ability to represent smooth, global interactions.

To see this, let’s look at Figure 7.11. In this figure, the predictions from each terminal node are represented by rectangular regions where the color indicates the predicted response value for the region. Overlayed on this graph are the contour lines generated from the linear model where the interaction term has been estimated. The tree-based model is doing well at approximating the synergistic response. But, because the model breaks the space into rectangular regions, it requires many regions to explain the relationship.

An illustration of the predicted response from the recursive partitioning model between two predictors that have a synergistic relationship.  The regions are able to approximate the relationship with the response in a pixelated way.

Fig. 7.11: An illustration of the predicted response from the recursive partitioning model between two predictors that have a synergistic relationship. The regions are able to approximate the relationship with the response in a pixelated way.

Figure 7.11 illustrates that a tree pixelates the relationship between the predictors and the response. It is well-known that the coarse nature of the predictions from a tree generally makes this model less accurate than other models. In addition, trees have some potential well-known liabilities. As one example, an individual tree can be highly variable (i.e. the model can substantially change with small changes to the data). Ensembles of trees such as bagging (Breiman 2001) and boosting (Friedman 2001) have been shown to mitigate these problems and improve the predictivity of the response.

A brief explanation of these techniques is provided here to aid in the understanding of these techniques. An in-depth explanation of these techniques can be found in Kuhn and Johnson (2013). To build a bagged tree-model, many samples of the original data are independently generated with replacement. For each sample, a tree is built. To predict a new sample, the model averages the predicted response across all of the trees. Boosting also utilizes a sequence of trees, but creates these in a fundamentally different way. Instead of building trees to maximum depth, a boosted model restricts the tree depth. In addition, a boosted tree model utilizes the predictive performance of the previous tree on the samples to reweigh poorly predicted samples. Third, the contribution of each tree in the boosted model is weighted using statistics related to the individual model fit. The prediction for a new sample is a weighted combination across the entire set of trees.

Recent research has shown that these ensemble methods are able to identify important interactions among the predictors. García-Magariños et al. (2009) and Qi (2012) illustrate how random forest, which is a further modification of bagging, can be used to find interactions in biological data. While Basu et al. (2018) uses a variation of the random forest algorithm to identify high-order interactions. For boosting, Lampa et al. (2014) and Elith, Leathwick, and Hastie (2008) found that this methodology could also be used to uncover important interactions, Using a theoretical perspective, Friedman and Popescu (2008) proposed a new statistics for assessing the importance of an interaction in a tree-based model.

The reason why tree ensembles are effective at identifying predictor-response relationships is because they are aggregating many trees from slightly different versions of the original data. The aggregation allows for many more possible response values and better approximating the predictor-response relationships. To see this we will build a random forest model and a gradient boosting machine model on the two-predictor synergistic example.

An illustration of the predicted response from boosted and bagged tree models when the predictors have a synergistic relationship.  These ensemble models can better approximate the relationship between the predictors an the response compared to a single tree.

Fig. 7.12: An illustration of the predicted response from boosted and bagged tree models when the predictors have a synergistic relationship. These ensemble models can better approximate the relationship between the predictors an the response compared to a single tree.

Figure 7.12 shows that ensembles of trees are better able to approximate the synergistic relationship than an individual tree. In fact, the cross-validated root mean squared error (RMSE) for boosting is 0.27 and for bagged trees is 0.61. The optimal single tree model worse with a cross-validated RMSE of 0.91. In this example, smoother models win; the linear regression interaction model’s RMSE 0.05.

Notice that an individual tree or ensemble of trees’ are less effective and are much more complicated models for uncovering predictive interactions that are that are globally true across all of the samples. This simple example highlights the main premise of feature engineering. If meaningful predictor relationships can be identified and placee in a more effective form, then a simpler and perhaps more appropriate model to estimate the relationship between the predictors and the response can be used. However, there is a scenario where tree-based methods may have better predictive ability. When important interactions occur within one or more subsets of the samples (i.e., locally), then tree-based methods have the right architecture to model the interactions. Creating features for localized interactions to be included in simpler models is much more tedious. This process would require the creation of a term that is a product of the identifier of the local space (i.e. a dummy predictor) and the predictors that interact within that space, which is essentially an interaction among three terms.

Since trees and ensembles of trees can approximate interactions among predictors, can we harness information from the resulting models to point us to terms that should be included in a simpler linear model? The approach presented by Basu et al. (2018) uses a form of a random forest model (called feature weighted random forests) that randomly selects features based on weights that are determined by the features’ importance. After the ensemble is created, a metric for each co-occurring set of features is calculated which can then be used to identify the top interacting features.

Friedman and Popescu (2008) present an approach based on the theoretical concept of partial dependence (Friedman 2001) for identifying interactions using ensembles. In short, this approach compares the joint effect of two (or more) predictors with the individual effect of each predictor in a model. If the individual predictor does not interact with any of the other predictors, the difference between the joint effect and the individual effect will be close to zero. However, if there is an interaction between an individual predictor and at least one of the predictors, the difference will be greater than zero. This comparison is numerically summarized in what they term the \(H\) statistic. For the 20 predictors in the stroke data (including the random predictor), the \(H\) statistics were computed on the entire training set. One issue is that the \(H\) statistic, across predictors, can have a fairly bimodal distribution since many of the values be zero (when a predictor isn’t involved in any interactions). This data set is fairly small and, for these reasons, 100 bootstrap estimates were computed and summarized using the median \(H\) statistic. Both estimates are shown in Figure 7.13. The original statistics and their resampled counterparts agree fairly well with one exception. Variable MaxLRNCArea has negligible resampled value while the estimate on the entire data set is very strong. A cutoff of \(H >= 0.005\) was used, on both estimates, to select variables. This would lead us to believe that 8 predictors are involved in at least one interaction. Following the previous two-stage approach, all 28 two-way interactions of these predictors where added to the main effects in a LASSO model (mimicking the previous analysis). Using resampling, the optimal tuning parameter value was \(\lambda = 0.04\) was found. From this model, the number of two way interactions was reduced from 28 to 5: MaxMaxWallThickness x MaxStenosisByArea, MaxMaxWallThickness x MaxCALCAreaProp, MaxRemodelingRatio x MaxStenosisByArea, MaxLRNCArea x MaxCALCAreaProp, and MaxStenosisByArea x MaxCALCAreaProp. Here, again, the random predictor can be used to gauge the threshold for important interactions. In this case, the random predictor lies below the significance cutoff of \(H >= 0.005\).

A visualization of the H statistic estimates for the stroke data.

Fig. 7.13: A visualization of the H statistic estimates for the stroke data.

A third and practical way to identify interactions use predictor importance information along with principles of hierarchy, sparsity, and heredity. Simply put, if an interaction among predictors is important, then a tree-based method will use these predictors multiple times throughout multiple trees to uncover the relationship with the response. These predictors will then have non-zero, and perhaps high, importance values. Pairwise interactions among the most important predictors can then be created and evaluated for their relevance in predicting the response. This is very similar to the two-stage modeling approach, except that the base learner is a tree-based model.

Top 10 variable importance scores for a random forest model applied to the stroke data.

Fig. 7.14: Top 10 variable importance scores for a random forest model applied to the stroke data.

Let’s revisit the stroke data, estimate the optimal tuning parameter for a random forest model, and compute the predictor importance values. Figure 7.14 displays the top 10 predictors by importance, most of which have all been identified as part of multiple pairwise interactions from the other methods. Of particular importance are MaxMaxWallThickness and MaxStenosisByArea which are most important predictors and are a scientifically meaningful interaction. This approach is similar in spirit to the two-stage approach discussed above. The primary difference is that random forest model is simply used for identifying predictors with the potential for being part of important interactions. Interactions among the top predictors identified here would then be included in a simpler model, like logistic regression, to evaluate their importance.

7.5.3 The Feasible Solution Algorithm

As mentioned above, the problem of identifying important interactions can be thought of as identifying the optimal model that includes main effects and interactions. A complete enumeration and evaluation of all of the possible models is practically impossible for even a moderate number of predictors since the number of models is \(p!\). When using a linear model such as linear or logistic regression, methods such as forward, backward, and stepwise selection are commonly used to locate an optimal set of predictors (Neter et al. 1996). These techniques are as straightforward to understand as their names imply. The forward selection technique starts with no predictors and identifies the predictor that is most optimally related to the response. In the next step, the next predictor is selected such that it in combination with the first predictor are most optimally related to the response. This process continues until none of the remaining predictors improve the model based on the optimization criteria. The backward selection technique starts with all of the predictors and sequentially removes the predictor that has the least contribution to the optimization criteria. This process continues until the removal of a predictor significantly reduces the model performance. The stepwise procedure adds or removes one predictor at a time at each step based on the optimization criteria until the addition or removal of any of the predictors does not significantly improve the model63.

Miller (1984) provided an alternative search technique to the stepwise method for the linear model problem. Instead of removing a predictor, he recommended a substitution approach in which each selected predictor is systematically chosen for substitution. For example, suppose that we have 10 predictors, \(x_1\)\(x_{10}\), and we want to find the best model that contains three predictors. Miller’s process begins by randomly selecting three of the ten predictors, say \(x_2\), \(x_5\) and \(x_9\). Model performance is calculated for this trio of predictors. Then the first two predictors are fixed and model performance of these two predictors is computed with the remaining seven other predictors. If none of the other seven predictors give better performance than \(x_9\), then this predictor is kept. If \(x_9\) is not optimal, then it is exchanged with the best predictor. The process then fixes the first and third predictors and uses the same steps to select the optimal second predictor. Then, the second and third predictor are fixed and a search is initiated for the optimal first predictor. This process continues to iterate until it converges on an optimal solution. Because the algorithm is not guaranteed to find the global optimal solution, different random starting sets of predictors should be selected. For each different starting set, the converged solution and the frequency of each solution is determined. Hawkins (1994) extended this algorithm to a robust modeling approach and called the procedure the Feasible Solution Algorithm (FSA).

A primary advantage of this technique is that the search space of the algorithm, which is on the order of \(q \times m \times p\) where \(q\) is the number of random starts, \(m\) is the number of terms in the subset, and \(p\) is the number of predictors, is much smaller than the entire search space (\(p^m\)). The FSA algorithm has been shown to converge to an optimal solution, although it is not guaranteed to be the global optimum.

The FSA algorithm focuses on finding an optimal subset of main effects. However, the original implementation of FSA does not consider searching for interactions among predictors. This can be a considerable drawback to using the algorithm since important interactions do occur among predictors for many problems. Lambert et al. (2018) recognized this deficiency and has provided a generalization to the algorithm to search for interactions. Lambert’s approach begins with a base model, which typically includes predictors thought to be important for predicting the response. The order of the desired interaction is chosen. The process of identifying potentially relevant interactions follows the same type of logic as the original FSA algorithm. To illustrate the process, let’s return to the example from above. Suppose that the base model included three predictors (\(x_1\), \(x_5\), and \(x_9\)) and the goal is to identify important pairwise interactions. Lambert’s algorithm randomly selects two predictors, say \(x_3\) and \(x_9\), and computes the performance of the model that includes the base terms and this interaction term. The algorithm then fixes the first predictor and systematically replaces the second predictor with the remaining eight (or nine if we want to study quadratic effects) predictors and computes the corresponding model performance. If none of the other predictors give better predictive performance, then \(x_9\) is kept. But if \(x_9\) is not optimal, then it is exchanged with the best predictor. The process then fixes the second predictor and exchanges the first predictor with eight (or nine) of the other predictors. This process continues to iterate until convergence. Then a different randomly pair of predictors is selected and the process is applied. The converged interactions are tallied across the random starts to identify the potential feasible solutions. The algorithm can be easily generalized to any order of interaction.

Lambert has developed methodology to implement the algorithm, which will now be demonstrated using the ischemic stroke data. From Chapter 2, recursive feature elimination for a logistic regression model selected MaxMaxWallThickness, MATXVol, MaxRemodelingRatio, and MaxStenosisByArea as either individual predictors or as part of pairwise interactions. These terms will be included in the base model and pairwise interactions will be assessed across 100 random starts that help to improve the model. In addition, the FSA algorithm is executed within resampling (using the same repeats of 10-fold cross-validation used previously)

The results are presented in Table 7.2. The FSA algorithm identified MaxRemodelingRatio x MaxStenosisByArea most frequently across the 100 random starts. In addition, the other most frequently identified interactions are CALCVolProp x MaxLRNCArea, LRNCVolProp x Random_Predictor, and MaxCALCAreaProp x MaxMaxWallThickness, which would be the primary terms to explore for including in a model.

Table 7.2: Top interaction pairs selected by the Feasible Solution Algorithm for the Stroke Data. Selections represents the number of times the interaction was selected across 100 random starts of the algorithm across all of the resamples.
Predictor 1 Predictor 2 Frequency
MaxLRNCAreaProp MaxRemodelingRatio 10
MaxLRNCAreaProp MaxMATXAreaProp 8
MATXVolProp MaxLRNCAreaProp 5
MaxLRNCAreaProp Random_Predictor 5
MaxCALCAreaProp MaxMaxWallThickness 4
MATXVolProp MaxCALCArea 3
MATXVolProp MaxCALCAreaProp 3
MaxCALCAreaProp MaxRemodelingRatio 3
MaxLRNCAreaProp MaxMATXArea 3
MaxRemodelingRatio MaxStenosisByArea 3

7.6 A Summary of the Stoke Interaction Effects

A comparison of the interaction terms for the stroke data across several methods.

Fig. 7.15: A comparison of the interaction terms for the stroke data across several methods.

The methods applied to the stroke data have identified some similar and different terms for follow-up. summary of the selected interactions by each method is provided in Figure 7.15. The importance of the terms have been scaled to a range between zero and one within each method to visually compare the strength of the term across methods. For example, the FSA selections are scaled based on the frequency of selection, with the most frequent term receiving a value of 1. On the other hand, all terms in ther random forest selection have the same value since this is just an indicator of which terms to subsequently pursue. Some terms, like MaxRemodlingRatio \(\times\) MaxStenosisByArea are selected by multiple methods, and likely contain useful information for the model. A number of the other terms are selected by only one method. In total, there are 120 possible pairwise interactions among the uncorrelated vascuCAP predictors. The union of the methods discussed here focus on 43 of these interactions, where the two-stage approach selects the least and random forest and FSA select the most.

7.7 Other Potentially Useful Tools

There are a few other models that have the ability to uncover predictive interactions that are worthy of mentioning here. Multivariate adaptive regression splines (MARS) is a nonlinear modeling technique for a continuous response that searches through individual predictors and individual values of each predictor to find the predictor and sample value for creating a hinge function that describes the relationship between the predictor and response (Friedman 1991). The hinge function was described previously in Section 6.2.1 and enables the model to search for nonlinear relationships. In addition to being able to search through individual predictors, MARS can be used to search for products of predictors in order to make nonlinear interactions that isolate portions of the predictor space. The MARS technique also has built-in feature selection. The primary disadvantage of this approach is that it is computationally taxing. MARS has also been extended to classification outcomes, and this method is called flexible discriminant analysis (FDA).

Cubist (Kuhn and Johnson 2013) is a rule-based regression model that builds an initial tree and decomposes it into set of rules that are pruned and perhaps eliminated. For each rule, a separate linear model is defined. One example of a rule for the Ames housing data might be:


  • The year built is less than 1952 and
  • the general living area is less than 1692 \(ft^2\) and
  • the overall condition is either: Very Poor, Poor, Fair, Below Average, or Average


\[log_{10} (\text{Sale Price}) = 4.052 + 0.00045 \text{ (general living area)} + 0.00015 \text{ (year built)}\]

This structure creates a set of disjoint interactions; the set of rules may not cover every combination of values for the predictors that are used in the rules. New data points being predicted may belong to multiple rules and, in this case, the associated linear model predictions are averaged. This model structure is very flexible and has a strong chance of finding local interactions within the entire predictor space. For example, when a property has a small total living area but has a large number of bedrooms, there is usually a negative slope in regression models associated with the number of bedrooms. This is certainly not true in general but only applied to properties that all “all bedroom.”


Elith, J, J Leathwick, and T Hastie. 2008. “A Working Guide to Boosted Regression Trees.” Journal of Animal Ecology 77 (4). Wiley Online Library:802–13.

García-Magariños, M, I López-de-Ullibarri, R Cao, and A Salas. 2009. “Evaluating the Ability of Tree-Based Methods and Logistic Regression for the Detection of Snp-Snp Interaction.” Annals of Human Genetics 73 (3). Wiley Online Library:360–69.

Lampa, E, L Lind, P Lind, and A Bornefalk-Hermansson. 2014. “The Identification of Complex Interactions in Epidemiology and Toxicology: A Simulation Study of Boosted Regression Trees.” Environmental Health 13 (1). BioMed Central:57.

Chen, SH, J Sun, L Dimitrov, A Turner, T Adams, D Meyers, BL Chang, et al. 2008. “A Support Vector Machine Approach for Detecting Gene-Gene Interaction.” Genetic Epidemiology 32 (2). Wiley Online Library:152–67.

Neter, J, M Kutner, C Nachtsheim, and W Wasserman. 1996. Applied Linear Statistical Models. Vol. 4. Irwin Chicago.

Wu, CF Jeff, and Michael S Hamada. 2011. Experiments: Planning, Analysis, and Optimization. John Wiley & Sons.

Bairey, E, E Kelsic, and R Kishony. 2016. “High-Order Species Interactions Shape Ecosystem Diversity.” Nature Communications 7:12285.

Dickhaus, T. 2014. “Simultaneous Statistical Inference.” AMC 10. Springer:12.

Friedman, J. 2001. “Greedy Function Approximation: A Gradient Boosting Machine.” Annals of Statistics, 1189–1232.

Shaffer, J. 1995. “Multiple Hypothesis Testing.” Annual Review of Psychology 46 (1):561–84.

Benjamini, Y, and Y Hochberg. 1995. “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” Journal of the Royal Statistical Society. Series B (Methodological). JSTOR, 289–300.

Hoerl, R, Aand Kennard. 1970. “Ridge Regression: Biased Estimation for Nonorthogonal Problems.” Technometrics 12 (1):55–67.

Tibshirani, R. 1996. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society. Series B (Methodological), 267–88.

Friedman, J, T Hastie, and R Tibshirani. 2010. “Regularization Paths for Generalized Linear Models via Coordinate Descent.” Journal of Statistical Software 33 (1):1.

Bien, J, J Taylor, and R Tibshirani. 2013. “A Lasso for Hierarchical Interactions.” Annals of Statistics 41 (3):1111.

Breiman, L, J Friedman, C Stone, and R Olshen. 1984. Classification and Regression Trees. CRC press.

Breiman, L. 2001. “Random Forests.” Machine Learning 45 (1). Springer:5–32.

Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Vol. 26. Springer.

Qi, Y. 2012. “Random Forest for Bioinformatics.” In Ensemble Machine Learning, 307–23. Springer.

Basu, S, K Kumbier, J Brown, and B Yu. 2018. “Iterative Random Forests to Discover Predictive and Stable High-Order Interactions.” Proceedings of the National Academy of Sciences 115 (8):1943–8.

Friedman, J, and B Popescu. 2008. “Predictive Learning via Rule Ensembles.” The Annals of Applied Statistics 2 (3):916–54.

Miller, A. 1984. “Selection of Subsets of Regression Variables.” Journal of the Royal Statistical Society. Series A (General), 389–425.

Hawkins, D. 1994. “The Feasible Solution Algorithm for Least Trimmed Squares Regression.” Computational Statistics & Data Analysis 17 (2):185–96.

Lambert, J, L Gong, CF Elliot, K Thompson, and A Stromberg. 2018. “An R Package for Finding Best Subsets and Interactions.” The R Journal.

Friedman, J. 1991. “Multivariate Adaptive Regression Splines.” The Annals of Statistics 19 (1):1–141.

  1. For this reason, the interaction shown in Figure 7.3 was labeled as “atypical.”

  2. Although, as noted in Section 3.7, there are Bayesian methods that can be used to compare models without generating p-values.

  3. This is a feature selection algorithm similar to those discussed in Chapter 11.