7 Detecting Interaction Effects

In problems where prediction is the primary purpose, the majority of variation in the response can be explained by the cumulative effect of the important individual predictors. To this point in the book, we have focused on developing methodology for engineering categorical or numeric predictors such that the engineered versions of the predictors have better representations for uncovering and/or improving the predictive relationships with the response. For many problems, additional variation in the response can be explained by the effect of two or more predictors working in conjunction with each other. As a simple conceptual example of predictors working together, consider the effects of water and fertilizer on the yield of a field corn crop. With no water but some fertilizer, the crop of field corn will produce no yield since water is a necessary requirement for plant growth. Conversely, with a sufficient amount of water but no fertilizer, a crop of field corn will produce some yield. However, yield is highest with a sufficient amount of water and a sufficient amount of fertilizer. Hence water and fertilizer, when combined in the right amounts, produce a yield that is greater than what either would produce alone. More formally, two or more predictors are said to interact if their combined effect is different (less or greater) than what we would expect if we were to add the impact of each of their effects when considered alone. Note that interactions, by definition, are always in the context of how predictors relate to the outcome. Correlations between predictors, for example, are not directly related to whether there is an interaction effect or not. Also, from a notational standpoint, the individual variables (e.g., fertilizer and water) are referred to as the main effect terms when outside of an interaction.

The predictive importance of interactions was first discussed in Chapter 2, where the focus of the problem was trying to predict an individual’s risk of ischemic stroke through the use of predictors based on the images of their carotid arteries. In this example, an interaction between the degree of remodeling of the arterial wall and the arterial wall’s maximum thickness was identified as a promising candidate. A visualization of the relationship between these predictors was presented in Figure 2.7 (a), where the contours represent equivalent multiplicative effects between the predictors. Part (b) of the same figure illustrated how the combination of these predictors generated enhanced separation between patients who did and did not have a stroke.

In the previous example, the important interaction was between two continuous predictors. But important interactions can also occur between two categorical predictors or between a continuous and a categorical predictor; an example of the latter can be found in the Ames housing data. Figure 7.1(a) illustrates the relationship between the age of a house (x-axis) and the sale price of a house (y-axis), categorized by whether or not a house has air conditioning. In this example, the relationship between the age of a house and sale price is increasing for houses with air conditioning; on the other hand, there is no relationship between these predictors for homes without air conditioning. In contrast, part (b) of this figure displays two predictors that do not interact: home age and overall condition. Here the condition of the home does not affect the relationship between home age and sale price. The lack of interaction is revealed by the nearly parallel lines for each condition type.

Figure 7.1: Plots of Ames housing variables where the predictors interact (a) and do not interact (b).

To better understand interactions, consider a simple example with just two predictors. An interaction between two predictors can be mathematically represented as:

\[ y = \beta_0 + \beta_{1}x_1 + \beta_{2}x_2 + \beta_{3}x_{1}x_{2} + \text{error} \]

In this equation, \(\beta_0\) represents the overall average response, \(\beta_1\) and \(\beta_2\) represent the average rate of change due to \(x_1\) and \(x_2\), respectively, and \(\beta_3\) represents the incremental rate of change due to the combined effect of \(x_1\) and \(x_2\) that goes beyond what \(x_1\) and \(x_2\) can explain alone. The error term represents the random variation in real data that cannot be explained in the deterministic part of the equation. From data, the \(\beta\) parameters can be estimated using methods like linear regression for a continuous response or logistic regression for a categorical response. Once the parameters have been estimated, the usefulness of the interaction for explaining variation in the response can be determined. There are four possible cases:

If \(\beta_3\) is not significantly different from zero, then the interaction between \(x_1\) and \(x_2\) is not useful for explaining variation in the response. In this case, the relationship between \(x_1\) and \(x_2\) is called additive.
If the coefficient is meaningfully negative while \(x_1\) and \(x_2\) alone also affect the response, then the interaction is called antagonistic.
At the other extreme, if the coefficient is meaningfully positive while \(x_1\) and \(x_2\) alone also affect the response, then the interaction is called synergistic.
The final scenario occurs when the coefficient for the interaction is significantly different from zero, but one or both of \(x_1\) and \(x_2\) do not affect the response on their own. In this case, the average response of \(x_1\) across the values of \(x_2\) (or vice versa) has a rate of change that is essentially zero. However, the average value of \(x_1\) at each value of \(x_2\) (or conditioned on \(x_2\)) is different from zero. This situation occurs rarely in real data, and we need to understand the visual cues for identifying this particular case. This type of interaction will be called atypical.
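
One way to read these four cases, assuming the two-predictor equation above holds exactly, is to look at the rate of change of the response with respect to \(x_1\) while \(x_2\) is held fixed:

\[ \frac{\partial y}{\partial x_1} = \beta_1 + \beta_{3}x_2 \]

When \(\beta_3 = 0\), this slope is \(\beta_1\) regardless of \(x_2\), which is the additive case. When \(\beta_3 \neq 0\), the slope for \(x_1\) depends on the value of \(x_2\): it grows with \(x_2\) when \(\beta_3\) is positive and shrinks when \(\beta_3\) is negative. When \(\beta_1 = 0\) but \(\beta_3 \neq 0\), and \(x_2\) is centered at zero, the slope is nonzero at each individual value of \(x_2\) but averages to zero across them, which corresponds to the atypical case. By symmetry, the same reading applies to \(x_2\) with \(\partial y / \partial x_2 = \beta_2 + \beta_{3}x_1\).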

To illustrate the first three types of interactions, simulated data will be generated using the formula from above with coefficients as follows: \(\beta_0 = 0\), \(\beta_1 = \beta_2 = 1\), and \(\beta_3 = -10\), 0, or 10 to illustrate antagonism, no interaction, or synergism, respectively. A random, uniform set of 200 samples was generated for \(x_1\) and \(x_2\) between the values of 0 and 1, and the error for each sample was drawn from a normal distribution. A linear regression model was used to estimate the coefficients, and a contour plot of the predicted values from each model was generated (Figure 7.2). In this figure, the additive case has parallel contour lines with an increasing response from lower left to upper right. The synergistic case also has an increasing response from lower left to upper right, but the contours are curved, indicating that lower values of each predictor are required to elicit the same response value. The antagonistic case also displays curved contours; here the response decreases from lower left to upper right. What this figure reveals is that the response profile changes when an interaction between predictors is real and present.

Figure 7.2: Contour plots of the predicted response from a model between two predictors with a synergistic interaction, no interaction (additive), and an antagonistic interaction.
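
The simulation described above can be sketched in Python roughly as follows. This is a minimal illustration and not the code used to produce the figure: the noise standard deviation (0.5), the random seed, and the helper names `solve`, `fit_interaction_model`, and `simulate` are all assumptions, since the text does not specify them. The sketch uses only the standard library and estimates the coefficients by solving the least squares normal equations directly.

```python
import random


def solve(a, b):
    """Solve the square linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_interaction_model(x1, x2, y):
    """Least squares estimates of [b0, b1, b2, b3] in
    y = b0 + b1*x1 + b2*x2 + b3*x1*x2 + error."""
    rows = [[1.0, u, v, u * v] for u, v in zip(x1, x2)]
    p = 4
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)


def simulate(beta3, n=200, noise_sd=0.5, seed=0):
    """Draw x1, x2 uniformly on [0, 1] and generate y with
    beta0 = 0 and beta1 = beta2 = 1. noise_sd and seed are assumed values."""
    rng = random.Random(seed)
    x1 = [rng.random() for _ in range(n)]
    x2 = [rng.random() for _ in range(n)]
    y = [u + v + beta3 * u * v + rng.gauss(0.0, noise_sd)
         for u, v in zip(x1, x2)]
    return x1, x2, y


estimates = {}
for label, beta3 in [("antagonistic", -10.0),
                     ("additive", 0.0),
                     ("synergistic", 10.0)]:
    estimates[label] = fit_interaction_model(*simulate(beta3))
    print(label, [round(b, 2) for b in estimates[label]])
```

The last element of each fitted coefficient list is the estimate of \(\beta_3\). Its sign and size relative to its uncertainty are what separate the three cases; a full analysis would also compute standard errors, which this sketch omits.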

The atypical case is more clearly illustrated when \(x_1\) and \(x_2\) are categorical predictors. To simulate the atypical case, the two predictors will take values of ‘low’ and ‘high’, and the response will be generated by \(y = 2x_1x_2 + error\), where the low and high values of \(x_1\) and \(x_2\) will be numerically represented by -1 and +1, respectively. Figure 7.3 displays the relationship between the low and high levels of the factors. Notice that the average response for \(x_1\) at the low level and separately at the high level is approximately zero. However, when the average is conditioned on \(x_2\), represented by the two different colors, the responses are opposite at the two levels of \(x_1\). Here the individual effects of each predictor are not significant, but the conditional effect is strongly significant.

Figure 7.3: An example of an atypical interaction between two categorical factors. In this case each factor alone is not significant, but the interaction between the factors is significant and is characterized by the tell-tale crossing pattern of the average values within groups.
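
A small sketch of the atypical simulation may help make the crossing pattern concrete. It follows the formula \(y = 2x_1x_2 + error\) with low and high coded as -1 and +1; the number of replicates per cell (50), the unit noise standard deviation, the seed, and the variable names are assumptions chosen only for illustration.

```python
import random
from statistics import mean

rng = random.Random(1)            # assumed seed, for reproducibility
levels = {"low": -1, "high": 1}   # numeric coding described in the text

# Balanced design: 50 replicates (an assumed count) in each of the 4 cells.
samples = []
for _ in range(50):
    for name1, v1 in levels.items():
        for name2, v2 in levels.items():
            y = 2 * v1 * v2 + rng.gauss(0.0, 1.0)  # assumed noise sd of 1
            samples.append((name1, name2, y))

# Average response at each level of x1, ignoring x2 (the main effect view).
marginal = {
    a: mean(y for n1, _, y in samples if n1 == a)
    for a in levels
}

# Average response at each level of x1, conditioned on the level of x2.
conditional = {
    (a, b): mean(y for n1, n2, y in samples if n1 == a and n2 == b)
    for a in levels
    for b in levels
}

print("marginal means for x1:", marginal)
print("means by (x1, x2) cell:", conditional)
```

Under these assumptions the marginal means for \(x_1\) should sit close to zero, while the cell means should be close to +2 when the two factors share a level and close to -2 when they differ, which is the crossing pattern described in the caption.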

When we know which predictors interact, they can be included in a model. However, knowledge of which predictors interact may not be available. A number of recent empirical studies have shown that interactions can be uncovered by more complex modeling techniques. As Elith, Leathwick, and Hastie (2008) noted, tree-based models inherently model interactions between predictors through subsequent recursive splits in the tree. García-Magariños et al. (2009) showed that random forests were effective at identifying unknown interactions between single nucleotide polymorphisms; Lampa et al. (2014) found that a boosted tree model could uncover important interactions in epidemiology problems; and Chen et al. (2008) found that search techniques combined with a support vector machine were effective at identifying gene-gene interactions related to human disease. These findings prompt a question: if sophisticated predictive modeling techniques are indeed able to uncover the predictive information from interactions, then why should we spend any time or effort in pinpointing which interactions are important? The importance comes back to the underlying goal of feature engineering, which is to create features that improve the effectiveness of a model by containing predictively relevant information. By identifying and creating relevant interaction terms, the predictive ability of models that have better interpretability can be improved. Therefore, once we have worked to improve the form of individual predictors, the focus should then turn to searching for interactions among the predictors that could help explain additional variation in the response and improve the predictive ability of the model.

The focus of this chapter will be to explore how to search for and identify interactions between predictors that improve models’ predictive performance. Throughout the chapter, the Ames housing data will be used. The base model consists of a number of variables, including continuous predictors for: general living area, lot area and frontage, years built and sold, pool area, longitude, latitude, and the number of full baths. The base model also contains qualitative predictors for neighborhood, building type (e.g., townhouse, one-family home, etc.), central air, the MS sub-class (e.g., 1 story, 1945 and older, etc.), foundation type, roof style, alley type, garage type, and land contour. The base model consists of main effects only and, in this chapter, the focus will be on discovering helpful interactions of these predictors.
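
To get a sense of the size of the search, the sketch below enumerates the candidate two-way interactions among the predictors named in the previous paragraph. The identifiers are hypothetical labels invented for this illustration, not the actual column names in the Ames data, and the count covers only the predictors listed here.

```python
from itertools import combinations

# Hypothetical labels for the continuous predictors named in the text.
continuous = [
    "living_area", "lot_area", "lot_frontage", "year_built", "year_sold",
    "pool_area", "longitude", "latitude", "full_baths",
]

# Hypothetical labels for the qualitative predictors named in the text.
qualitative = [
    "neighborhood", "building_type", "central_air", "ms_subclass",
    "foundation", "roof_style", "alley", "garage_type", "land_contour",
]

predictors = continuous + qualitative

# Every unordered pair of distinct predictors is a candidate interaction.
pairs = list(combinations(predictors, 2))

print(len(predictors), "predictors give", len(pairs), "candidate pairs")
```

Note that this counts pairs of predictors, not model terms. Once a qualitative predictor is expanded into indicator variables, a single pair can contribute many interaction columns, so the number of terms to evaluate can be considerably larger than the number of pairs.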