2.4 Predictive Modeling Across Sets
At this point there are at least five progressive combinations of predictor sets that could be explored for their predictive ability: original risk set alone, imaging predictors alone, risk and imaging predictors together, imaging predictors and interactions of imaging predictors, and risk, imaging predictors, and interactions of imaging predictors. We’ll consider modeling several of these sets of data in turn.
Physicians have a strong preference for logistic regression due to its inherent interpretability. However, it is well known that logistic regression is a high-bias, low-variance model that tends to yield lower predictive performance than low-bias, high-variance models. The predictive performance of logistic regression is also degraded by the inclusion of correlated, non-informative predictors. To find the most predictive logistic regression model, the most relevant predictors should be identified so that the best subset for predicting stroke risk can be used.
With these specific issues in mind for these data, a recursive feature elimination (RFE) routine (Chapters 10 and 11) was used to determine if fewer predictors would be advantageous. RFE is a simple backwards selection procedure where the largest model is used initially and, from this model, each predictor is ranked in importance. For logistic regression, there are several methods for determining importance, and we will use the simple absolute value of the regression coefficient for each model term (after the predictors have been centered and scaled). The RFE procedure then removes the least important predictors, refits the model, and evaluates performance. At each model fit, the predictors are preprocessed by an initial Yeo-Johnson transformation as well as centering and scaling.
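The backwards elimination loop at the heart of RFE can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the routine used for the analysis: it uses only the Python standard library, fits logistic regression by plain gradient descent, removes one predictor per step, omits the Yeo-Johnson transformation and the held-out performance evaluation, and runs on simulated data. The names `standardize`, `fit_logistic`, and `rfe_order` are hypothetical.

```python
import math
import random

def standardize(X):
    """Center and scale each column of X (a list of rows)."""
    cols = list(zip(*X))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / (len(c) - 1))
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in X]

def fit_logistic(X, y, steps=500, lr=0.1):
    """Gradient-descent logistic regression; returns (intercept, coefficients)."""
    n, p = len(X), len(X[0])
    b0, b = 0.0, [0.0] * p
    for _ in range(steps):
        g0, g = 0.0, [0.0] * p
        for row, target in zip(X, y):
            z = b0 + sum(w * v for w, v in zip(b, row))
            err = 1.0 / (1.0 + math.exp(-z)) - target
            g0 += err
            for j, v in enumerate(row):
                g[j] += err * v
        b0 -= lr * g0 / n
        b = [w - lr * gj / n for w, gj in zip(b, g)]
    return b0, b

def rfe_order(X, y, names):
    """Repeatedly refit and drop the predictor with the smallest |coefficient|.

    Returns predictor names ordered from first removed to last remaining.
    """
    active = list(range(len(names)))
    removed = []
    while len(active) > 1:
        # Preprocessing is redone at every fit, on the current subset only.
        Xs = standardize([[row[j] for j in active] for row in X])
        _, coefs = fit_logistic(Xs, y)
        weakest = min(range(len(active)), key=lambda k: abs(coefs[k]))
        removed.append(names[active.pop(weakest)])
    return removed + [names[active[0]]]

# Simulated data (an assumption for illustration): one informative column, two noise columns.
rng = random.Random(1)
names = ["signal", "noise_a", "noise_b"]
X = [[rng.gauss(0, 1) for _ in names] for _ in range(120)]
y = [1 if row[0] + 0.3 * rng.gauss(0, 1) > 0 else 0 for row in X]
order = rfe_order(X, y, names)
print(order)
```

In a full RFE run, each refit would also be scored on data held out from the fit, which is what produces the performance profile across subset sizes.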
As will be discussed in later chapters, correlation between the predictors can cause instability in the logistic regression coefficients. While there are more sophisticated approaches, an additional variable filter will be used on the data to remove the minimum set of predictors such that no pairwise correlations between predictors are greater than 0.75. The data preprocessing will be conducted with and without this step to show the potential effects on the feature selection procedure.
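One way such a filter could work is sketched here. Finding a truly minimal set of predictors to remove is a hard combinatorial problem in general, so this sketch swaps in a greedy heuristic: find the most correlated pair and, while its absolute correlation exceeds the cutoff, drop whichever member has the larger mean absolute correlation with the remaining predictors. The function names and the toy columns are assumptions for illustration, and the heuristic is not guaranteed to match the filter used for the analysis.

```python
import math
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_filter(columns, cutoff=0.75):
    """Greedy filter: columns maps name -> values; returns the names kept."""
    keep = list(columns)
    corr = {(a, b): abs(pearson(columns[a], columns[b]))
            for a in keep for b in keep if a != b}
    while True:
        pairs = [(corr[a, b], a, b)
                 for i, a in enumerate(keep) for b in keep[i + 1:]]
        if not pairs:
            break
        worst, a, b = max(pairs)
        if worst <= cutoff:
            break  # no remaining pair exceeds the cutoff
        # Drop the member of the worst pair that is more correlated overall.
        mean_a = mean(corr[a, o] for o in keep if o != a)
        mean_b = mean(corr[b, o] for o in keep if o != b)
        keep.remove(a if mean_a >= mean_b else b)
    return keep

# Toy columns (made up): x1 and x2 are nearly collinear, x3 is not.
demo = {
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [1.1, 2.0, 2.9, 4.2, 5.1, 5.9],
    "x3": [5, 1, 4, 2, 6, 3],
}
kept = correlation_filter(demo)
print(kept)
```

Because the filter depends on the sample correlations, running it on different subsets of the data can retain different predictors.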
Our previous resampling scheme was used in conjunction with the RFE process. This means that the backwards selection was performed 50 different times on 90% of the training set and the remaining 10% was used to evaluate the effect of removing the predictors. The optimal subset size is determined using these results, and the final RFE execution is on the entire training set and stops at the optimal size. Again, the goal of this resampling process is to reduce the risk of overfitting in this small data set. Additionally, all of the preprocessing steps are conducted within these resampling steps, so the correlation filter may select different variables for each resample. This is deliberate and is the only way to measure the effects and variability of the preprocessing steps on the modeling process.
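The bookkeeping for this scheme can be sketched as follows. We assume here that the 50 resamples come from five repeats of 10-fold cross-validation, which is consistent with the 90%/10% split described above. The function names, the training set size of 100, and the toy performance profiles are all made up for illustration.

```python
import random
from statistics import mean

def repeated_kfold(n, folds=10, repeats=5, seed=0):
    """Yield (analysis, assessment) index lists: folds * repeats resamples."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        for f in range(folds):
            held_out = idx[f::folds]           # roughly 10% of the rows
            held = set(held_out)
            yield [i for i in idx if i not in held], held_out

def best_subset_size(profiles):
    """profiles: one {subset_size: score} dict per resample.

    Returns the size with the highest average score, plus the averages.
    """
    avg = {s: mean(p[s] for p in profiles) for s in profiles[0]}
    return max(avg, key=avg.get), avg

splits = list(repeated_kfold(100))
print(len(splits))

# Within each resample, the preprocessing, filtering, and backwards selection
# would all be rerun on the analysis rows only. Toy scores stand in for that here.
toy_profiles = [
    {4: 0.70, 8: 0.74, 12: 0.72},
    {4: 0.68, 8: 0.76, 12: 0.73},
]
best, avg = best_subset_size(toy_profiles)
print(best)
```

Only after the subset size has been chosen from the averaged profiles would the selection be run once more on the whole training set.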
The RFE procedure was applied to:
The small risk set of 8 predictors. Since this is not a large set, an interaction model with potentially all 28 pairwise interactions was also evaluated. When the correlation filter is applied, the number of model terms might be substantially reduced.
The set of 19 imaging predictors. The interaction effects derived earlier in the chapter are also considered for these data.
The entire set of predictors. The imaging interactions are also combined with these variables.
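The count of 28 interactions for the risk set follows from choosing 2 of the 8 predictors. A short sketch of how such terms could be enumerated and computed is below; the predictor names are hypothetical placeholders, and the helper functions are assumptions for illustration.

```python
from itertools import combinations

# Placeholder names: stand-ins for the 8 risk predictors.
risk_predictors = [f"risk_{i}" for i in range(1, 9)]

def pairwise_interactions(names):
    """Return every unordered pair of distinct predictor names."""
    return list(combinations(names, 2))

def add_interaction_columns(row, pairs):
    """Extend one observation (name -> value) with a product term per pair."""
    extended = dict(row)
    for a, b in pairs:
        extended[f"{a}:{b}"] = row[a] * row[b]
    return extended

pairs = pairwise_interactions(risk_predictors)
print(len(pairs))  # 8 choose 2 = 28

# One made-up observation, expanded to 8 main effects plus 28 interactions.
row = {name: float(i) for i, name in enumerate(risk_predictors, start=1)}
expanded = add_interaction_columns(row, pairs)
print(len(expanded))  # 36 model terms
```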
Figure 2.8 shows the results. For the risk set, when only main effects are considered, the full set of 8 predictors is favored. When the full set of 28 pairwise interactions was added, model performance was hurt by the extra interactions. Based on resampling, a set of 16 predictors was optimal (13 of which were interactions). When a correlation filter was applied, the main effect model was unaffected while the interaction model had at most 18 predictors. Overall, the filter did not help this predictor set.
For the imaging predictor set, the data set preferred a model with none of the previously discovered interactions and a correlation filter. This may seem counter-intuitive, but recall that the interactions were discovered in the absence of other predictors or interactions. The non-interaction terms appear to compensate for or replace the information supplied by the most important interaction terms. The best model so far is based on the filtered set of 12 imaging main effects.
When combining the two predictor sets, model performance without a correlation filter was middle-of-the-road and there was no real difference between the interaction and main effects models. Once the filter was applied, the data strongly favored the main effects model (with all 8 predictors that survived the correlation filter).
Using this training set, we estimated that the filtered predictor set of 12 imaging predictors was our best bet. The final predictor set was MaxLRNCArea, MaxLRNCAreaProp, MaxMaxWallThickness, MaxRemodelingRatio, MaxStenosisByArea, MaxCALCAreaProp, MaxMATXArea, MATXVolProp, CALCVolProp, MaxDilationByArea, MaxMATXAreaProp, and MATXVol. To understand the variation in the selection process, Table 2.4 shows how often each predictor was selected in the 12-variable models across all 50 resamples. The selection results were fairly consistent, especially for a training set this small.
| Predictor | Number of Times Selected | In Final Model? |
How well did this predictor set do on the test set? The test set area under the ROC curve was estimated to be 0.693. This is less than the resampled estimate of 0.74 but is greater than the estimated 90% lower bound on this number (0.666).
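For readers who want to see what the area under the ROC curve measures, it equals the proportion of (event, non-event) pairs in which the event receives the higher predicted score, with ties counted as one half. A minimal sketch, using made-up scores and a hypothetical function name rather than the stroke data:

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (labels are 0 or 1)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    # Each positive/negative pair scores 1 for a win and 0.5 for a tie.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy predictions (made up): 7 of the 9 pairs are ordered correctly.
scores = [0.9, 0.8, 0.35, 0.6, 0.4, 0.2]
labels = [1, 1, 1, 0, 0, 0]
auc = roc_auc(scores, labels)
print(round(auc, 3))
```

This pairwise form is quadratic in the number of samples, which is fine for a small test set but would be replaced by a rank-based calculation for larger ones.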