2 Illustrative Example: Predicting Risk of Ischemic Stroke
As a primer to feature engineering, an abbreviated example is presented with a modeling process similar to the one shown in Figure 1.4. For the purpose of illustration, this example will focus on exploration, analysis fit, and feature engineering, through the lens of a single model (logistic regression).
To illustrate the value of feature engineering for enhancing model performance, consider the application of trying to better predict patient risk for ischemic stroke (in press). Historically, the degree arterial stenosis (blockage) has been used to identify patients who are at risk for stroke (Lian et al. 2012). To reduce the risk of stroke, patients with sufficient blockage (> 70%) are generally recommended for surgical intervention to remove the blockage (Levinson and Rodriguez 1998). However, historical evidence suggests that the degree of blockage alone is actually a poor predictor of future stroke (Meier et al. 2010). This is likely due to the theory that while blockages may be of the same size, the composition of the plaque blockage is also relevant to the risk of stroke outcome. Plaques that are large, yet stable and unlikely to be disrupted may pose less stroke risk than plaques that are smaller, yet less stable.
To study this hypothesis, a historical set of patients with a range of carotid artery blockages were selected. The data consisted of 126 patients, 44 of which had blockages greater than 70%. All patients had undergone Computed Tomography Angiography (CTA) to generate a detailed three-dimensional visualization and characterization of the blockage. These images were then analyzed by Elucid Bioimaging’s vascuCAP (TM) software which generates anatomic structure estimates such as percent stenosis, arterial wall thickness, and tissue characteristics such as lipid-rich necrotic core and calcification. As an example, consider Figure 2.1 (a) which represents a carotid artery with severe stenosis as represented by the tiny tube-like opening running through the middle of the artery. Using the image, the software can calculate the maximal cross-sectional stenosis by area (MaxStenosisByArea) and by diameter (MaxStenosisByDiameter). In addition, The grey area in this figure represents macromolecules (such as collagen, elastin, glycoproteins, and proteoglycans) that provide structural support to the arterial wall. This structural region can be quantified by its area (MATXArea). Figure 2.1 (b) illustrates an artery with severe stenosis and calcified plaque (green) and lipid-rich necrotic core (yellow). Both plaque and lipid-rich necrotic core are thought to contribute to stroke risk, and these regions can be quantified by their volume (CALCVol and LRNCVol) and maximal cross-sectional area (MaxCALCArea and MaxLRNCArea). The artery presented in 2.1 (c) displays severe stenosis and outward arterial wall growth. The top arrow depicts the cross-section of greatest stenosis (MaxStenosisByDiameter) and the bottom arrow depicts the cross-section of greatest positive wall remodeling (MaxRemodelingRatio). Remodeling ratio is a measure of the arterial wall where ratios less than 1 indicate a wall shrinkage and ratios greater than 1 indicate wall growth. This metric is likely important because coronary arteries with large ratios like the one displayed here have been associated with rupture (Cilla et al. 2013, Abdeldayem et al. (2015)). The MaxRemodelingRatio metric captures the region of maximum ratio (between the two white arrows) in the three dimensional artery image. A number of other imaging predictors are generated based on physiologically meaningful characterizations of the artery.
The group of patients in this study also had follow-up information on whether or not a stroke occurred at a subsequent point in time. The association between blockage categorization and stroke outcome is provided in Table 2.1. For these patients, the association is not statistically significant based on a chi-squared test of association (p = 0.46), indicating that blockage categorization alone is likely not a good predictor of stroke outcome.
|Blockage < 70%||43||39|
|Blockage > 70%||19||25|
If plaque characteristics are indeed important for assessing stroke risk, then measurements of the plaque characteristics provided by the imaging software could help to improve stroke prediction. By improving the ability to predict stroke, physicians may have more actionable data to make better patient management or clinical intervention decisions. Specifically, patients who have large (blockages > 70%), but stable plaques may not need surgical intervention, saving themselves from an invasive procedure while continuing on medical therapy. Alternatively, patients with smaller, but less stable plaques may indeed need surgical intervention or more aggressive medical therapy to reduce stroke risk.
The data for each patient also included commonly collected clinical characteristics for risk of stroke such as whether or not the patient had atrial fibrillation, coronary artery disease, and a history of smoking. Demographics of gender and age were included as well. These readily available risk factors can be thought of as another potentially useful predictor set that can be evaluated. In fact, this set of predictors should be evaluated first to assess their ability to predict stroke since these predictors are easy to collect, are acquired at patient presentation, and do not require an expensive imaging technique.
To assess each set’s predictive ability, we will train models using the risk predictors alone, and the risk set together with the imaging predictors and interactions between imaging predictors. We will also explore other representations of these features to extract beneficial predictive information.
Before building these models, we will split the data into one set that will be used to develop models, preprocess the predictors, and explore relationships among the predictors and the response (the training set) and another that will be the final arbiter of the predictor set/model combination performance (the test set). To partition the data, the splitting of the orignal data set will be done in a stratified manner by making random splits in each of the outcome classes. This will keep the proportion of stroke patients approximately the same (Table 2.2). In the splitting, 70% of the data were allocated to the training set.
|Data Set||Stroke = Yes (n)||Stroke = No (n)|
|Train||51% (45)||49% (44)|
|Test||51% (19)||49% (18)|
One of the first steps of the modeling process is to understand important predictor characteristics such as their individual distributions, the degree of missingness within each predictor, potentially unusual values within predictors, relationships between predictors, and the relationship between each predictor and the response. Undoubtedly, as the number of predictors increases, our ability to carefully curate each individual predictor rapidly declines. But automated tools and visualizations are available that implement good practices for working through the initial exploration process such as Kuhn (2008) and Wickham and Grolemund (2016).
For these data, there were only 4 missing values across all subjects and predictors. Many models cannot tolerate any missing values. Therefore we must take action to eliminate missingness to build a variety of models. Imputation techniques replace the missing values with a rational value, and these techniques are discussed in Chapter 9. Here we will replace each missing value with the median value of the predictor, which is a simple, unbiased approach and is adequate for a relatively small amount of missingness.
This data set is small enough to manually explore, and the univariate exploration of the imaging predictors uncovered many interesting characteristics. First, imaging predictors were mean centered and scaled to unit variance to enable direct visual comparisons. Second, many of the imaging predictors had distributions with long tails, also known as positively skewed distributions. As an example, consider the distribution of the maximal cross-sectional area (\(mm^2\)) of lipid-rich necrotic core (MaxLRNCArea, displayed in Figure 2.2a). MaxLRNCArea is a measurement of the mixture of lipid pools and necrotic cellular debris for a cross-section of the stenosis. Initially, we may believe that the skewness and a couple of unusually high measurements are due to a small subset of patients. We may be tempted to remove these unusual values out of fear that these will negatively impact a model’s ability to identify predictive signal. While our intuition is correct for many models, skewness as illustrated here is often due to the underlying distribution of the data. The distribution, instead, is where we should focus our attention. A simple log-transformation, or more complex Box-Cox or Yeo-Johnson transformation, can be used to place the data on a scale where the distribution is approximately symmetric, thus removing the appearance of outliers in the data (Figure 2.2b). This kind of transformation makes sense for measurements that increase exponentially. Here, the lipid area naturally grows multiplicatively by definition of how areas is calculated.
Next, we will remove predictors that are highly correlated (\(r^2\) > 0.9) with other predictors. The correlation among the imaging predictors can be visually seen in the heatmap in Figure 2.3 where the order of the columns and rows are determined by a clustering algorithm. Here, there are three pairs of predictors that show unacceptably high correlations:
- vessel wall volume in \(mm^3\) (WallVol) and matrix volume (MATXVol),
- maximum cross-sectional wall area in \(mm^2\) (MaxWallArea) and maximum matrix area (MatMATXArea)
- maximum cross-sectional stenosis based on area (MaxStenosisByArea) and maximum cross-sectional stenosis based on diameter (MaxStenosisByDiameter).
These three pairs are highlighted in red boxes along the diagonal of the coorleation matrix in Figure 2.3. It is easy to understand why this third pair has high correlation, since the calculation for area is a function of diameter. Hence, we only need one of these representations for modeling. While only 3 predictors cross the high correlation threshold, there are several other pockets of predictors that approach the threshold. For example calcified volume (CALCVol) and maximum cross-sectional calcified area in \(mm^2\) (MaxCALCArea) (r = 0.88), and maximal cross-sectional area of lipid-rich necrotic core (MaxLRNCArea) and volume of lipid-rich necrotic core (LRNCVol) (\(r = 0.81\)) have moderately strong positive correlations but do not cross the threshold. These can be seen in a large block of blue points along the diagonal. The correlation threshold is arbitrary and may need to be raised or lowered depending on the problem and the models to be used. Chapter 3 contains more details on this approach.
The next step is to explore potential predictive relationships between individual predictors and the response and between pairs of predictors and the response.
However, before proceeding to examine the data, a mechanism is needed to make sure that any trends seen in this small data set are not over interpreted. For this reason, a resampling technique will be used. A resampling scheme described in Section 3.4 called repeated 10-fold cross-validation is used. Here, 50 variations of the training set are created and, whenever an analysis is conducted on the training set data, it will be evaluated on each of these resamples. While not infallible, it does offer some protection against overfitting.
For example, one initial question is “which of the predictors have simple associations with the outcome?”. Traditionally, a statistical hypothesis test would be generated on the entire data set and the predictors that show statistical significance would then be slated for inclusion in a model. This problem with this approach is that is it uses all of the data to make this determination and these same data points will be used in the eventual model. The potential for overfitting is strong because of the wholesale reuse of the entire data set for two purposes. Additionally, if predictive performance is the focus, the results of a formal hypothesis test may not be aligned with predictivity.
As an alternative, when we want to compare two models (\(M_1\) and \(M_2\)), the following procedure, discussed more in Section 3.7, will be used:
In this way, the potential improvements in the second model are evaluated over many versions of the training set and the improvements are quantified in a relevant metric.
To illustrate this algorithm, two logistic regression models were considered. The simple model, analogous to the statistical “null model” contains only an intercept term while the model complex model has a single term for an individual predictor from the risk set. These results are presented in Table 2.3 and orders the risk predictors from most significant to least significant in terms of improvement in ROC. For the training data, several predictors provide marginal, but significant improvement in predicting stroke outcome as measured by the improvement in ROC. Based on these results, our intuition would lead us to believe that the significant risk set predictors will likely be integral to the final predictive model.
|Coronary Artery Disease||0.079||0.0006|
Similarly, relationships between the continuous, imaging predictors and stroke outcome can be explored. As with the risk predictors, the predictive performance of the intercept only logistic regression model is compared to the model with each of the imaging predictors. Figure 2.4 presents the scatter of each predictor by stroke outcome with the p-value (top center) of the test to compare the improvement in ROC between the null model and model with each predictor. For this data, the thickest wall across all cross-sections of the target (MaxMaxWallThickness), and the maximum cross-sectional wall remodeling ratio (MaxRemodelingRatio) have the strongest associations with stroke outcome. Let’s consider the results for for MaxRemodelingRatio which indicates that there is a significant shift in the average value between the stroke categories. The scatter of the distributions of this predictor between stroke categories still has considerable overlap. To get a sense of how well MaxRemodelingRatio separates patients by stroke category, the ROC curve for this predictor for the training set data is created (Figure 2.5). The curve indicates that there is some signal for this predictor, but the signal may not be sufficient for using as a prognostic tool.
At this point, one could move straight to tuning predictive models on the preprocessed and filtered the risk and/or the imaging predictors for the training set to see how well the predictors together identify stroke outcome. This often is the next step that practitioners take to understand and quantify model performance. However, there are more exploratory steps can be taken to identify other relevant and useful constructions of predictors that improve a model’s ability to predict. In this case, the stroke data in its original form does not contain direct representations of interactions between predictors. Pairwise interactions between predictors are prime candidates for exploration and may contain valuable predictive relationships with the response.
For data that has a small number of predictors, pairwise interactions can be created. For numeric predictors, the interactions are simply generated by multiplying the values of each predictor. These new terms can be added to the data and passed in as predictors to a model. This approach can be practically challenging as the number of predictors increases, and other approaches to sift through the vast number of possible pairwise interactions might be required (see Chapter 7 for details). For this example, an interaction term was created for every possible pair of original imaging predictors (171 potentially new terms). For each interaction term, the same resampling algoroithm was used to quantify the cross-validated ROC from a model with only the main effects and a model with the main effects and the interaction term. The improvement in ROC as well as a p-value of the interaction model versus the main effects model was calculated. Figure 2.6 displays the relationship between the improvement in ROC due to the interaction term (on the x-axis) and the log10 p-value of the improvement (y-axis where larger is more significant). Clicking on a point will show the interaction term. The size of the points symbolizes the baseline area under the ROC curve from the main effects models. Points that have relatively small symbols indicate whether the improvement was to a model that had already showed good performance.
Of the 171 interaction terms, the filtering process indicated that 16 provided an improvement over the main effects alone. One of these interaction terms is between MaxRemodelingRatio and MaxStenosisByArea. Figure 2.7 illustrates the relationship between these two predictors. Panel (a) is a scatter plot of the two predictors, where the training set data is colored by stroke outcome. The contour lines represent equivalent product values between the two predictors which helps to highlight the characteristic that patients that did not have a stroke outcome generally have lower product values of these two predictors. Alternatively patients with a stroke outcome generally have higher product outcomes. Practically, this means that a significant blockage combined with outward wall growth in the vessel increases risk for stroke. The boxplot of this interaction in (b) demonstrates that the separation between stroke outcome categories is stronger than for either predictor alone.
At this point there are at least five progressive combinations of predictor sets that could be explored for their predictive ability: original risk set alone, imaging predictors alone, risk and imaging predictors together, imaging predictors and interactions of imaging predictors, and risk, imaging predictors, and interactions of imaging predictors. We can begin to visually understand the predictive ability of these multi-dimensional sets to separate patients by stroke outcome using dimension reduction techniques. Partial least squares discriminant analysis (PLSDA) reduces dimension of the predictor space by simultaneously trying to account for the most predictor set variation and trying to separate the samples into the response categories. This technique is particularly effective when groups are linearly separated in high dimensional space and the data is contained in a lower-dimensional space. The dimensionality reduction of PLS enables us to quickly see a two-dimensional view of the separation between groups that may be occurring in the higher dimensional space. Ideally, if separation is occurring in higher dimensional space, this will be reflected in the two-dimensional projection. Figure 2.8 compares the first two PLS components for the risk set alone and for the imaging set plus imaging interactions. The black line represents the model discrimination boundary, and the background shading represents the model predicted classification. Samples that are of the same color as the background are correctly classified, whereas samples of opposite color of the background are incorrectly classified. This initial first-pass visualization begins to show that imaging predictors and their interactions begins to improve the separation between groups and organize the subjects into tighter groups in the projection of the model on the first two PLS dimensions.
2.4 Predictive Modeling Across Sets
Physicians have an strong preference towards logistic regression due to its inherent interpretability. However, it is well known that logistic regression is a high-bias, low-variance model which has a tendency to yield lower predictivity than other low-bias, high-variance models. The predictive performance of logistic regression is also degraded by the inclusion of correlated, non-informative predictors. In order to find the most predictive logistic regression model, the most relevant predictors should be identified to find the best subset for predicting stroke risk.
With these specific issues in mind for this data, a recursive feature elimination (RFE) routine (Chapter 11) was used to determine if less predictors would be advantageous. RFE is a simple backwards selection procedure where the largest model is used initially and, from this model, each predictor is ranked in importance. For logistic regression, there are several methods for determining importance, and we will use the simple absolute value of the regression coefficient for each model term. The RFE procedure begins to remove the least important predictors, refits the model, and evaluates performance. At each model fit, the predictors are preprocessed by an initial Yeo-Johnson transformation as well as centering and scaling.
Our previous resampling scheme was used in conjunction with the RFE process. This means that the backwards selection was performed 50 different times on 90% of the training set and the remaining 10% was used to evaluate the effect of removing the predictors. The optimal subset size is determined using these results and the final RFE execution is one the entire training set and stops at the optimal size. Again, the goal of this resampling process is to reduce the risk of overfitting in this small data set.
To understand the predictive performance of logistic regression, RFE was first applied to the original risk predictors. The RFE determined that the best model required all 8 predictors and was associated a cross-validated area under the ROC curve of 0.632. This estimate of the area under the ROC curve is based on 50 different assessments of the entire risk predictor set from the different resamples.
Based on our exploratory work, one would anticipate that the imaging predictors and interactions between pairs of the imaging predictors could likely improve model performance. To explore this possibility, RFE was also applied to the augmented predictor set. The RFE cross-validation profile across predictor sets is provided in Figure 2.10.
Table 2.4 lists the total number of predictors in each set, the number of predictors selected by RFE, and the cross-validated area under the ROC curve. The ROC curves for each set are provided in Figure 2.9. The cross-validation results indicate that there is improvement in predictive performance by including imaging predictors and interactions among imaging predictors. It is also interesting to note that the model of the risk set requires all 8 predictors while recursive feature elimination for the risk, imaging predictors and imaging predictor interactions set only requires only 4 predictors to achieve a better cross-validated area under the ROC curve. The top two selected predictors are MaxMaxWallThickness and MATXVol_MaxMaxWallThickness12. What is interesting here is that the interaction term between MaxRemodelingRatio and MaxStenosisByArea provided one of the strongest individual separation in stroke categorization and was, in fact, selected by the RFE process. Clearly screening interaction terms are a valuable asset to the model.
The resampling process indicates that the two predictor logistic regression model has an associated area under the ROC curve of 0.67. The resampling profile in Figure 2.10 shows an increasing trend in the AUC as the number of predictors is decreased which indicates that, for these data, the full logistic regression model was impaired by including non-informative and/or correlated predictors.
|Data Set||Total Predictors||Selected Predictors||ROC|
2.5 Other Considerations
The approach presented here is not the only approach that could have been taken with this data. For example, if logistic regression is the model being evaluated, the
glmnet model (Hastie, Tibshirani, and Wainwright 2015) is a model that incorporates feature selection into the logistic regression fitting process13. Also, you may be wondering why we chose to preprocess only the imaging predictors, why we did not explore interactions among risk predictors or between risk predictors and imaging predictors, why we constructed interaction terms on the original predictors and not on the preprocessed predictors, or why we did not employ a different modeling technique or feature selection routine. And you would be right to ask these questions. In fact, there may be a different preprocessing approach, a different combination of predictors, or a different modeling technique that would lead to a better predictivity. Our primary point in this short tour is to illustrate that spending a little more time (and sometimes a lot more time) investigating predictors and relationships among predictors can help to improve model predictivity. This is especially true when marginal gains in predictive performance can have significant benefits. In the chapters that follow, we will revisit this example and provide the test-set, bake-off results.
Lian, Kevin, Jeremy H White, Eric S Bartlett, Aditya Bharatha, Richard I Aviv, Allan J Fox, and Sean P Symons. 2012. “NASCET Percent Stenosis Semi-Automated Versus Manual Measurement on Cta.” The Canadian Journal of Neurological Sciences 39 (03). Cambridge Univ Press:343–46.
Levinson, Mark M, and DI Rodriguez. 1998. “Endarterectomy for Preventing Stroke in Symptomatic and Asymptomatic Carotid Stenosis. Review of Clinical Trials and Recommendations for Surgical Therapy.” In The Heart Surgery Forum, 147–68.
Meier, Pascal, Guido Knapp, Umesh Tamhane, Seemant Chaturvedi, and Hitinder S Gurm. 2010. “Short Term and Intermediate Term Comparison of Endarterectomy Versus Stenting for Carotid Artery Stenosis: Systematic Review and Meta-Analysis of Randomised Controlled Clinical Trials.” BMJ 340. British Medical Journal Publishing Group:c467.
Cilla, M, E Pena, MA Martinez, and DJ Kelly. 2013. “Comparison of the Vulnerability Risk for Positive Versus Negative Atheroma Plaque Morphology.” Journal of Biomechanics 46 (7). Elsevier:1248–54.
Abdeldayem, Emad H, Ahmed S Ibrahim, Amr M Ahmed, Eman S Genedi, and Wahid H Tantawy. 2015. “Positive Remodeling Index by Msct Coronary Angiography: A Prognostic Factor for Early Detection of Plaque Rupture and Vulnerability.” The Egyptian Journal of Radiology and Nuclear Medicine 46 (1). Elsevier:13–24.
Kuhn, Max. 2008. “The caret Package.” Journal of Statistical Software 28 (5):1–26.
Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science. O’Reilly. http://r4ds.had.co.nz.
Hastie, Trevor, Robert Tibshirani, and Martin Wainwright. 2015. Statistical Learning with Sparsity. CRC press.