2.2 Preprocessing

One of the first steps of the modeling process is to understand important predictor characteristics, such as their individual distributions, the degree of missingness within each predictor, potentially unusual values within predictors, relationships between predictors, and the relationship between each predictor and the response. As the number of predictors increases, our ability to carefully curate each individual predictor rapidly declines. However, automated tools and visualizations that implement good practices for the initial exploration process are available, such as those described in Kuhn (2008) and Wickham and Grolemund (2016).

For these data, there were only 4 missing values across all subjects and predictors. Many models cannot tolerate any missing values, so missingness must be eliminated before a variety of models can be built. Imputation techniques replace the missing values with a rational value, and these techniques are discussed in Chapter 8. Here we will replace each missing value with the median value of the predictor, which is a simple, unbiased approach that is adequate for a relatively small amount of missingness, although it is nowhere near optimal.
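As a minimal sketch of the median imputation described above, the following uses NumPy on a small, made-up matrix. The function name `impute_median` and the toy values are assumptions for illustration only; they are not the data or the code used in the text.

```python
import numpy as np

def impute_median(X):
    """Return a copy of X with each NaN replaced by its column's median."""
    X = np.array(X, dtype=float)          # copy, so the input is not modified
    medians = np.nanmedian(X, axis=0)     # per-column median of observed values
    rows, cols = np.where(np.isnan(X))    # positions of the missing entries
    X[rows, cols] = medians[cols]
    return X

# Hypothetical toy data: rows are subjects, columns are predictors.
toy = np.array([
    [1.0, 100.0],
    [2.0, np.nan],
    [np.nan, 120.0],
    [10.0, 140.0],
])
imputed = impute_median(toy)
```

In a resampling or train/test setting, the medians would normally be computed on the training data only and then reused for new samples.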

This data set is small enough to manually explore, and the univariate exploration of the imaging predictors uncovered many interesting characteristics. First, imaging predictors were mean centered and scaled to unit variance to enable direct visual comparisons. Second, many of the imaging predictors had distributions with long right tails, also known as positively skewed distributions. As an example, consider the distribution of the maximal cross-sectional area (\(mm^2\)) of lipid-rich necrotic core (MaxLRNCArea, displayed in Figure 2.2a). MaxLRNCArea is a measurement of the mixture of lipid pools and necrotic cellular debris for a cross-section of the stenosis. Initially, we may believe that the skewness and a couple of unusually high measurements are due to a small subset of patients. We may be tempted to remove these unusual values out of fear that they will negatively impact a model's ability to identify predictive signal. While this intuition is correct for many models, skewness as illustrated here is often due to the underlying distribution of the data, and the distribution is where we should focus our attention. A simple log transformation, or a more complex Box-Cox or Yeo-Johnson transformation (Section 6.1), can be used to place the data on a scale where the distribution is approximately symmetric, thus removing the appearance of outliers in the data (Figure 2.2b). This kind of transformation makes sense for measurements that increase exponentially. Here, the lipid area naturally grows multiplicatively by definition of how area is calculated.
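The centering, scaling, and Yeo-Johnson steps can be sketched as follows. This is an illustration under stated assumptions: the data are simulated from a log-normal distribution rather than taken from the study, the helper names (`yeo_johnson`, `standardize`, `skewness`) are hypothetical, and the transformation parameter `lam` is fixed by hand, whereas in practice it is usually estimated from the data.

```python
import numpy as np

def yeo_johnson(y, lam):
    """Apply the Yeo-Johnson transformation with a given parameter lam."""
    y = np.asarray(y, dtype=float)
    out = np.empty_like(y)
    pos = y >= 0
    # Non-negative values: Box-Cox applied to (y + 1).
    if abs(lam) > 1e-12:
        out[pos] = ((y[pos] + 1.0) ** lam - 1.0) / lam
    else:
        out[pos] = np.log1p(y[pos])
    # Negative values: mirrored transformation with parameter (2 - lam).
    if abs(lam - 2.0) > 1e-12:
        out[~pos] = -(((1.0 - y[~pos]) ** (2.0 - lam)) - 1.0) / (2.0 - lam)
    else:
        out[~pos] = -np.log1p(-y[~pos])
    return out

def standardize(x):
    """Center to mean zero and scale to unit (sample) variance."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=1)

def skewness(x):
    """Moment-based sample skewness; values near zero suggest symmetry."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

# Simulated right-skewed "area" measurements (not the study data).
rng = np.random.default_rng(0)
area = rng.lognormal(mean=1.0, sigma=0.8, size=500)

transformed = yeo_johnson(area, lam=0.0)   # lam = 0 gives log(y + 1) for y >= 0
scaled = standardize(transformed)

print(skewness(area), skewness(transformed))
```

With `lam = 1` the transformation leaves the data unchanged, which is a convenient sanity check on the implementation.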


Figure 2.2: (a) Distribution of maximal cross-sectional area of lipid-rich necrotic core (b) The distribution of the Yeo-Johnson transformed maximal cross-sectional area of lipid-rich necrotic core.

Figure 2.3: Visualization of the imaging predictor correlation matrix.

Next, we will remove predictors that are highly correlated (\(r^2 > 0.9\)) with other predictors. The correlation among the imaging predictors can be seen in the heatmap in Figure 2.3, where the order of the columns and rows is determined by a clustering algorithm. Here, there are three pairs of predictors that show unacceptably high correlations:

vessel wall volume in \(mm^3\) (WallVol) and matrix volume (MATXVol),
maximum cross-sectional wall area in \(mm^2\) (MaxWallArea) and maximum matrix area (MaxMATXArea), and
maximum cross-sectional stenosis based on area (MaxStenosisByArea) and maximum cross-sectional stenosis based on diameter (MaxStenosisByDiameter).

These three pairs are highlighted in red boxes along the diagonal of the correlation matrix in Figure 2.3. It is easy to understand why the third pair has high correlation, since the calculation for area is a function of diameter. Hence, we need only one of these representations for modeling. While only three pairs of predictors cross the high-correlation threshold, several other pockets of predictors approach it. For example, calcified volume (CALCVol) and maximum cross-sectional calcified area in \(mm^2\) (MaxCALCArea) (\(r = 0.87\)), and maximal cross-sectional area of lipid-rich necrotic core (MaxLRNCArea) and volume of lipid-rich necrotic core (LRNCVol) (\(r = 0.8\)), have moderately strong positive correlations but do not cross the threshold. These can be seen in a large block of blue points along the diagonal. The correlation threshold is arbitrary and may need to be raised or lowered depending on the problem and the models to be used. Chapter 3 contains more details on this approach.
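A correlation filter of this kind can be sketched as below. The sketch makes several assumptions: the threshold is applied to the squared correlation, as the text states it; the rule for choosing which member of a pair to drop (the one with the larger average squared correlation with the remaining predictors) is one common heuristic and not necessarily the one used here; and the data and column names are simulated and hypothetical.

```python
import numpy as np

def high_correlation_filter(X, names, threshold=0.9):
    """Greedily drop predictors until no pair has r**2 above `threshold`.

    Returns (kept_names, dropped_names).
    """
    r2 = np.corrcoef(X, rowvar=False) ** 2   # squared pairwise correlations
    np.fill_diagonal(r2, 0.0)                # ignore self-correlation
    keep = list(range(X.shape[1]))
    dropped = []
    while True:
        sub = r2[np.ix_(keep, keep)]
        if sub.max() <= threshold:
            break
        # Most strongly correlated remaining pair.
        i, j = np.unravel_index(np.argmax(sub), sub.shape)
        i, j = int(i), int(j)
        # Heuristic: drop the member that is, on average, more correlated
        # with the other remaining predictors.
        worst = i if sub[i].mean() >= sub[j].mean() else j
        dropped.append(names[keep.pop(worst)])
    return [names[k] for k in keep], dropped

# Simulated example: an area-like measurement that is a function of a
# diameter-like measurement, plus one unrelated predictor.
rng = np.random.default_rng(1)
diameter = rng.uniform(20.0, 80.0, size=200)
area = diameter ** 2 / 100.0 + rng.normal(0.0, 1.0, size=200)
other = rng.normal(size=200)

X = np.column_stack([diameter, area, other])
names = ["StenosisByDiameter", "StenosisByArea", "Unrelated"]
kept, dropped = high_correlation_filter(X, names)
print(kept, dropped)
```

Raising or lowering `threshold` changes how aggressively predictors are removed, which mirrors the point above that the cutoff is arbitrary and problem dependent.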