9.5 Exploiting Correlation

The previous steps of estimating and removing the baseline and reducing noise refine the profiles and enhance the true signal that is related to the response. These steps, however, do not reduce the between-wavelength correlation within each sample, which remains a problematic characteristic for many predictive models.

Reducing between-wavelength correlation can be accomplished using several approaches previously described in Section 6.3, such as principal component analysis (PCA), kernel PCA, or independent component analysis. These techniques perform dimension reduction on the predictors across all of the samples. Notice that this is a different approach than the steps previously described; specifically, the baseline correction and noise reduction steps are performed within each sample. In the case of traditional PCA, the predictors are condensed in such a way that the variation across samples is maximized with respect to all of the predictor measurements. As noted earlier, PCA is ignorant of the response and may not produce predictive features. While this technique may solve the issue of between-predictor correlations, it isn’t guaranteed to produce effective models.

Figure 9.8: PCA dimension reduction applied across all small-scale data. (a) A scree plot of the cumulative variability explained across components. (b) Scatterplots of Glucose and the first three principal components.

When PCA is applied to the small-scale data, the amount of variation in intensity values across wavelengths can be summarized with a scree plot (Figure 9.8(a)). For these data, 11 components explain approximately 80 percent of the predictor variation, while 33 components explain nearly 90 percent of the variation. The relationships between the response and each of the first three new components are shown in Figure 9.8(b). The new components are uncorrelated and, in that respect, are now an ideal input to any predictive model. However, none of the first three components has a strong relationship with the response. When faced with highly correlated predictors, and when the goal is to find the optimal relationship between the predictors and the response, a more effective alternative technique is partial least squares (described in Section 6.3.1.5).
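As a rough illustration of how the quantities behind a scree plot can be computed, the following sketch runs PCA through a singular value decomposition on synthetic profiles. The data, the function names `pca_summary` and `components_needed`, and the 0.8 and 0.9 targets are assumptions made for this example; they are not the bioreactor data or the code used to produce Figure 9.8.

```python
import numpy as np

def pca_summary(X):
    """Return PCA scores and the cumulative fraction of variance explained."""
    Xc = X - X.mean(axis=0)                       # center each wavelength (column)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    scores = U * s                                # sample coordinates on each component
    var = s ** 2                                  # proportional to component variances
    return scores, np.cumsum(var) / var.sum()

def components_needed(cum_var, target):
    """Smallest number of components whose cumulative variance reaches `target`."""
    return int(np.searchsorted(cum_var, target) + 1)

# Hypothetical data: 40 samples x 300 wavelengths built from three smooth peaks plus noise.
rng = np.random.default_rng(0)
grid = np.linspace(0, 1, 300)
basis = np.stack([np.exp(-0.5 * ((grid - c) / 0.05) ** 2) for c in (0.2, 0.5, 0.8)])
X = rng.normal(size=(40, 3)) @ basis + 0.05 * rng.normal(size=(40, 300))

scores, cum_var = pca_summary(X)
n80 = components_needed(cum_var, 0.80)
n90 = components_needed(cum_var, 0.90)

# The component scores are mutually uncorrelated: off-diagonal covariances are ~0.
score_cov = np.cov(scores, rowvar=False)
off_diag = score_cov - np.diag(np.diag(score_cov))
```

Note that nothing in this computation uses the response, which is why uncorrelated components are not necessarily predictive ones.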

A second approach to reducing correlation is through first-order differentiation within each profile. To compute first-order differentiation, the intensity at the \((p-1)^{st}\) value in the profile is subtracted from the intensity at the \(p^{th}\) value in the profile. This difference represents the rate of change of the intensity between consecutive measurements in the profile. Larger changes correspond to a larger movement and could potentially be related to the signal in the response. Moreover, calculating the first-order difference makes the new values relative to the previous value and removes the relationship with values that are 2 or more steps away from the current value. This means that the autocorrelation across the profile should be greatly reduced. This is directly related to the equation shown in the introduction to this chapter; large positive covariances, such as those seen between wavelengths, can drastically reduce noise and variation in the differences.
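The differencing step and its effect on variance can be sketched in a few lines. This example assumes that the equation referred to above is the standard identity \(Var[X - Y] = Var[X] + Var[Y] - 2Cov[X, Y]\); the two simulated "wavelengths" and the function name `first_difference` are hypothetical and serve only to show the arithmetic.

```python
import numpy as np

def first_difference(profile):
    """Return x[p] - x[p-1] for each consecutive pair (one value shorter than the input)."""
    profile = np.asarray(profile, dtype=float)
    return profile[1:] - profile[:-1]             # equivalent to np.diff(profile)

# Hypothetical data: 500 samples measured at two adjacent, highly correlated wavelengths.
rng = np.random.default_rng(1)
shared = rng.normal(size=500)                     # signal common to both wavelengths
x = shared + 0.1 * rng.normal(size=500)
y = shared + 0.1 * rng.normal(size=500)

# Variance of the difference, computed directly and through the identity.
cov_xy = np.cov(x, y)[0, 1]                       # np.cov uses ddof=1 by default
var_diff_direct = np.var(x - y, ddof=1)
var_diff_identity = np.var(x, ddof=1) + np.var(y, ddof=1) - 2 * cov_xy
```

Because the covariance between the two simulated wavelengths is large and positive, the variance of their difference is much smaller than the variance of either wavelength alone.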

Figure 9.9: Autocorrelations before and after taking derivatives of the spectra.

Figure 9.9 shows the autocorrelation values across the first 200 lags within the profile for the first day and bioreactor before and after the derivatives were calculated. The autocorrelations drop dramatically, and only the first 3 lags have correlations greater than 0.95.
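A comparison of this kind can be reproduced in outline with a small autocorrelation helper. The sketch below uses a simulated smooth profile (a cumulative sum of lightly smoothed noise), not the bioreactor spectra, so its values will not match Figure 9.9; the function name `autocorrelation` and the simulation settings are assumptions of this example.

```python
import numpy as np

def autocorrelation(x, max_lag):
    """Sample autocorrelation of a 1-D series at lags 1..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[:-k], x[k:]) / denom for k in range(1, max_lag + 1)])

# Hypothetical profile: cumulative sum of a 5-point moving average of white noise,
# so that neighboring values are strongly related, as in a smooth spectrum.
rng = np.random.default_rng(42)
noise = rng.normal(size=2004)
smooth_steps = np.convolve(noise, np.ones(5) / 5, mode="valid")   # length 2000
profile = np.cumsum(smooth_steps)

acf_raw = autocorrelation(profile, max_lag=200)             # before differencing
acf_diff = autocorrelation(np.diff(profile), max_lag=200)   # after differencing
```

In this simulation the raw profile stays highly autocorrelated over many lags, while the differenced series retains noticeable correlation only at the first few lags, which mirrors the qualitative pattern described above.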

The original profile of the first day of the first small-scale bioreactor and the profile of the baseline corrected, standardized, and first-order differenced version of the same bioreactor are compared in Figure 9.10. These steps show how the within-spectra drift has been removed and most of the trends that are unrelated to the peaks have been minimized.

Figure 9.10: Spectra for the first day of the first small-scale bioreactor where the preprocessing steps have been sequentially applied.

While the number of lagged differences that are highly correlated is small, these will still pose a problem for predictive models. One solution would be to select every \(m^{th}\) difference within each profile, where \(m\) is chosen such that the autocorrelation at that lag falls below a threshold such as 0.9 or 0.95. Another solution would be to filter out highly correlated differences using the correlation filter (Section 2) across all profiles.
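The first of these solutions can be sketched as follows, reading "every \(m^{th}\)" as keeping every \(m^{th}\) value along the profile. The autocorrelation values here are a made-up geometric decay used purely for illustration, and the helper names `first_lag_below` and `thin` are hypothetical.

```python
import numpy as np

def first_lag_below(acf, threshold):
    """Smallest lag (1-based) whose autocorrelation is below `threshold`, or None."""
    below = np.flatnonzero(np.asarray(acf) < threshold)
    return int(below[0]) + 1 if below.size else None

def thin(values, m):
    """Keep every m-th value along the profile, starting with the first."""
    return np.asarray(values)[::m]

# Hypothetical autocorrelations at lags 1..200: 0.98, 0.98**2, 0.98**3, ...
acf = 0.98 ** np.arange(1, 201)

m_95 = first_lag_below(acf, 0.95)     # lag at which the correlation first drops below 0.95
m_90 = first_lag_below(acf, 0.90)     # a stricter threshold gives a larger spacing

differences = np.arange(20)           # stand-in for one differenced profile
kept = thin(differences, m_95)
```

A stricter threshold discards more of the profile, so the choice of \(m\) trades residual correlation against the number of predictors that remain.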