10.5 A Case Study

One of our recent collaborations highlights the perils of incorrectly combining feature selection and resampling. In this problem, a researcher had collected 75 samples from each of the two classes. There were approximately 10,000 predictors for each data point. The ultimate goal was to attempt to identify a subset of predictors that had the ability to classify samples into the correct response with an accuracy of at least 80%.

The researcher chose to use 70% of the data for a training set, 10-fold cross-validation for model training, and the implicit feature selection methods of the glmnet and random forest. The best cross-validation accuracy found after a broad search through the tuning parameter space of each model was just under 60%, far from the desired performance. Reducing the dimensionality of the problem might help, so principal component analysis followed by linear discriminant analysis and partial least squares discriminant analysis were tried next. Unfortunately, the cross-validation accuracy of these methods was even worse.

The researcher surmised that part of the challenge in finding a predictive subset of variables was the vast number of irrelevant predictors in the data. The logic was then to first identify and select predictors that had a univariate signal with the response. A t-test was performed for each predictors, and predictors were ranked by significance of separating the classes. The top 300 predictors were selected for modeling.

As convincing proof that this selection approach was able to identify predictors that captured the best predictors for classifying the response, the researcher performed principal component analysis on the 300 predictors and plotted the samples projected onto first two components colored by response category. This plot revealed near perfect classification between groups, and the researcher concluded that there was good evidence of predictors that could classify the samples.

The first two principal components applied to the top features identified using a feature selection routine outside of resampling.  The predictors and response on which this procedure was performed were randomly generated, indicating that feature selection outside of resampling leads to spurious results.

Figure 10.4: The first two principal components applied to the top features identified using a feature selection routine outside of resampling. The predictors and response on which this procedure was performed were randomly generated, indicating that feature selection outside of resampling leads to spurious results.

Regrettably, because feature selection was performed outside of resampling, the apparent signal that was found was simply due to random chance. In fact, the univariate feature selection process used by the researcher can be shown to produce perfect separation between two groups in completely random data. To illustrate this point we have generated a 150 by 10,000 matrix of random standard normal numbers (\(\mu = 0\) and \(\sigma = 1\)). The response was also a random selection such that 75 samples were in each class. A t-test was performed for each predictor and the top 300 predictors were selected. The plot of the samples onto the first two principal components colored by response status is displayed in Figure 10.4.

Clearly having good separation using this procedure does not imply that the selected features have the ability to predict the response. Rather, it is as likely that the apparent separation is due to random chance and an improper feature selection routine.