3.8 Feature Engineering Without Overfitting

Section 3.5 discussed approaches to tuning models using resampling. The crux of this approach was to evaluate a parameter value on data that were not used to build the model. This is not a controversial notion. In a way, the assessment set data are used as a potentially dissenting opinion on the validity of a particular tuning value. It is fairly clear that simply re-predicting the training set does not include such an opportunity.

As will be stated many times in subsequent chapters, the same idea should be applied to any other feature-related activity, such as engineering new features or encodings, or deciding whether to include a new term in the model. There should always be a chance for an independent piece of data to evaluate the appropriateness of the design choice.

For example, for the Chicago train data, some days had their ridership numbers drastically over-predicted. The reason for the over-prediction can be determined by investigating the training data. The root of the problem is holidays and ancillary trends that need to be addressed in the predictors (e.g., “Christmas on a Friday” and so on)³⁴. For a support vector machine, the impact of these new features is shown in Figure 3.12. The panels show the assessment set predictions for the two years that were resampled. The dashed line shows the predictions prior to adding holiday predictors and the solid line corresponds to the model after these features have been added. Clearly, these predictors helped improve performance during these time slices. This had an obvious commonsense justification, but the assessment sets were evaluated to confirm the results.

However, suppose that exploratory data analysis clearly showed that there were decreases in ridership when the Chicago Bulls and Bears had away games. Even if this hypothetical pattern had a very clear signal in the training set, resampling would be the best method to determine if it held up when some of the training data were held back as an unbiased control.
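The comparison described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the analysis used for the Chicago data: the data are simulated, the column names (`weekend`, `holiday`, `away_game`) are hypothetical, the model is a simple table of group means rather than a support vector machine, and plain V-fold resampling is used for brevity. In the simulation, `holiday` has a real effect on the outcome and `away_game` has none.

```python
import random
import statistics
from collections import defaultdict


def simulate_days(n=400, seed=1):
    """Simulated (hypothetical) daily ridership in thousands.

    Weekends and weekday holidays lower ridership; away_game is pure noise.
    """
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        weekend = int(i % 7 >= 5)
        holiday = int(rng.random() < 0.05)
        away_game = int(rng.random() < 0.3)
        y = 20.0 - 12.0 * weekend - 10.0 * holiday * (1 - weekend)
        y += rng.gauss(0, 1.5)
        rows.append({"weekend": weekend, "holiday": holiday,
                     "away_game": away_game, "y": y})
    return rows


def fit_group_means(rows, features):
    """Toy model: predict the mean outcome of each feature combination."""
    groups = defaultdict(list)
    for r in rows:
        groups[tuple(r[f] for f in features)].append(r["y"])
    overall = statistics.fmean([r["y"] for r in rows])
    means = {key: statistics.fmean(vals) for key, vals in groups.items()}
    # Fall back to the overall mean for combinations unseen during fitting.
    return lambda r: means.get(tuple(r[f] for f in features), overall)


def resampled_rmse(rows, features, folds=10, seed=2):
    """Average assessment-set RMSE over V-fold resampling."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    rmses = []
    for k in range(folds):
        held_out = set(idx[k::folds])
        analysis = [rows[i] for i in idx if i not in held_out]
        assessment = [rows[i] for i in held_out]
        predict = fit_group_means(analysis, features)  # fit on analysis set only
        mse = statistics.fmean([(r["y"] - predict(r)) ** 2 for r in assessment])
        rmses.append(mse ** 0.5)
    return statistics.fmean(rmses)


rows = simulate_days()
base = resampled_rmse(rows, ["weekend"])
with_holiday = resampled_rmse(rows, ["weekend", "holiday"])
with_away = resampled_rmse(rows, ["weekend", "holiday", "away_game"])
print(f"baseline: {base:.2f}  + holiday: {with_holiday:.2f}  "
      f"+ away_game: {with_away:.2f}")
```

The candidate feature is judged only by its effect on the held-out rows. A feature that reflects a real pattern should lower the resampled error, while one that only appeared promising in the training set should leave it roughly unchanged or make it worse.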


Figure 3.12: The effect of adding holiday predictors to an SVM model created for the Chicago train data. The dashed line corresponds to the model of the original predictors. The solid line corresponds to the model when an indicator of holiday was included as a predictor.

As another example, in Section 5.6, the OkCupid data were used to derive keyword features from text essay data. The approach taken there was to use a small, randomly chosen subset of the training set to discover the keyword features. This subset was kept in the training set and resampled as normal during model fitting. Because the discovery data were part of the training set, we would expect the resampling estimates to be slightly optimistically biased. Despite this, a fairly large portion of the data was not involved in discovering these features and is still available to determine their validity³⁵.
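The idea of discovering features on one subset and checking them on rows that played no part in the discovery can be sketched as follows. Everything here is an assumption made for illustration: the essays are simulated toy strings, and `split_discovery`, `rate_gap`, and `discover_keywords` are hypothetical helpers, not the procedure used in Section 5.6. A word is flagged as a keyword when the proportion of essays containing it differs between the two classes by at least `min_gap`.

```python
import random
from collections import Counter


def split_discovery(n_rows, discovery_fraction=0.1, seed=0):
    """Randomly set aside a small discovery subset of row indices."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    n_disc = max(1, int(n_rows * discovery_fraction))
    return sorted(idx[:n_disc]), sorted(idx[n_disc:])


def rate_gap(word, texts, labels):
    """Class 1 minus class 0 proportion of essays that contain the word."""
    hits, totals = Counter(), Counter(labels)
    for text, label in zip(texts, labels):
        if word in text.lower().split():
            hits[label] += 1
    return hits[1] / max(totals[1], 1) - hits[0] / max(totals[0], 1)


def discover_keywords(texts, labels, min_gap=0.4):
    """Words whose usage rate differs between classes by at least min_gap."""
    vocabulary = sorted({w for t in texts for w in t.lower().split()})
    return [w for w in vocabulary
            if abs(rate_gap(w, texts, labels)) >= min_gap]


# Simulated essays: "code" is far more common in class 1; fillers are noise.
rng = random.Random(42)
fillers = ["travel", "music", "food", "movies", "friends", "books"]
texts, labels = [], []
for i in range(200):
    label = i % 2
    words = rng.sample(fillers, 3)
    if rng.random() < (0.9 if label == 1 else 0.05):
        words.append("code")
    texts.append(" ".join(words))
    labels.append(label)

discovery_idx, remaining_idx = split_discovery(len(texts), 0.2)
keywords = discover_keywords([texts[i] for i in discovery_idx],
                             [labels[i] for i in discovery_idx])

# Check each discovered keyword on rows that played no part in discovery.
held_texts = [texts[i] for i in remaining_idx]
held_labels = [labels[i] for i in remaining_idx]
for word in keywords:
    print(word, round(rate_gap(word, held_texts, held_labels), 2))
```

A keyword that was flagged only because of chance variation in the small discovery subset would tend to show a gap near zero on the remaining rows, which is the kind of dissenting opinion that the independent data are meant to provide.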

In any case, it is best to follow up trends discovered during exploratory data analysis with “stress testing” using resampling to get a more objective sense of whether a trend or pattern is likely to be real.

³⁴ These holiday-related indicators comprise Set 5 in Figure 1.7.
³⁵ As noted in the last paragraph of Section 3.4.7, a better approach would be to allocate a separate data set from the training and test sets and to use this for discovery. This is most useful when there is an abundance of data.