3.4 Resampling

As previously discussed, there are times when we need to understand the effectiveness of a model without resorting to the test set. Simply re-predicting the training set is problematic, so a procedure is needed to obtain an appraisal using the training set alone. Resampling methods are used for this purpose.

Resampling methods can generate different versions of our training set that can be used to simulate how well models would perform on new data. These techniques differ in terms of how the resampled versions of the data are created and how many iterations of the simulation process are conducted. In each case, a resampling scheme generates a subset of the data to be used for modeling and another that is used for measuring performance. Here, we will refer to the former as the “analysis set” and the latter as the “assessment set”. They are roughly analogous to the training and test sets described at the beginning of the chapter^24. A graphic of an example data hierarchy with three resamples is shown in Figure 3.5.


Figure 3.5: A diagram of typical data usage with B resamples of the training data.

Before proceeding, we should be aware of what is being resampled. The independent experimental unit is the unit of data that is as statistically independent as possible from the other data. For example, for the Ames data, it would be reasonable to consider each house to be independent of the other houses. Since houses are contained in rows of the data, each row is allocated to either the analysis set or the assessment set. However, consider a situation where a company’s customer is the independent unit but the data set contains multiple rows per customer. In this case, each customer would be allotted to the analysis or assessment set and all of their corresponding rows would move with them. Such non-standard arrangements are discussed more in Chapter 9.

There are a number of different flavors of resampling that will be described in the next four sections.

3.4.1 V-Fold Cross-Validation and Its Variants

Simple V-fold cross-validation creates V different versions of the original training set that are approximately the same size. Each of the V assessment sets contains 1/V of the training set, and each of these excludes a different set of data points. The analysis sets contain the remainder (typically called the “folds”). Suppose V = 10; then there are 10 different versions of 90% of the data and also 10 versions of the remaining 10% for each corresponding resample.

To use V-fold cross-validation, a model is created on the first analysis set and the corresponding assessment set is predicted by the model. The assessment set predictions are summarized using the chosen performance measures (e.g., RMSE, the area under the ROC curve, etc.) and these statistics are saved. This process proceeds in a round-robin fashion so that, in the end, there are V estimates of performance for the model, each computed on a different assessment set. The cross-validation estimate of performance is the average of the V individual metrics.
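
To make the round-robin procedure concrete, the following is a minimal sketch in Python using scikit-learn; the simulated regression data, linear model, and RMSE metric are placeholders rather than any analysis from this chapter.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Simulated stand-in for a training set (not the data used in the text).
X, y = make_regression(n_samples=200, n_features=10, noise=5, random_state=1)

cv = KFold(n_splits=10, shuffle=True, random_state=1)
fold_rmse = []
for analysis_idx, assessment_idx in cv.split(X):
    # Fit on the analysis set ...
    model = LinearRegression().fit(X[analysis_idx], y[analysis_idx])
    # ... and measure performance on the held-out assessment set.
    pred = model.predict(X[assessment_idx])
    fold_rmse.append(np.sqrt(mean_squared_error(y[assessment_idx], pred)))

# The cross-validation estimate is the average of the V per-fold statistics.
print(f"10-fold RMSE estimate: {np.mean(fold_rmse):.2f}")
```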

Figure 3.6 shows a diagram of 10-fold cross-validation for a hypothetical data set with 20 training set samples. For each resample, two different training set data points are held out for the assessment set. Note that the assessment sets are mutually exclusive and contain different instances.

Stratified splitting techniques can also be applied here to make sure that the analysis and assessment sets produce the same frequency distribution of the outcome. Again, this is a good idea when a continuous outcome is skewed or a categorical outcome is imbalanced, but is unlikely to be problematic otherwise.

For example, for the OkCupid data, stratified 10-fold cross-validation was used. The training set consists of 38,809 profiles and each of the 10 assessment sets contains 3,880 different profiles. The area under the ROC curve was used to measure the performance of the logistic regression model previously mentioned. The 10 areas under the curve ranged from 0.830 to 0.854 and their average value was 0.839. Without using the test set, we can use this statistic to forecast how well this model would perform on new data.

As will be discussed in Section 3.4.6, resampling methods have different characteristics. One downside to basic V-fold cross-validation is that it is relatively noisy (i.e., has more variability) than other resampling schemes. One way to compensate for this is to conduct repeated V-fold cross-validation. If R repeats are used, the V resamples are created R separate times and, in the end, the RV resamples are averaged to estimate performance. Since more data are being averaged, the standard error of the final average decreases by a factor of \(\sqrt{R}\) (using a Gaussian approximation^25). Again, the noisiness of this procedure is relative and, as one might expect, is driven by the amount of data in the assessment set. For the OkCupid data, each area under the ROC curve was computed from 3,880 profiles and is likely to yield a sufficiently precise estimate (even if we only expect about 716 of them to be STEM profiles).
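
As a rough illustration of repeated, stratified cross-validation (not the actual OkCupid analysis), the scheme might be set up as follows; the simulated imbalanced data and the logistic regression model are stand-ins.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Simulated imbalanced two-class data as a stand-in for the profiles.
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.8, 0.2],
                           random_state=2)

# R = 5 repeats of stratified 10-fold cross-validation -> 50 resamples.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=2)
aucs = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                       cv=cv, scoring="roc_auc")

print(f"Mean ROC AUC over {len(aucs)} resamples: {aucs.mean():.3f}")
```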

The assessment sets can be used for model validation and diagnosis. Table 3.1 and Figure 3.2 use these holdout predictions to visualize model performance. Also, Section 4.4 has a more extensive description of how the assessment data sets can be used to drive improvements to models.

One other variation, leave-one-out cross-validation, has V equal to the size of the training set. This is a somewhat deprecated technique and may only be useful when the training set size is extremely small (Shao 1993).

In the previous discussion related to the independent experimental unit, an example was given where customers should be resampled and all of the rows associated with each customer would go into either the analysis or assessment set. To do this, V-fold cross-validation would be applied to the customer identifiers rather than to individual rows. In general, this is referred to as either “grouped V-fold cross-validation” or “leave-group-out cross-validation”, depending on the value of V.
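
A minimal sketch of grouped cross-validation, assuming a hypothetical customer_id vector that defines the independent experimental unit, might look like this.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(3)

# Hypothetical data: several rows per customer; the customer is the independent unit.
n_rows = 300
customer_id = rng.integers(0, 60, size=n_rows)   # 60 hypothetical customers
X = rng.normal(size=(n_rows, 5))
y = rng.normal(size=n_rows)

cv = GroupKFold(n_splits=10)
for analysis_idx, assessment_idx in cv.split(X, y, groups=customer_id):
    # No customer appears in both the analysis and assessment sets.
    assert set(customer_id[analysis_idx]).isdisjoint(customer_id[assessment_idx])
```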

3.4.2 Monte Carlo Cross-Validation

V-fold cross-validation produces V splits with mutually exclusive assessment sets. Monte Carlo resampling produces splits whose assessment sets are likely to overlap. For each resample, a random sample is taken with a proportion \(\pi\) of the training set going into the analysis set and the remaining samples allocated to the assessment set. Like the previous procedure, a model is created on the analysis set and the assessment set is used to evaluate the model. This splitting procedure is conducted B times and the average of the B results is used to estimate future performance. B is chosen to be large enough so that the average of the B values has an acceptable amount of precision.


Figure 3.6: A diagram of two types of cross-validation for a training set containing 20 samples.

Figure 3.6 also shows Monte Carlo cross-validation with 10 resamples and \(\pi = 0.90\). Note that, unlike 10-fold cross-validation, some of the same data points appear in more than one assessment set.
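
A sketch of Monte Carlo cross-validation with \(\pi = 0.90\) and B = 10, again on placeholder data with a placeholder model, could look like the following.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import ShuffleSplit

X, y = make_regression(n_samples=200, n_features=10, noise=5, random_state=4)

# B = 10 random 90%/10% splits; assessment sets may overlap across resamples.
cv = ShuffleSplit(n_splits=10, train_size=0.90, random_state=4)
rmse = []
for analysis_idx, assessment_idx in cv.split(X):
    model = LinearRegression().fit(X[analysis_idx], y[analysis_idx])
    pred = model.predict(X[assessment_idx])
    rmse.append(np.sqrt(mean_squared_error(y[assessment_idx], pred)))

print(f"Monte Carlo CV RMSE estimate: {np.mean(rmse):.2f}")
```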

3.4.3 The Bootstrap

A bootstrap resample of the data is defined to be a simple random sample that is the same size as the training set, where the data are sampled with replacement (Davison and Hinkley 1997). This means that, when a bootstrap resample is created, there is a 63.2% chance that any given training set member is included in the bootstrap sample at least once. The bootstrap resample is used as the analysis set, and the assessment set, sometimes known as the out-of-bag sample, consists of the members of the training set not included in the bootstrap sample. As before, bootstrap sampling is conducted B times and the same modeling/evaluation procedure is followed to produce a bootstrap estimate of performance that is the mean of the B results.


Figure 3.7: A diagram of bootstrap resampling for a training set containing 20 samples. The colors represent how many times a data point is replicated in the analysis set.

Figure 3.7 shows an illustration of ten bootstrap samples created from a 20-sample data set. The colors show that several training set points are selected multiple times for the analysis set. The assessment set would consist of the rows that have no color.
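
A sketch of the bootstrap procedure with out-of-bag assessment, using placeholder data, a linear model, and an arbitrary B = 25, might look like this.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=200, n_features=10, noise=5, random_state=5)
rng = np.random.default_rng(5)
n = len(y)

B = 25
oob_rmse = []
for _ in range(B):
    # Sample n rows with replacement to form the analysis (bootstrap) set.
    boot_idx = rng.integers(0, n, size=n)
    # The out-of-bag rows (not selected at all) form the assessment set.
    oob_idx = np.setdiff1d(np.arange(n), boot_idx)
    model = LinearRegression().fit(X[boot_idx], y[boot_idx])
    pred = model.predict(X[oob_idx])
    oob_rmse.append(np.sqrt(mean_squared_error(y[oob_idx], pred)))

print(f"Bootstrap (out-of-bag) RMSE estimate: {np.mean(oob_rmse):.2f}")
```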

3.4.4 Rolling Origin Forecasting

This procedure is specific to time-series data or any data set with a strong temporal component (Hyndman and Athanasopoulos 2013). If there are seasonal or other chronic trends in the data, random splitting of the data between the analysis and assessment sets may disrupt the model’s ability to estimate these patterns.

In this scheme, the first analysis set consists of the first M training set points, assuming that the training set is ordered by time or another temporal component. The assessment set consists of the next N training set samples. The second resample keeps the data set sizes the same but increments the analysis set to use samples 2 through \(M+1\), while the assessment set contains samples \(M+2\) to \(M+N+1\). The splitting scheme proceeds until there are no more data left to produce sets of the same sizes. Supposing that this results in B splits of the data, the same process is used for modeling and evaluation and, in the end, there are B estimates of performance, one from each assessment set. A simple method for estimating performance is again to average these B values. However, it should be understood that, since there is significant overlap between the rolling assessment sets, the B statistics themselves constitute a time series and might also display seasonal or other temporal effects.


Figure 3.8: A diagram of rolling origin forecasting resampling with \(M\) = 10 and \(N\) = 2.

Figure 3.8 shows this type of resampling where 10 data points are used for analysis and the subsequent two training set samples are used for assessment.
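
The index arithmetic for this fixed-window scheme can be sketched directly; the training set size, M, and N below simply mirror the figure and are otherwise arbitrary.

```python
import numpy as np

# Placeholder time-ordered training set; rows are assumed to be sorted by time.
n_train = 20
M, N = 10, 2   # analysis and assessment window sizes

splits = []
start = 0
while start + M + N <= n_train:
    analysis_idx = np.arange(start, start + M)
    assessment_idx = np.arange(start + M, start + M + N)
    splits.append((analysis_idx, assessment_idx))
    start += 1   # move the origin by one sample (or skip ahead to thin the resamples)

for a_idx, s_idx in splits:
    print(f"analysis {a_idx[0]:2d}-{a_idx[-1]:2d} | assessment {s_idx[0]:2d}-{s_idx[-1]:2d}")
```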

There are a number of variations of the procedure:

  • The analysis sets need not be the same size. They can cumulatively grow as the moving window proceeds along the training set. In other words, the first analysis set would contain M data points, the second would contain M + 1, and so on. This is the approach taken with the Chicago train data modeling and is described in Chapter 4.
  • The splitting procedure could skip iterations to produce fewer resamples. For example, in the Chicago data, there are daily measurements from 2001 to 2016. Incrementing by one day would produce an excessive value of B. For these data, 13 samples were skipped so that the splitting window moves in two-week blocks instead of by individual day.
  • If the training data are unevenly sampled, the same procedure can be used but moves over time increments rather than data set row increments. For example, the window could move over 12-hour periods for the analysis sets and 2-hour periods for the assessment sets.

This resampling method differs from the previous ones in at least two ways: the splits are not random, and the assessment set is not simply the remainder of the training set once the analysis set has been removed.

3.4.5 Validation Sets

A validation set is something in-between the training and test sets. Historically, in the neural network literature, there was a recognition that using the same data to estimate parameters and to measure performance was problematic. Often, a small set of data was allocated to determine the error rate of the neural network for each iteration of training (also known as a training epoch in that patois). This validation set would be used to estimate performance as the model is being trained, in order to determine when it begins to overfit. Often, the validation set is a random subset of the training set so that it changes for each iteration of the numerical optimizer. When resampling this type of model, it would be reasonable to use a small, random portion of the analysis set to serve as a within-resample validation set.
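
Within a single resample, carving a validation set out of the analysis set might look like the following sketch; the 20% fraction and the simulated data are arbitrary choices, not recommendations from the text.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)

# Placeholder analysis set from a single resample.
X_analysis = rng.normal(size=(180, 10))
y_analysis = rng.integers(0, 2, size=180)

# Hold out a small random portion of the analysis set to monitor training
# (e.g., to decide when a neural network begins to overfit). The assessment
# set is still reserved for estimating performance afterwards.
X_fit, X_val, y_fit, y_val = train_test_split(
    X_analysis, y_analysis, test_size=0.20, random_state=7)
```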

Does this replace the test set (or, analogously, the assessment set)? No. Since the validation data are guiding the training process, they can’t be used for a fair assessment of how well the modeling process is working.

3.4.6 Variance and Bias in Resampling

In Section 1.2.5, variance and bias properties of models were discussed. Resampling methods have the same properties but their effects manifest in different ways. Variance is more straightforward to conceptualize. If you were to conduct 10-fold cross-validation many times on the same data set, the variation in the resampling scheme could be measured by determining the spread of the resulting averages. This variation could be compared to a different scheme that was repeated the same number of times (and so on) to get relative comparisons of the amount of noise in each scheme.
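
One way to probe this empirically is to repeat the entire resampling procedure many times on the same data and examine the spread of the resulting averages; this sketch uses placeholder data and a linear model.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=200, n_features=10, noise=5, random_state=8)

estimates = []
for seed in range(50):
    # Re-run 10-fold cross-validation with a different random fold assignment.
    cv = KFold(n_splits=10, shuffle=True, random_state=seed)
    scores = cross_val_score(LinearRegression(), X, y, cv=cv,
                             scoring="neg_root_mean_squared_error")
    estimates.append(-scores.mean())

# The spread of these averages reflects the variance of the scheme itself.
print(f"SD of the 10-fold CV estimates: {np.std(estimates):.3f}")
```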

Bias is the ability of a particular resampling scheme to hit the true underlying performance parameter (which we will never truly know). Generally speaking, as the amount of data in the analysis set shrinks, the resampling estimate’s bias increases. In other words, the bias of 10-fold cross-validation is smaller than the bias of 5-fold cross-validation. For Monte Carlo resampling, the bias obviously depends on the value of \(\pi\). However, through simulations, one can see that 10-fold cross-validation has less bias than Monte Carlo cross-validation when \(\pi = 0.10\) and B = 10 are used. Leave-one-out cross-validation would have very low bias since each analysis set is only one sample smaller than the training set.

Figure 3.9 contains a graphical representation of variance and bias in resampling schemes, where the curves represent the distribution of the resampling statistics if the same procedure were conducted on the same data set many times. Four possible variance/bias cases are represented. We will assume that the model metric being measured here is better when the value is large (such as \(R^2\) or sensitivity) and that the true value is represented by the green vertical line. The upper right panel demonstrates a pessimistic bias, since the values tend to be smaller than the true value. The panel below it, in the lower right, shows a resampling scheme that has relatively low variance and a distribution centered on the true value (so that it is nearly unbiased).

In general, for a fixed training set size and number of resamples, simple V-fold cross-validation is believed to be the noisiest of the methods discussed here and the bootstrap is the least variable^26. The bootstrap is understood to be the most biased (since about 36.8% of the training set is selected for assessment) and its bias is generally pessimistic (i.e., it is likely to show worse model performance than the true underlying value). There have been a few attempts at correcting the bootstrap’s bias, such as Efron (1983) and Efron and Tibshirani (1997).

While V-fold cross-validation does have inflated variance, its bias is fairly low when V is 10 or more. When the training set is not large, we recommend using five or so repeats of 10-fold cross-validation, depending on the required precision, the training set size, and other factors.


Figure 3.9: An illustration of the variance-bias differences in resampling schemes.

3.4.7 What Should Be Included Inside of Resampling?

In the preceding descriptions of resampling schemes, we have said that the analysis set is used to “build the model”. This is somewhat of a simplification. In order for any resampling scheme to produce performance estimates that generalize to new data, it must include all of the steps in the modeling process that could significantly affect the model’s effectiveness. For example, in Section 1.1, a transformation was used to modify the predictor variables and this resulted in an improvement in performance. During resampling, this step should be included in the resampling loop. Other preprocessing or filtering steps (such as PCA signal extraction, predictor correlation filters, or feature selection methods) must also be part of the resampling process in order to understand how well the model is doing and to measure when the modeling process begins to overfit.
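
One convenient way to keep such steps inside the resampling loop is to bundle them with the model so that they are re-estimated on every analysis set. The sketch below uses scikit-learn's Pipeline with placeholder steps (centering/scaling and PCA) rather than the specific preprocessing used in the text.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=30, random_state=9)

# Each preprocessing step is re-fit on the analysis set of every resample,
# so its effect (and variability) is reflected in the performance estimates.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=10)),
    ("model", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=9)
aucs = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc")
print(f"Resampled ROC AUC with preprocessing inside the loop: {aucs.mean():.3f}")
```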

There are some operations that can be exempted. For example, in Chapter 2, only a handful of patients had missing values and these were imputed using the median. For such a small modification, we did not include these steps inside of resampling. In general though, imputation can have a substantial impact on performance and its variability should be propagated into the resample results. Centering and scaling can also be exempted from resampling, all other things being equal.

As another example, the OkCupid training data were downsampled so that the class proportions were equal. This is a substantial data processing step and it is important to propagate the effects of this procedure through the resampling results. For this reason, the downsampling procedure is executed on the analysis set of every resample and then again on the entire training set when the final model is created.
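
A sketch of this idea, with the downsampling performed on each analysis set only (placeholder imbalanced data and model; the assessment sets keep their natural class frequencies):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=2000, n_features=20, weights=[0.85, 0.15],
                           random_state=10)
rng = np.random.default_rng(10)

aucs = []
for analysis_idx, assessment_idx in StratifiedKFold(10, shuffle=True,
                                                    random_state=10).split(X, y):
    # Downsample the majority class *within the analysis set only*.
    minority = analysis_idx[y[analysis_idx] == 1]
    majority = analysis_idx[y[analysis_idx] == 0]
    keep = rng.choice(majority, size=len(minority), replace=False)
    balanced_idx = np.concatenate([minority, keep])

    model = LogisticRegression(max_iter=1000).fit(X[balanced_idx], y[balanced_idx])
    # The assessment set is left at its natural class frequencies.
    prob = model.predict_proba(X[assessment_idx])[:, 1]
    aucs.append(roc_auc_score(y[assessment_idx], prob))

print(f"Mean ROC AUC with within-resample downsampling: {np.mean(aucs):.3f}")
```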

One other aspect of resampling is related to the concept of information leakage, which occurs when the test set data are used (directly or indirectly) during the training process. This can lead to overly optimistic results that do not replicate on future data points, and it can occur in subtle ways.

For example, suppose a data set of 120 samples has a single predictor and is sequential in nature, such as a time series. If the predictor is noisy, one method for preprocessing the data is to apply a moving average to the predictor data and replace the original data with the smoothed values. Suppose a 3-point average is used and that the first and last data points retain their original values. Our inclination would be to smooth the whole sequence of 120 data points with the moving average, then split the data into training and test sets (say, with the first 100 rows in training). The subtle issue here is that the 100th data point, the last in the training set, uses the first point in the test set to compute its 3-point average. Often the most recent data are most related to the next value in the time series. Therefore, including this most recent point will likely bias model performance (in a small way).

To provide a solid methodology, one should constrain oneself to developing the list of preprocessing techniques, estimating them using only the training data points, and then applying them to future data (including the test set). Arguably, the moving average issue cited above is most likely minor in terms of consequences, but it illustrates how easily the test data can creep into the modeling process. The correct approach for this preprocessing technique would be to split the data first and then apply the moving average smoother to the training and test sets independently.
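
A sketch of the leak-free ordering, splitting first and then smoothing the training and test series independently (pandas' centered rolling mean stands in for the 3-point smoother):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(11)
x = pd.Series(rng.normal(size=120))   # placeholder noisy predictor, time ordered

# Split first: the first 100 rows are the training set.
train, test = x.iloc[:100], x.iloc[100:]

# Then smooth each piece independently so no test set value can influence
# a training set value (min_periods=1 handles the shorter windows at the ends).
train_smoothed = train.rolling(window=3, center=True, min_periods=1).mean()
test_smoothed = test.rolling(window=3, center=True, min_periods=1).mean()
```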

Another, more overt, path to information leakage can sometimes be seen in machine learning competitions where the training and test set data are given at the same time. While the test set data often have the outcome data blinded, it is possible to “train to the test” by only using the training set samples that are most similar to the test set data. This may very well improve the model’s performance scores for this particular test set but might ruin the model for predicting on a broader data set.

Finally, with large amounts of data, alternative data usage schemes might be a good idea. Instead of a simple training/test split, multiple splits can be created for specific purposes. For example, a specific split of the data can be used to determine the relevant predictors for the model prior to model building using the training set. This would reduce the need to include the expensive feature selection steps inside of resampling. The same approach could be used for post-processing activities, such as determining an appropriate probability cutoff from a receiver operating characteristic curve.


  24. In fact, many people use the terms “training” and “testing” to describe the splits of the data produced during resampling. We avoid that here because (1) resampling is only ever conducted on the training set samples and (2) the terminology can be confusing since the same term is being used for different versions of the original data.

  25. This is based on the idea that the standard error of the mean has \(\sqrt{R}\) in the denominator.

  26. This is a general trend; there are many factors that affect these comparisons.