3.3 Data Splitting

One of the first decisions to make when starting a modeling project is how to utilize the existing data. One common technique is to split the data into two groups typically referred to as the training and testing sets²³. The training set is used to develop models and feature sets; they are the substrate for estimating parameters, comparing models, and all of the other activities required to reach a final model. The test set is used only at the conclusion of these activities for estimating a final, unbiased assessment of the model’s performance. It is critical that the test set not be used prior to this point. Looking at the test sets results would bias the outcomes since the testing data will have become part of the model development process.

How much data should be set aside for testing? It is extremely difficult to make a uniform guideline. The proportion of data can be driven by many factors, including the size of the original pool of samples and the total number of predictors. With a large pool of samples, the criticality of this decision is reduced once “enough” samples are included in the training set. Also, in this case, alternatives to a simple initial split of the data might be a good idea; see Section 3.4.7 below for additional details. The ratio of the number of samples (\(n\)) to the number of predictors (\(p\)) is important to consider, too. We will have much more flexibility in splitting the data when \(n\) is much greater than \(p\). However when \(n\) is less than \(p\), then we can run into modeling difficulties even if \(n\) is seemingly large.

There are a number of ways to split the data into training and testing sets. The most common approach is to use some version of random sampling. Completely random sampling is a straightforward strategy to implement and usually protects the process from being biased towards any characteristic of the data. However this approach can be problematic when the response is not evenly distributed across the outcome. A less risky splitting strategy would be to use a stratified random sample based on the outcome. For classification models, this is accomplished by selecting samples at random within each class. This approach ensures that the frequency distribution of the outcome is approximately equal within the training and test sets. When the outcome is numeric, artificial strata can be constructed based on the quartiles of the data. For example, in the Ames housing price data, the quartiles of the outcome distribution would break the data into four artificial groups containing roughly 230 houses. The training/test split would then be conducted within these four groups and the four different training set portions are pooled together (and the same for the test set).

Non-random sampling can also be used when there is a good reason. One such case would be when there is an important temporal aspect to the data. Here it may be prudent to use the most recent data as the test set. This is the approach used in the Chicago transit data discussed in Section 4.1.

There are other types of subsets, such as a validation set, but these are not explored until later.↩