3 A Review of the Predictive Modeling Process

Before diving into specific methodologies and techniques for modeling, there are necessary topics that should first be discussed and defined. These topics are fairly general with regard to empirical modeling and include: metrics for measuring performance for regression and classification problems, approaches for optimal data usage (including data splitting and resampling), best practices for model tuning, and recommendations for comparing model performance.

There are two data sets used in this chapter to illustrate the techniques. First is the Ames housing price data first introduced in Chapter 1. The second data set focuses on the classification of a person’s profession based on the information from an online dating site. These data are discussed in the next section.

3.1 Illustrative Example: OkCupid Profile Data

OkCupid is an online dating site that serves international users. Kim and Escobedo-Land (2015) describe a data set where over 50,000 profiles from the San Francisco area were made available14 and the data can be found in a GitHub repository15. The data contains several types of variables:

  • open text essays related to an individual’s interests and personal descriptions,
  • single choice type fields such as profession, diet, gender, body type, and education, and
  • multiple choice fields such as languages spoken and fluency in programming languages.

In their original form, almost all of the raw data fields are discrete in nature; only age was numeric. The categorical predictors were converted to dummy variables (see Chapter 5) prior to fitting models (in this chapter at least). For the analyses of these data in this chapter, the open text data will be ignored but will be probed later (see Section 5.5). Of the 307 predictors that were used, there were clusters of variables for geographic location (i.e. town, \(p = 66\)), religious affiliation (\(p = 13\)), astrological sign (\(p = 18\)), children (\(p = 15\)), pets (\(p = 15\)), income (\(p = 12\)), education (\(p = 31\)), body type (\(p = 12\)), diet (\(p = 17\)), and over 50 variables related to spoken languages. For more information on this data set and how it was processed, see the book’s GitHub repository.

For this demonstration, the goal will be to predict whether a person’s profession is in the STEM fields (science, technology, engineering, and math). There is a severe class imbalance in these data; only 18.5% of profiles work in these areas. While the imbalance has a significant impact on the analysis, the illustration presented here will mostly side-step this issue by down-sampling the instances such that the number of profiles in each class is equal. See Chapter 16 of Kuhn and Johnson (2013) for a detailed description of techniques for dealing with infrequent classes.

3.2 Measuring Performance

While often overlooked, the metric used to assess the effectiveness of a model to predict the outcome is very important and can influence the conclusions. The metric we select to evaluate model performance depends on the outcome, and the subsections below describe the main statistics that are used.

3.2.1 Regression Metrics

When the outcome is a number, the most common metric is the root mean squared error (RMSE). To calculate this value, we first build a model and then use it to predict the outcome. The residuals are the differences between the observed and predicted outcome values. To get the RMSE for a model, we compute the average of the squared residuals and then take the square root of this value. Taking the square root puts the metric back into the original measurement units. We can think of RMSE as the average distance between a sample’s observed value and its predicted value. Simply put, the lower the RMSE, the better a model can predict samples’ outcomes.

Another popular metric is the coefficient of determination, usually known as \(R^2\). There are several formulas for computing this value (Kvalseth 1985), but the most conceptually simple one finds the standard correlation between the observed and predicted values (a.k.a. \(R\)) and squares it. The benefit of this number is, for linear models, it has a straightforward interpretation: \(R^2\) is the proportion of the total variability in the outcome that can be explained by the model. A value near 1.0 indicates an almost perfect fit while values near zero result from a model where the predictions have no linear association with the outcome. One other advantage of this number is that it makes comparisons between different outcomes easy since it is unitless.
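
To make these two statistics concrete, here is a minimal sketch that computes RMSE and the “correlation squared” version of \(R^2\) from vectors of observed and predicted outcomes; the numbers below are hypothetical values, not results from the Ames or Chicago data.

```python
import numpy as np

# Hypothetical observed and predicted outcome values
observed  = np.array([10.2, 14.1,  9.8, 12.5, 15.0, 11.3])
predicted = np.array([11.0, 13.5, 10.4, 12.0, 14.2, 11.9])

# RMSE: square the residuals, average them, then take the square root
residuals = observed - predicted
rmse = np.sqrt(np.mean(residuals ** 2))

# R^2 via the simple "correlation squared" definition discussed above
r_squared = np.corrcoef(observed, predicted)[0, 1] ** 2

print(f"RMSE: {rmse:.3f}, R^2: {r_squared:.3f}")
```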

Unfortunately, \(R^2\) can be a deceiving metric. The main problem is that it is a measure of correlation and not accuracy. When assessing the predictive ability of a model, we need to know how well the observed and predicted values agree. It is possible, and not unusual, that a model could produce predicted values that have a strong linear relationship with the observed values but the predicted values do not conform to the 45 degree line of agreement. One example of this phenomenon occurs when a model under-predicts at one extreme of the outcome and overpredicts at the other extreme of the outcome. Tree-based ensemble methods (e.g. random forest, boosted trees, etc.) are notorious for these kinds of predictions. A second problem with using \(R^2\) as a performance metric is that it can show very optimistic results when the outcome has large variance. Finally, \(R^2\) can be misleading if there are a handful of outcome values that are far away from the overall scatter of the observed and predicted values. In this case the handful of points can artificially increase \(R^2\).

To illustrate the problems with \(R^2\), let’s look at the results of one particular model of the Chicago train ridership data. For this model \(R^2\) was estimated to be 0.92; at face value we may conclude that this is an extremely good model. However, the high value is mostly due to the inherent nature of the ridership numbers which are high during the workweek and correspondingly low on the weekends. The bimodal nature of the outcome inflates the outcome variance and, in turn, the \(R^2\). We can see the impacts of the bi-modal outcome in Figure 3.1 (a). Part (b) of the figure displays a histogram of the residuals, some of which are greater than 10K rides. The RMSE for this model is 2269 rides, which is somewhat large relative to the observed ridership values.

A second illustration of the problem of using \(R^2\) can be seen by examining the blue and black lines in Figure 3.1(a). The blue line is the linear regression fit between the observed and predicted values, while the black line represents the line of agreement. Here we can see that the model under-predicts the smaller observed values (left) and over-predicts the larger observed values (right). In this case, the offset is not huge but it does illustrate how the RMSE and \(R^2\) metrics can produce discordant results. For these reasons, we advise using RMSE instead of \(R^2\).

To address the problem that the correlation coefficient is overly optimistic when the data illustrate correlation but not agreement, Lawrence and Lin (1989) developed the concordance correlation coefficient (CCC). This metric provides a measure of correlation relative to the line of agreement and is defined as the product of the usual correlation coefficient and a measure of bias from the line of agreement. The bias coefficient ranges from 0 to 1, where a value of 1 indicates that the data fall on the line of agreement. The further the data deviate from the line of agreement, the smaller the bias coefficient. Therefore, the CCC can be thought of as a penalized version of the correlation coefficient. The penalty applies if the data exhibit poor correlation between the observed and predicted values or if the relationship between the observed and predicted values is far from the line of agreement.
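
A minimal sketch of the CCC, using its usual moment-based form (twice the covariance divided by the sum of the variances plus the squared mean difference); the example vectors are hypothetical.

```python
import numpy as np

def concordance_cc(observed, predicted):
    """Concordance correlation coefficient (Lin's CCC)."""
    mean_o, mean_p = np.mean(observed), np.mean(predicted)
    var_o, var_p = np.var(observed), np.var(predicted)   # population variances
    covariance = np.mean((observed - mean_o) * (predicted - mean_p))
    return 2 * covariance / (var_o + var_p + (mean_o - mean_p) ** 2)

observed  = np.array([10.2, 14.1,  9.8, 12.5, 15.0, 11.3])
predicted = np.array([11.0, 13.5, 10.4, 12.0, 14.2, 11.9])
print(round(concordance_cc(observed, predicted), 3))
```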

Fig. 3.1: Observed and predicted values of Chicago train ridership (in thousands, panel a) along with a histogram of the corresponding model residuals (observed minus predicted, panel b). In panel a, the blue line corresponds to the linear regression line between the observed and predicted values, while the black line represents the line of agreement. \(R^2\) is high due to the two clusters of days on the extremes of the ridership range. For this data, RMSE is a more realistic performance measure.

Both RMSE and \(R^2\) are very sensitive to extreme values because each is based on the squared value of the individual samples’ residuals. Therefore a sample with a large residual will have an inordinately large effect on the resulting summary measure. In general, this type of sample makes the model performance metric appear worse than it would be without the sample. Depending on the problem at hand, this characteristic is not necessarily a vice but could be a virtue. For example, if the goal of the modeling problem is to rank-order new data points (e.g. the highest spending customers), then the size of the residual is not an issue so long as the most extreme values are predicted to be the most extreme. However, it is more often the case that we are interested in predicting the actual response value rather than just the rank. In this case, we need metrics that are not skewed by one or just a handful of extreme values. The field of robustness was developed to study the effects of extreme values (i.e. outliers) on commonly used statistical metrics and to derive alternative metrics that achieve the same purpose but are less sensitive or insensitive to the impact of outliers (Hampel et al. 1972). As a broad description, robust techniques seek to find numerical summaries for the majority of the data. To lessen the impact of extreme values, robust approaches down-weight the extreme samples or transform the original values in a way that brings the extreme samples closer to the majority of the data. Rank-ordering the samples is one type of transformation that reduces the impact of extreme values. In the hypothetical case of predicting customers’ spending, rank correlation might be a better choice of metric for the model since it measures how well the predictions rank-order with their true values. This statistic computes the ranks of the data (e.g. 1, 2, etc.) and then computes the standard correlation statistic from these values. Other robust measures for regression are the median absolute deviation (MAD) (Rousseeuw and Croux 1993) and the absolute error.
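
As a hedged sketch of these robust alternatives, the code below computes a rank correlation and the MAD of the residuals. The simple double-argsort ranking ignores ties (a library routine such as scipy.stats.spearmanr handles ties properly), and the data are again hypothetical, with one extreme value added to show the effect.

```python
import numpy as np

def rank_correlation(observed, predicted):
    # Spearman-style rank correlation: rank both vectors, then compute the
    # usual Pearson correlation on the ranks. This double-argsort ranking
    # ignores ties.
    rank_obs  = np.argsort(np.argsort(observed))
    rank_pred = np.argsort(np.argsort(predicted))
    return np.corrcoef(rank_obs, rank_pred)[0, 1]

def median_absolute_deviation(residuals):
    # MAD of the residuals: the median distance from the median residual
    return np.median(np.abs(residuals - np.median(residuals)))

observed  = np.array([10.2, 14.1,  9.8, 12.5, 15.0, 48.0])  # one extreme value
predicted = np.array([11.0, 13.5, 10.4, 12.0, 14.2, 16.0])

print(round(rank_correlation(observed, predicted), 3))
print(round(median_absolute_deviation(observed - predicted), 3))
```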

3.2.2 Classification Metrics

Table 3.1: A confusion matrix for an OkCupid model. The columns are the true classes and the rows correspond to the predictions.
            truly stem    truly other
stem              5231           9261
other             1936          22381

When the outcome is a discrete set of values (i.e. qualitative data), there are two different types of performance metrics that can be utilized. The first type described below is based on qualitative class predictions (e.g. stem or other) while the second type uses the predicted class probabilities to measure model effectiveness (e.g. Pr[stem] = 0.254).

Given a set of predicted classes, the first step in understanding how well the model is working is to create a confusion matrix which is a simple cross-tabulation of the observed and predicted classes. For the OkCupid data, a simple logistic regression model was built using the predictor set mentioned above and Table 3.1 shows the resulting confusion matrix16.

The samples that were correctly predicted sit on the diagonal of the table. The STEM profiles mistakenly predicted as non-STEM are shown in the bottom left of the table (n = 1936) while the non-STEM profiles that were erroneously predicted are in the upper right cell (n = 9261). The most widely utilized metric is classification accuracy, which is simply the proportion of samples that were correctly predicted. In this example, the accuracy is 0.71 = (5231 + 22381)/(5231 + 9261 + 1936 + 22381). There is an implicit tendency to assess model performance by comparing the observed accuracy value to 1/C, where C is the number of classes. In this case, 0.71 is much greater than 0.5. However, this comparison should be made only when there are nearly the same number of samples in each class. When there is an imbalance between the classes, as there is in these data, accuracy can be a quite deceiving measure of model performance since a value of 0.82 can be achieved by predicting all profiles as non-STEM.

As an alternative to accuracy, another statistic called Cohen’s Kappa (Agresti 2012) can be used to account for class imbalances. This metric normalizes the error rate to what would be expected by chance. Kappa takes on values between -1 and 1 where a value of 1 indicates complete concordance between the observed and predicted values (and thus perfect accuracy). A value of -1 is complete discordance and is rarely seen17. Values near zero indicate that there is no relationship between the model predictions and the true results. The Kappa statistic can also be generalized to problems that have more than two groups.
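
As a concrete illustration, here is a minimal sketch that computes accuracy and Cohen’s Kappa directly from the cell counts of Table 3.1. The chance-agreement term uses the row and column totals; the resulting Kappa is illustrative output rather than a value reported in the text.

```python
import numpy as np

# Confusion matrix from Table 3.1: rows are predictions, columns are the truth
#                      truly stem  truly other
confusion = np.array([[5231,        9261],    # predicted stem
                      [1936,       22381]])   # predicted other

n = confusion.sum()
observed_accuracy = np.trace(confusion) / n

# Expected accuracy under chance agreement uses the row and column totals
row_totals = confusion.sum(axis=1)
col_totals = confusion.sum(axis=0)
expected_accuracy = np.sum(row_totals * col_totals) / n ** 2

kappa = (observed_accuracy - expected_accuracy) / (1 - expected_accuracy)
print(f"accuracy = {observed_accuracy:.2f}, kappa = {kappa:.2f}")
```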

A visualization technique that can be used for confusion matrices is the mosaic plot (see Figure 3.3). In these plots, each cell of the table is represented as a rectangle whose area is proportional to the number of values in the cell. These plots can be rendered in a number of different ways and for tables of many sizes. See Friendly and Meyer (2015) for more examples.

There are also specialized sets of classification metrics when the outcome has two classes. To use them, one of the class values must be designated as the event of interest. This is somewhat subjective. In some cases, this value might be the worst case scenario (i.e. death) but the designated event should be the value that one is most interested in predicting.

The first paradigm of classification metrics focuses on false positives and false negatives and is most useful when there is interest in comparing the two types of errors. The metric sensitivity is simply the proportion of the events that were predicted correctly and is the true positive rate in the data. For our example,

\[ sensitivity = \frac{\text{# truly STEM predicted correctly}}{\text{# truly STEM}} = 5,231/7,167 = 0.73\]

The false positive rate is associated with the specificity, which is

\[ specificity = \frac{\text{# truly non-STEM predicted correctly}}{\text{# truly non-STEM}} = 22,381/31,642 = 0.707\]

The false positive rate is 1 - specificity (0.293 in this example).

The other paradigm for the two class system is rooted in the field of information retrieval where the goal is to find the events. In this case, the metrics commonly used are precision and recall. Recall is equivalent to sensitivity and focuses on the number of true events found by the model. Precision is the proportion of events that are predicted correctly out of the total number of predicted events, or

\[ precision = \frac{\text{# truly STEM predicted correctly}}{\text{# predicted STEM}} = 5,231/14,492 = 0.361\]
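
The three statistics above translate directly into code. The short sketch below reproduces the sensitivity, specificity, and precision values from the cell counts of Table 3.1.

```python
# Cell counts from Table 3.1 (rows: predictions, columns: truth)
true_stem_pred_stem   = 5231   # true positives
true_other_pred_stem  = 9261   # false positives
true_stem_pred_other  = 1936   # false negatives
true_other_pred_other = 22381  # true negatives

sensitivity = true_stem_pred_stem / (true_stem_pred_stem + true_stem_pred_other)      # 5231 / 7167
specificity = true_other_pred_other / (true_other_pred_other + true_other_pred_stem)  # 22381 / 31642
precision   = true_stem_pred_stem / (true_stem_pred_stem + true_other_pred_stem)      # 5231 / 14492

print(f"sensitivity = {sensitivity:.3f}")   # about 0.730
print(f"specificity = {specificity:.3f}")   # about 0.707
print(f"precision   = {precision:.3f}")     # about 0.361
```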

One facet of sensitivity, specificity, and precision that is worth understanding is that they are conditional statistics. For example, sensitivity reflects the probability that an event is correctly predicted given that a sample is truly an event. The latter part of this sentence shows the conditional nature of the metric. Of course, the true class is usually unknown and, if it were known, a model would not be needed. In any case, if Y denotes the true class and P denotes the prediction, we could write sensitivity as Pr[P = STEM|Y = STEM].

The question that one really wants answered is “if my value was predicted to be an event, what are the chances that it truly is an event?” or Pr[Y = STEM|P = STEM]. Thankfully, the field of Bayesian analysis (McElreath 2016) has an answer to this question. In this context, Bayes’ Rule states that

\[Pr[Y|P] = \frac{Pr[Y] \times Pr[P|Y]}{Pr[P]} = \frac{Prior \times Likelihood}{Evidence}\]

Sensitivity (or specificity, depending on one’s point of view) is the “likelihood” part of this equation. The prior probability, or prevalence, is the overall rate at which we see events in the wild (which may be different from what was observed in our training set). Usually, one would specify the overall event rate before data are collected and use it in the computations to determine the unconditional statistics. For sensitivity, the unconditional analog is called the positive predictive value (PPV):

\[PPV = \frac{sensitivity \times prevalence}{(sensitivity\times prevalence) + ((1-specificity)\times (1-prevalence))}\]

The negative predictive value (NPV) is the analog to specificity and can be computed as

\[NPV = \frac{specificity \times (1-prevalence)}{((1-sensitivity)\times prevalence) + (specificity\times (1-prevalence))}\]
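
These two formulas are simple functions of the sensitivity, specificity, and an assumed prevalence. The sketch below reproduces the PPV and NPV values discussed later in this section for prevalences of 0.18 and 0.05.

```python
def ppv(sensitivity, specificity, prevalence):
    numerator = sensitivity * prevalence
    return numerator / (numerator + (1 - specificity) * (1 - prevalence))

def npv(sensitivity, specificity, prevalence):
    numerator = specificity * (1 - prevalence)
    return numerator / ((1 - sensitivity) * prevalence + numerator)

sens, spec = 5231 / 7167, 22381 / 31642   # from the OkCupid confusion matrix

# Prevalence estimated from the training set (San Francisco area)
print(ppv(sens, spec, 0.18), npv(sens, spec, 0.18))  # roughly 0.36 and 0.92
# A hypothetical national prevalence of 5%
print(ppv(sens, spec, 0.05), npv(sens, spec, 0.05))  # roughly 0.12 and 0.98
```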

See DG Altman and Bland (1994b) for a clear and concise discussion of these measures. Also, simplified versions of these formulas are often shown for these statistics that assume the prevalence to be 0.50. These formulas, while correct when prevalence is 0.50, can produce very misleading results if the prevalence is different from this number.

For the OkCupid data, the difference between the sensitivity and the PPV can be summarized as:

  • sensitivity: if the profile is truly STEM, what is the probability that it is correctly predicted?
  • PPV: if the profile was predicted as STEM, what is the probability that it is STEM?

The positive and negative predictive values are not often used to measure performance. This is partly due to the nature of the prevalence. If the outcome is not well understood, it is very difficult to provide a value (even when asking experts). When there is a sufficient amount of data, the prevalence is typically estimated by the proportion of the outcome data that correspond to the event of interest. Also, in other situations, the prevalence may depend on certain factors. For example, the proportion of STEM profiles in the San Francisco area can be estimated from the training set to be 0.18. Using this value as the prevalence, our estimates are PPV = 0.36 and NPV = 0.92. The PPV is significantly smaller than the sensitivity due to the model missing almost 27% of the true STEM profiles and the fact that the overall likelihood of being in the STEM fields is already fairly low.

The prevalence of people in STEM professions in San Francisco is likely to be larger than in other parts of the country. If we thought that the overall STEM prevalence in the United States were about 5%, then our estimates would change to PPV = 0.12 and NPV = 0.98. These computations only differ by the prevalence estimates and demonstrate how the smaller prevalence affects the unconditional probabilities of our results.

The metrics discussed so far depend on having a hard prediction (e.g. STEM or other). Most classification models can produce class probabilities as soft predictions that can be converted to a definitive class by choosing the class with the largest probability. There are a number of metrics that can be created using the probabilities.

For a two-class problem, an example metric is the binomial log-likelihood statistic. To illustrate this statistic, let \(i\) represent the index of the samples where \(i=1, 2, \ldots, n\), and let \(j\) index the outcome classes where \(j=1, 2\)18. Next, we will use \(y_{ij}\) to represent the indicator of the true class of the \(i^{th}\) sample. That is, \(y_{ij} = 1\) if the \(i^{th}\) sample is in the \(j^{th}\) class and 0 otherwise. Finally, let \(p_{ij}\) represent the predicted probability of the \(i^{th}\) sample in the \(j^{th}\) class. Then the log-likelihood is calculated as

\[ \log \ell = \sum_{i=1}^n \sum_{j=1}^C y_{ij} \log(p_{ij}), \]

where \(C\) = 2 for the two-class problem. In general, we want to maximize the log-likelihood. This value will be maximized if all samples are predicted with high probability to be in the correct class.

Two other metrics that are commonly computed on class probabilities are the Gini criterion (Breiman et al. 1984)

\[ G = \sum_{i=1}^n \sum_{j \ne j'} p_{ij} p_{ij'} \]

and entropy (\(H\)) (MacKay 2003):

\[ H = -\sum_{i=1}^n \sum_{j=1}^C p_{ij} \log_2p_{ij} \]

Unlike the log-likelihood statistic, both of these metrics are measures of variance or impurity in the class probabilities19 and should be minimized.

Table 3.2: A comparison of typical probability-based measures used for classification models. The calculations presented here assume that Class 1 is the true class. The “Good” model has the highest log-likelihood and the lowest Gini and Entropy. However, Gini and Entropy have the same values for the “Good” and “Bad” model, which illustrates a problem with these metrics.
                  Probabilities          Statistics
                  Class 1   Class 2      Log-Likelihood   Gini   Entropy
Equivocal Model   0.5       0.5          -0.693           0.25   1.000
Good Model        0.8       0.2          -0.223           0.16   0.722
Bad Model         0.2       0.8          -1.609           0.16   0.722

Of these three metrics, it is important to note that the likelihood statistic is the only one to use the true class information. Because of this, it penalizes poor models in a supervised manner. The Gini and entropy statistics would only penalize models that are equivocal (i.e. produce roughly equal class probabilities). For example, Table 3.2 shows a two-class example. If the true outcome was the first class, the model results shown in the second row would be best. The likelihood statistic only takes into account the column called “Class 1” since that is the only column where \(y_{ij} = 1\). In terms of the likelihood statistic, the equivocal model does better than the model that confidently predicts the wrong class. When considering Gini and entropy, the equivocal model does worst while the good and bad model are equivalent20.
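
A minimal sketch that reproduces the values in Table 3.2. Note that the Gini sum here is taken over unordered pairs of classes, which matches the convention used in the table.

```python
import numpy as np

def log_likelihood(y, p):
    # y and p are vectors of 0/1 indicators and class probabilities for one sample
    return np.sum(y * np.log(p))

def gini(p):
    # Sum of probability products over unordered pairs of classes
    return sum(p[j] * p[k] for j in range(len(p)) for k in range(j + 1, len(p)))

def entropy(p):
    return -np.sum(p * np.log2(p))

truth = np.array([1, 0])  # Class 1 is the true class
for name, probs in [("equivocal", [0.5, 0.5]), ("good", [0.8, 0.2]), ("bad", [0.2, 0.8])]:
    p = np.array(probs)
    print(name, round(log_likelihood(truth, p), 3), round(gini(p), 2), round(entropy(p), 3))
```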

Fig. 3.2: The class probability distributions for the OkCupid data. The top panel corresponds to the true STEM profiles while the bottom shows the probabilities for the non-STEM test set samples.

When there are two classes, one advantage that the log-likelihood has over metrics based on a hard prediction is that it sidesteps the issue of the appropriateness of the probability cutoff. For example, when discussing accuracy, sensitivity, specificity, and other measures there is the implicit assumption that the probability cutoff used to go from a soft prediction to a hard prediction is valid. This is often not the case, especially when the data have a severe class imbalance21. Consider the OkCupid data and the logistic regression model that was previously discussed. The class probability estimates that were used to make the definitive predictions contained in Table 3.1 are shown in Figure 3.2, where the top panel contains the profiles that were truly STEM and the bottom panel has the class probability distribution for the other profiles22. The common 50% cutoff was used to create the original table of observed by predicted classes. Table 3.1 can also be visualized using a mosaic plot such as the one shown in Figure 3.3(b), where the sizes of the blocks are proportional to the amount of data in each cell. What would happen to this table if we were more permissive about the level of evidence needed to call a profile STEM? Instead of using a 50% cutoff, we might lower the threshold for the event to 20%. In this instance, we would call more profiles STEM overall. This might raise sensitivity since the true STEM profiles are more likely to be correctly predicted, but the cost is an increase in the number of false positives. The mosaic plot for this confusion matrix is shown in Figure 3.3(a), where the blue block in the upper left becomes larger but there is also an increase in the red block in the lower right. In doing so, we increase the sensitivity from 0.73 to 0.95 but at the cost of specificity dropping from 0.71 to 0.29. Increasing the level of evidence needed to predict a STEM profile to 80% has the opposite effect, as shown in Figure 3.3(c). Here, specificity improves but sensitivity is undermined.

Fig. 3.3: Mosaic plots of a confusion matrix under different cutoffs.

The question then becomes “what probability cutoff should be used?” This depends on a number of things, including which error (false positive or false negative) hurts the most. However, if both types of errors are equally bad, there may be cutoffs that do better than the default.

The receiver operating characteristic (ROC) (DG Altman and Bland 1994a) curve can be used to alleviate this issue. It considers all possible cutoffs and tracks the changes in sensitivity and specificity. The curve is constructed by plotting the false positive rate (1 - specificity) versus the true positive rate. The ROC curve for the OkCupid data is shown in Figure 3.4(a). The best model is one that hugs the y-axis and directly proceeds to the upper left corner (where neither type of error is made) while a completely ineffective model’s curve would track along the diagonal line shown in grey. This curve allows the user to do two important tasks. First, an appropriate cutoff can be determined based on one’s expectations regarding the importance of either sensitivity or specificity. This cutoff can then be used to make the qualitative predictions. Second, and perhaps more importantly, it allows a model to be assessed without having to identify the best cutoff. Commonly, the area under the ROC curve (AUC) is used to evaluate models. If the best model immediately proceeds to the upper left corner, the area under this curve would be one, while a completely ineffective model would produce an AUC in the neighborhood of 0.50. Caution should be used though since two curves for two different models may cross; this indicates that there are areas where one model does better than the other. Used as a summary measure, the AUC annihilates any subtleties that can be seen in the curves. For the curve in Figure 3.4(a), the AUC was 0.79, indicating a moderately good fit.

From the information retrieval point of view, the precision-recall curve is more appropriate (Manning, Raghavan, and Schütze 2008). This is similar to the ROC curve in that the two statistics are calculated over every possible cutoff in the data. For the OkCupid data, the curve is shown in Figure 3.4(b). A poor model would result in a precision-recall curve that is in the vicinity of the horizontal grey line at the value of the observed prevalence (0.18 here). The area under the curve is again used as a summary and the best possible value is again 1.0. The area under this curve is 0.482.
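
To make the cutoff-sweeping idea concrete, here is a minimal sketch (with simulated labels and probabilities, not the OkCupid model output) that computes ROC points over all cutoffs and approximates the AUC with the trapezoidal rule. Computing precision instead of the false positive rate at each cutoff would trace out the precision-recall curve instead.

```python
import numpy as np

def roc_points(labels, probs):
    """Sensitivity and false positive rate at every candidate probability cutoff."""
    points = [(0.0, 0.0)]                      # the most stringent cutoff: nothing is called an event
    for cutoff in np.unique(probs):
        predicted_event = probs >= cutoff
        sens = np.mean(predicted_event[labels == 1])   # true positive rate
        fpr = np.mean(predicted_event[labels == 0])    # 1 - specificity
        points.append((fpr, sens))
    return np.array(points)

# Hypothetical class labels (1 = event) and noisy predicted event probabilities
rng = np.random.default_rng(1)
labels = rng.binomial(1, 0.18, size=500)
probs = np.clip(0.25 * labels + rng.uniform(0, 0.75, size=500), 0, 1)

pts = roc_points(labels, probs)
order = np.argsort(pts[:, 0])
auc = np.trapz(pts[order, 1], pts[order, 0])   # trapezoidal area under the ROC curve
print(round(auc, 3))
```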

Fig. 3.4: An ROC curve (a) and a precision-recall curve (b).

During the initial phase of model building, a good strategy for data sets with two classes is to focus on the AUC statistics from these curves instead of metrics based on hard class predictions. Once a reasonable model is found, the ROC or precision-recall curves can be carefully examined to find a reasonable cutoff for the data and then qualitative prediction metrics can be used.

3.2.3 Context-Specific Metrics

While the metrics discussed previously can be used to develop effective models, they may not answer the underlying question of interest. As an example, consider a scenario where we have collected data on customer characteristics and whether or not the customers clicked on an ad. Our goal may be to relate customer characteristics to the probability of a customer clicking on an ad. Several of the metrics described above would enable us to assess model performance if this were the goal. Alternatively, we may be more interested in answering “how much money will my company make if this model is used to predict who will click on an ad?” In another context, we may be interested in building a model to answer the question “what is my expected profit when the model is used to determine if this customer will repay a loan?” These questions are very context specific and do not directly fit into the previously described metrics.

Take the loan example. If a loan is requested for \(\$M\), can we compute the expected profit (or loss)? Let’s assume that our model is created on an appropriate data set and can produce a class probability \(P_r\) that the loan will be paid on time. Given these quantities, the interest rate, fees, and other known factors, the gross return on the loan can be computed for each data point and the average can then be used to optimize the model.

Therefore, we should let the question of interest lead us to an appropriate metric for assessing a model’s ability to answer the question. We may be able to use common, existing metrics. Or we may need to develop custom metrics for our context. See Chapter 16 of Kuhn and Johnson (2013) for additional discussion.

3.3 Data Splitting

One of the first decisions to make when starting a modeling project is how to utilize the existing data. One common technique is to split the data into two groups typically referred to as the training and testing sets23. The training set is used to develop models and feature sets; they are the substrate for estimating parameters, comparing models, and all of the other activities required to reach a final model. The test set is used only at the conclusion of these activities for estimating a final, unbiased assessment of the model’s performance. It is critical that the test set not be used prior to this point. Looking at its results will bias the outcomes since the testing data will have become part of the model development process.

How much data should be set aside for testing? It is extremely difficult to make a uniform guideline. The proportion of data can be driven by many factors, including the size of the original pool of samples and the total number of predictors. With a large pool of samples, the criticality of this decision is reduced once “enough” samples are included in the training set. Also, in this case, alternatives to a simple initial split of the data might be a good idea; see Section 3.4.6 below for additional details. The ratio of the number of samples (\(n\)) to the number of predictors (\(p\)) is important to consider, too. We will have much more flexibility in splitting the data when \(n\) is much greater than \(p\). However when \(n\) is less than \(p\), then we can run into modeling difficulties even if \(n\) is seemingly large.

There are a number of ways to split the data into training and testing sets. The most common approach is to use some version of random sampling. Completely random sampling is a straightforward strategy to implement. However this approach can be problematic when the response is not evenly distributed across the outcome. A less risky splitting strategy would be to use a stratified random sample based on the outcome. For classification models, this is accomplished by selecting samples at random within each class. This approach ensures that the frequency distribution of the outcome is approximately equal within the training and test sets. When the outcome is numeric, artificial strata can be constructed based on the quartiles of the data. For example, in the Ames housing price data, the quartiles of the outcome distribution would break the data into four artificial groups containing roughly 230 houses. The training/test split would then be conducted within these four groups and the four different training set portions are pooled together (and the same for the test set).
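
A minimal sketch of stratified splitting under the scheme described above: classes are used directly as strata for categorical outcomes, and quartile-based artificial strata are used for numeric outcomes. The function name, proportions, and example outcome are hypothetical choices, not the book’s implementation.

```python
import numpy as np

def stratified_split(outcome, test_prop=0.25, seed=42):
    """Return train/test row indices, stratifying on the outcome."""
    rng = np.random.default_rng(seed)
    outcome = np.asarray(outcome)
    if outcome.dtype.kind in "OUS":                 # string-like: use the classes as strata
        strata = outcome
    else:                                           # numeric: bin by quartiles
        quartiles = np.quantile(outcome, [0.25, 0.5, 0.75])
        strata = np.searchsorted(quartiles, outcome)
    train_idx, test_idx = [], []
    for value in np.unique(strata):
        rows = rng.permutation(np.where(strata == value)[0])
        n_test = int(round(test_prop * len(rows)))
        test_idx.extend(rows[:n_test])              # held out within this stratum
        train_idx.extend(rows[n_test:])             # pooled into the training set
    return np.array(train_idx), np.array(test_idx)

# Example with a hypothetical numeric outcome (e.g., sale prices)
prices = np.random.default_rng(0).lognormal(mean=12, sigma=0.4, size=1000)
train_rows, test_rows = stratified_split(prices)
```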

Non-random sampling can also be used when there is a good reason. One such case would be when there is an important temporal aspect to the data. Here it may be prudent to use the most recent data as the test set. This is the approach used in the Chicago transit data discussed in Section 4.1.

3.4 Resampling

As previously discussed, there are times where there is the need to understand the effectiveness of the model without resorting to the test set. Simply repredicting the training set is problematic so a procedure is needed to get an appraisal using the training set. Resampling methods will be used for this purpose.

Resampling methods can generate different versions of our training set that can be used to simulate how well models would perform on new data. These techniques differ in terms of how the resampled versions of the data are created and how many iterations of the simulation process are conducted. In each case, a resampling scheme generates a subset of the data to be used for modeling and another that is used for measuring performance. Here, we will refer to the former as the “analysis set” and the latter as the “assessment set”. They are roughly analogous to the training and test sets described at the beginning of the chapter24. A graphic of an example data hierarchy with three resamples is shown in Figure 3.5.

Fig. 3.5: A diagram of typical data usage with three resamples of the training data.

There are a number of different flavors of resampling that will be described in the next four sections.

3.4.1 V-Fold Cross-Validation and Its Variants

Simple V-fold cross-validation creates V different versions of the original training set that have the same approximate size. Each of the V assessment sets contains 1/V of the training set (these non-overlapping subsets are typically called the “folds”) and each excludes different data points. The analysis sets contain the remainder. Suppose V = 10; then there are 10 different versions of 90% of the data and also 10 versions of the remaining 10% for each corresponding resample.

To use V-fold cross-validation, a model is created on the first fold and the corresponding assessment set is predicted by the model. The assessment set is summarized using the chosen performance measures (e.g. RMSE, the area under the ROC curve, etc.) and these statistics are saved. This process proceeds in a round-robin fashion so that, in the end, there are V estimates of performance for the model and each was calculated on a different assessment set. The cross-validation estimate of performance is computed by averaging the V individual metrics.
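
A minimal sketch of this round-robin procedure. The `fit` and `metric` arguments are placeholders for the caller’s modeling and scoring code, and `X` and `y` are assumed to be NumPy arrays; this is not the implementation used for the analyses in the book.

```python
import numpy as np

def v_fold_cv(X, y, fit, metric, V=10, seed=42):
    """Estimate performance by V-fold cross-validation.

    `fit(X, y)` must return an object with a .predict(X) method and
    `metric(y, pred)` must return a single number; both are assumptions
    about the caller's modeling code.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), V)
    scores = []
    for assessment_rows in folds:
        analysis_rows = np.setdiff1d(np.arange(len(y)), assessment_rows)
        model = fit(X[analysis_rows], y[analysis_rows])    # fit on the analysis set
        preds = model.predict(X[assessment_rows])          # predict the held-out fold
        scores.append(metric(y[assessment_rows], preds))
    return np.mean(scores), np.std(scores)
```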

When the outcome is categorical, stratified splitting techniques can also be applied here to make sure that the analysis and assessment sets produce the same frequency distribution of the outcome. Again, this is a good idea when the classes are imbalanced but is unlikely to be problematic otherwise.

For example, for the OkCupid data, stratified 10-fold cross-validation was used. The training set consists of 38,809 profiles and each of the 10 assessment sets contains 3,881 different profiles. The area under the ROC curve was used to measure performance of the logistic regression model previously mentioned. The 10 areas under the curve ranged from 0.778 to 0.804 and their average value was 0.789. Without using the test set, we can use this statistic to forecast how this model would perform on new data.

As will be discussed in Section 3.4.5, resampling methods have different characteristics. One downside to basic V-fold cross-validation is that it is relatively noisy compared to other resampling schemes. One way to compensate for this is to conduct repeated V-fold cross-validation. If R repeats are used, V resamples are created R separate times and, in the end, RV resamples are averaged to estimate performance. Since more data are being averaged, the standard error of the final average would decrease by a factor of \(\sqrt{R}\) (using a Gaussian approximation25). Again, the noisiness of this procedure is relative and, as one might expect, is driven by the amount of data in the assessment set. For the OkCupid data, the area under the ROC curve was computed from 3,881 profiles and is likely to yield sufficiently precise estimates (even if we only expect about 716 of them to be STEM profiles).

The assessment sets can be used for model validation and diagnosis. Table 3.1 and Figure 3.2 use these holdout predictions to visualize model performance. Also, Section 4.4 has a more extensive description of how the assessment datasets can be used to drive improvements to models.

One other variation, leave-one-out cross-validation, has V equal to the size of the training set. This is a somewhat deprecated technique and may only be useful when the training set size is extremely small (Shao 1993).

Figure 3.6 shows a diagram of 10-fold cross-validation for a hypothetical data set with 20 training set samples. For each resample, two different training set data points are held out for the assessment set. Note that the assessment sets are mutually exclusive and contain different instances.

3.4.2 Monte Carlo Cross-Validation

V-fold cross-validation produces V sets of splits with mutually exclusive assessment sets. Monte Carlo resampling produces splits that are likely to contain overlap. For each resample, a random sample is taken with \(\pi\) proportion of the training set going into the analysis set and the remaining samples allocated to the assessment set. Like the previous procedure, a model is created on the analysis set and the assessment set is used to evaluate the model. This splitting procedure is conducted B times and the average of the B results is used to estimate future performance. B is chosen to be large enough so that the average of the B values has an acceptable amount of precision.
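
A short sketch of how the B random splits could be generated; the proportion \(\pi\) and number of resamples are arbitrary illustrative choices.

```python
import numpy as np

def monte_carlo_cv(n, pi=0.90, B=50, seed=42):
    """Yield (analysis_rows, assessment_rows) index pairs for B random splits."""
    rng = np.random.default_rng(seed)
    n_analysis = int(round(pi * n))
    for _ in range(B):
        shuffled = rng.permutation(n)
        yield shuffled[:n_analysis], shuffled[n_analysis:]

# Each resample would be modeled and assessed, then the B metrics averaged
splits = list(monte_carlo_cv(n=20, pi=0.90, B=10))
```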

Fig. 3.6: A diagram of two types of cross-validation for a training set containing 20 samples.

Figure 3.6 also shows Monte Carlo cross-validation with 10 resamples and \(\pi = 0.90\). Note that, unlike 10-fold cross-validation, some of the same data points are used in different assessment sets.

3.4.3 The Bootstrap

A bootstrap resample of the data is defined to be a simple random sample that is the same size as the training set where the data are sampled with replacement (Davison and Hinkley 1997). This means that when a bootstrap resample is created, each training set member has a 63.2% chance of being included in the bootstrap sample at least once. The bootstrap resample is used as the analysis set and the assessment set, sometimes known as the out-of-bag sample, consists of the members of the training set not included in the bootstrap sample. As before, bootstrap sampling is conducted B times and the same modeling/evaluation procedure is followed to produce a bootstrap estimate of performance that is the mean of the B results.
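
A minimal sketch of generating bootstrap resamples and their out-of-bag assessment sets; the sizes and seed are illustrative.

```python
import numpy as np

def bootstrap_resamples(n, B=50, seed=42):
    """Yield (analysis_rows, out_of_bag_rows) for B bootstrap resamples."""
    rng = np.random.default_rng(seed)
    all_rows = np.arange(n)
    for _ in range(B):
        analysis_rows = rng.choice(all_rows, size=n, replace=True)  # sample with replacement
        out_of_bag = np.setdiff1d(all_rows, analysis_rows)          # assessment ("out-of-bag") set
        yield analysis_rows, out_of_bag

# On average, roughly 63.2% of the rows appear in each analysis set at least once
analysis, oob = next(bootstrap_resamples(n=20))
```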

Fig. 3.7: A diagram of bootstrap resampling for a training set containing 20 samples. The colors represent how many times a data point is replicated in the analysis set.

Figure 3.7 shows an illustration of ten bootstrap samples created from a 20 sample data set. The colors show that several training set points are selected multiple times for the analysis set. The assessment set would consist of the rows that have no color.

3.4.4 Rolling Origin Forecasting

This procedure is specific to time-series data or any data set with a strong temporal component (Hyndman and Athanasopoulos 2013). If there are seasonal or other chronic trends in the data, random splitting of the data between the analysis and assessment sets may disrupt the model’s ability to estimate these patterns.

In this scheme, the first analysis set consists of the first M training set points, assuming that the training set is ordered by time or other ordered temporal component. The assessment set would consist of the next N training set samples. The second resample keeps the data set sizes the same but starts the splitting process at the second training set member. The splitting scheme proceeds until there is no more data to produce the same data set sizes. Supposing that this results in B splits of the data, the same process is used for modeling and evaluation and, in the end, there are B estimates of performance generated from each of the assessment sets. A simple method for estimating performance is to again use simple averaging of these data. However, it should be understood that, since there is significant overlap in the rolling assessment sets, the B samples themselves constitute a time-series and might also display seasonal or other temporal effects.

Fig. 3.8: A diagram of rolling origin forecasting resampling.

Figure 3.8 shows this type of resampling where 10 data points are used for analysis and the subsequent two training set samples are used for assessment.
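
A minimal sketch of generating the rolling-origin indices for time-ordered data. The window sizes match the illustrative values from Figure 3.8, and the `skip` argument anticipates one of the variations listed below.

```python
def rolling_origin(n, analysis_size=10, assessment_size=2, skip=0):
    """Yield (analysis_rows, assessment_rows) for time-ordered data."""
    start = 0
    while start + analysis_size + assessment_size <= n:
        analysis = list(range(start, start + analysis_size))
        assessment = list(range(start + analysis_size,
                                start + analysis_size + assessment_size))
        yield analysis, assessment
        start += 1 + skip   # move the origin forward (skip > 0 produces fewer resamples)

splits = list(rolling_origin(n=20, analysis_size=10, assessment_size=2))
```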

There are a number of variations of the procedure:

  • The analysis set need not be the same size. It can cumulatively grow as the moving window proceeds along the training set. In other words, the first analysis set would contain M data points, the second would contain M + 1 and so on. This is the approach taken with the Chicago train data modeling and is described in Chapter 4.
  • The splitting procedure could skip iterations to produce fewer resamples. For example, in the Chicago data, there are daily measurements from 2001 to 2016. Incrementing by one day would produce an excessive value of B. For these data, 13 samples were skipped so that the splitting window moves in two-week blocks instead of by individual day.
  • If the training data are unevenly sampled, the same procedure can be used but moves over time increments rather than data set row increments. For example, the window could move over 12 hour periods for the analysis sets and 2 hour periods for the assessment sets.

This resampling method differs from the previous ones in at least two ways: the splits are not random, and the assessment set is not the remainder of the training set once the analysis set is removed.

3.4.5 Variance and Bias in Resampling

In Section 1.2.5, variance and bias properties of models were discussed. Resampling methods have the same properties but their effects manifest in different ways. Variance is more straightforward to conceptualize. If you were to conduct 10-fold cross-validation many times on the same data set, the variation in the resampling scheme could be measured by determining the spread of the resulting averages. This variation could be compared to a different scheme that was repeated the same number of times (and so on) to get relative comparisons of the amount of noise in each scheme.

Bias is the ability of a particular resampling scheme to hit the true underlying performance parameter (that we will never truly know). Generally speaking, as the amount of data in the analysis set shrinks, the resampling estimate’s bias increases. In other words, the bias in 10-fold cross-validation is smaller than the bias in 5-fold cross-validation. For Monte Carlo resampling, this obviously depends on the value of \(\pi\). However, through simulations, one can see that 10-fold cross-validation has less bias than Monte Carlo cross-validation when \(\pi = 0.10\) and B = 10 are used. Leave-one-out cross-validation would have very low bias since each of its analysis sets differs from the training set by only a single sample.

Figure 3.9 contains a graphical representation of variance and bias in resampling schemes where the curves represent the distribution of the resampling statistics if the same procedure were conducted on the same data set many times. Four possible variance/bias cases are represented. We will assume that the model metric being measured here is better when the value is large (such as \(R^2\) or sensitivity) and that the true value is represented by the green vertical line. The upper right panel demonstrates a pessimistic bias since the values tend to be smaller than the true value, while the panel below it in the lower right shows a resampling scheme that has relatively low variance and the center of its distribution is on target with the true value (so that it is nearly unbiased).

In general, for a fixed training set size and number of resamples, simple V-fold cross-validation is generally believed to be the noisiest of the methods discussed here and the bootstrap is the least variable26. The bootstrap is understood to be the most biased (since about 36.8% of the training set is selected for assessment) and its bias is generally pessimistic (i.e. likely to show worse model performance than the true underlying value). There have been a few attempts at correcting the bootstrap’s bias such as Efron (1983) and Efron and Tibshirani (1997).

While V-fold cross-validation does have inflated variance, its bias is fairly low when V is 10 or more. When the training set is not large, we recommend using five or so repeats of 10-fold cross-validation, depending on the required precision, the training set size, and other factors.

Fig. 3.9: An illustration of the variance-bias differences in resampling schemes.

3.4.6 What Should Be Included Inside of Resampling?

In the preceding descriptions of resampling schemes, we have said that the analysis set is used to “build the model”. This is somewhat of a simplification. In order for any resampling scheme to produce performance estimates that generalize to new data, it must contain all the steps in the modeling process that could significantly affect the model’s effectiveness. For example, in Section 1.1, a transformation procedure was used to modify the predictor variables and this resulted in an improvement in performance. During resampling, this step should be included in the resampling loop. Other preprocessing or filtering steps (such as PCA signal extraction, predictor correlation filters, or feature selection methods) must be part of the resampling process in order to understand how well the model is doing and to measure when the modeling process begins to overfit.

There are some operations that can be exempted. For example, in Chapter 2, only a handful of patients had missing values and these were imputed using the median. For such a small modification, we did not include these steps inside of resampling. In general though, imputation can have a substantial impact on performance and its variability should be propagated into the resample results. Centering and scaling can also be exempted from resampling, all other things being equal.

As another example, the OkCupid training data were downsampled so that the class proportions were equal. This is a substantial data processing step and it is important to propagate the effects of this procedure through the resampling results. For this reason, the downsampling procedure is executed on the analysis set of every resample and then again on the entire training set when the final model is created.

One other aspect of resampling is related to the concept of information leakage which is where the test set data are used (directly or indirectly) during the training process. This can lead to overly optimistic results that do not replicate on future data points and can occur in subtle ways.

For example, suppose a data set of 120 samples has a single predictor and is sequential in nature, such as a time series. If the predictor is noisy, one method for preprocessing the data is to apply a moving average to the predictor data and replace the original data with the smoothed values. Suppose a 3-point average is used and that the first and last data points retain their original values. Our inclination would be to smooth the whole sequence of 120 data points with the moving average and then split the data into training and test sets (say, with the first 100 rows in training). The subtle issue here is that the 100th data point, the last in the training set, uses the first point in the test set to compute its 3-point average. Often the most recent data are most related to the next value in the time series. Therefore, including the most recent point will likely bias model performance.

To provide a solid methodology, one should constrain oneself to developing the list of preprocessing techniques, estimating them only in the presence of the training data points, and then applying the techniques to future data (including the test set). Arguably, the moving average issue cited above is most likely minor in terms of consequences, but it illustrates how easily the test data can creep into the modeling process. The correct approach for this preprocessing technique would be to split the data first, then apply the moving average smoother to the training and test sets independently.

Another, more overt, path to information leakage can sometimes be seen in machine learning competitions where the training and test set data are given at the same time. While the test set data often have the outcome data blinded, it is possible to “train to the test” by only using the training set samples that are most similar to the test set data. This may very well improve the model’s performance scores for this particular test set but might ruin the model for predicting on a broader data set.

Finally, with large amounts of data, alternative data usage schemes might be a good idea. Instead of a simple training/test split, multiple splits can be created for specific purposes. For example, a specific split of the data can be used to determine the relevant predictors for the model prior to model building using the training set. This would reduce the need to include the expensive feature selection steps inside of resampling. The same approach could be used for post-processing activities, such as determining an appropriate probability cutoff from a receiver operating characteristic curve.

3.5 Tuning Parameters and Overfitting

Many models include parameters that, while important, cannot be directly estimated from the data. The tuning parameters (sometimes called hyperparameters27) are important since they often control the complexity of the model and thus also affect any variance-bias trade-off that can be made.

As an example, the K-nearest neighbor model stores the training set data and, when predicting new samples, locates the K training set points that are in the closest proximity to the new sample. Using the training set outcomes for the neighbors, a prediction is made. The number of neighbors controls the complexity and, in a manner very similar to the moving average discussion in Section 1.2.5, controls the variance and bias of the models. When K is very small, there is the most potential for overfitting since only a few values are used for prediction and these are most susceptible to changes in the data. However, if K is too large, too many potentially irrelevant data points are used for prediction, resulting in an underfit model.

Fig. 3.10: A house in Ames in the test set (blue) along with its five closest neighbors from the training set (red). The housing prices from the training set are used to estimate the price of the test set house.

To illustrate this, consider Figure 3.10 where a single test set sample is shown as the blue circle and its five closest (geographic) neighbors from the training set are shown in red. The test set sample’s sale price is $282.5K and the neighbor’s prices, from closest to farthest, are: $320.0K, $248.5K, $286.5K, $274.0K, $320.0K. Using K = 1, the model would miss the true house price by $37.5K. This illustrates the concept of overfitting introduced in Section 1.2.1; the model is too aggressively using patterns in the training set to make predictions on new data points. For this model, increasing the number of neighbors might help alleviate the issue. Averaging all K = 5 points to make a prediction substantially cuts the error to $7.3K.
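
The arithmetic in this example is easy to verify; here is a tiny sketch using the neighbor prices quoted above.

```python
import numpy as np

# Neighbor sale prices (in $K), ordered from closest to farthest, and the true price
neighbor_prices = np.array([320.0, 248.5, 286.5, 274.0, 320.0])
true_price = 282.5

pred_k1 = neighbor_prices[0]       # K = 1: use only the closest neighbor
pred_k5 = neighbor_prices.mean()   # K = 5: average all five neighbors

print(abs(pred_k1 - true_price))   # 37.5
print(abs(pred_k5 - true_price))   # about 7.3
```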

This illustrates, for this model, the effect that the tuning parameter can have on the quality of the models. In some models, there could be more than one parameter. Again for the nearest-neighbor model, a different distance metric could be used, as well as different schemes for weighting the neighbors so that more distant points have less of an effect on the prediction.

To make sure that proper values of the tuning parameters are used, some sort of search procedure is required along with a method for obtaining good, generalizable measures of performance. For the latter, repeatedly using the test set for these questions is problematic since it would lose its impartiality. Instead, resampling is commonly used. The next section will describe a few methods for determining optimal values of these types of parameters.

3.6 Model Optimization and Tuning

The search for the best tuning parameter values can be done in many ways but most fall into two main categories: those that predefine which values to evaluate and those that incrementally determine the values. In the first case, the most well known procedure is grid search. Here, a set of candidate tuning parameter values are specified and then evaluated. In some cases, the model will have more than one tuning parameter and, in this case, a candidate parameter combination is multidimensional. We recommend using resampling to evaluate each distinct parameter value combination to get good estimates of how well each candidate performs. Once the results are calculated, the “best” tuning parameter combination is chosen and the final model is fit to the entire training set with this value. The best combination can be determined in various ways but the most common approach is to pick the candidate with the statistically best results.

As an example, a simple \(K\)-nearest neighbors model requires the number of neighbors. For the OkCupid data, this model will be used to predict the profession of the profile. In this context, the predictors contain many different types of profile characteristics, so that a “nearest neighbor” is really a similar profile based on many characteristics.

We will predefine the candidate set to be \(K = 1, 3, \ldots, 201\). When combined with the same 10-fold cross-validation process, a total of 380 temporary models will be used to determine a good value of \(K\). Once the best value of \(K\) is chosen, one final model is created using the optimal number of neighbors. The resampling profile is shown in Figure 3.11. Each black point shown in this graph is the average of performance for ten different models estimated using a distinct 90% of the training set. The configuration with the largest area under the ROC curve used 197 neighbors with a corresponding AUC of 0.757. Figure 3.11 also shows the individual resampling profiles for each of the ten resamples. The reduced amount of variation in these data is mostly due to the size of the training set28.
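
A minimal sketch of the grid search logic: each candidate value of \(K\) is resampled, the resampled metrics are averaged, and the best candidate is selected. The `resample_metric` argument is a placeholder for a resampling routine such as the cross-validation sketch in Section 3.4.1; this is not the book’s implementation.

```python
import numpy as np

def grid_search_knn(X, y, candidate_k, resample_metric):
    """Resample each candidate K and return the averaged metric per candidate.

    `resample_metric(X, y, k)` is assumed to run one resampling scheme
    (e.g., 10-fold cross-validation) for a K-NN model with k neighbors
    and return the individual resampled metric values.
    """
    profile = {}
    for k in candidate_k:
        scores = resample_metric(X, y, k)     # e.g., 10 ROC AUC estimates
        profile[k] = np.mean(scores)          # average across resamples
    best_k = max(profile, key=profile.get)    # assumes a larger metric is better
    return best_k, profile

# candidate_k = range(1, 202, 2) would reproduce the grid described above
```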

Fig. 3.11: The resampling profile generated by a simple grid search on the number of neighbors in a \(K\)-NN classification model for the OkC data. The black line represents the averaged performance across 10 resamples while the other lines represent the performance across each individual resample. Averaging the individual resample performance values reduces the variation in the profile.

When there are many tuning parameters associated with a model, there are several ways to proceed. First, a multidimensional grid search can be conducted where a grid of candidate parameter combinations is evaluated. In some cases, this can be very inefficient. Another approach is to define a range of possible values for each parameter and to randomly sample the multidimensional space enough times to cover a reasonable amount of it (Bergstra and Bengio 2012). This random search grid can then be resampled in the same way as a more traditional grid. This procedure can be very beneficial when there are a large number of tuning parameters and there is no a priori notion of which values should be used. A large grid may be inefficient to search, especially if the profile has a fairly stable pattern with little change over some range of the parameter. Neural networks, gradient boosting machines, and other models can effectively be tuned using this approach29.

To illustrate this procedure, the OkCupid data were used once again. A single-layer, feed-forward neural network was used to model the probability of being in the STEM field using the same predictors as the previous two models. This model is an extremely flexible nonlinear classification technique with many tuning parameters. See Goodfellow, Bengio, and Courville (2016) for an excellent primer on neural networks and deep learning models.

The main tuning parameters for the model are:

  • The number of hidden units. This parameter controls the complexity of the neural network. Larger values allow the model to represent more complex relationships between the predictors and the outcome but also increase the risk of overfitting. For these data, the number of units in the hidden layer was randomly selected to be between 2 and 20.
  • The activation function. This nonlinear function links different parts of the network. Three different functions were considered: sigmoidal, hyperbolic tangent (tanh), and rectified linear unit (ReLU).
  • The dropout rate. This is the rate at which coefficients are randomly set to zero during each training iteration and is a technique for attenuating overfitting (Srivastava et al. 2014). Rates between 0 and 80% were considered.

The fitting procedure for neural network coefficients can be very numerically challenging. There are usually a large number of coefficients to estimate and there is a significant risk of converging to a local optimum. Here, we use a gradient-based optimization method called RMSProp30 to fit the model. This is a modern algorithm for finding coefficient values, and the optimization procedure itself has several tuning parameters31 (a schematic of how they enter the update is sketched after the list below):

  • The batch size controls how many of the training set data points are randomly exposed to the optimization process at each iteration. This adds some randomness to the optimization, which helps reduce potential overfitting. Batch sizes between 10 and 40,000 were considered.
  • The learning rate controls the rate of descent during the parameter estimation iterations; these values were constrained to be between zero and one.
  • A decay rate that decreases the learning rate over time (ranging between zero and one).
  • The root mean square gradient scaling factor (\(\rho\)) controls how much the gradient is normalized by recent values of the squared gradient values. Smaller values of this parameter give more emphasis to recent gradients. The range of this parameter was set to be [0.0, 1.0].
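
For readers less familiar with these quantities, the sketch below is a rough, generic schematic of an RMSProp-style update for a single coefficient. It is not the exact implementation used by any particular library, but it shows how the learning rate, decay, and \(\rho\) interact.

```r
# Schematic RMSProp-style update for one coefficient (simplified sketch).
rmsprop_step <- function(theta, grad, cache, iter,
                         learn_rate = 0.001, decay = 1e-4, rho = 0.9, eps = 1e-8) {
  cache <- rho * cache + (1 - rho) * grad^2         # running average of squared gradients
  lr    <- learn_rate / (1 + decay * iter)          # decay shrinks the learning rate over time
  theta <- theta - lr * grad / (sqrt(cache) + eps)  # step scaled by recent gradient magnitude
  list(theta = theta, cache = cache)
}
```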

For this model, 20 different seven-dimensional tuning parameter combinations were created randomly, using uniform distributions to sample within the ranges above. Each of these settings was evaluated using the same 10-fold cross-validation splits used previously.
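
A sketch of how such a random grid could be drawn, using the ranges listed above, is shown below. The actual code is in the book's GitHub repository; in practice, parameters such as the decay rate are often sampled on the log scale, which is consistent with the small values seen in Table 3.3.

```r
set.seed(376)
n_candidates <- 20
random_grid <- data.frame(
  units      = sample(2:20, n_candidates, replace = TRUE),
  dropout    = runif(n_candidates, min = 0, max = 0.8),
  batch_size = sample(10:40000, n_candidates, replace = TRUE),
  learn_rate = runif(n_candidates, min = 0, max = 1),
  decay      = runif(n_candidates, min = 0, max = 1),
  rho        = runif(n_candidates, min = 0, max = 1),
  activation = sample(c("sigmoid", "tanh", "relu"), n_candidates, replace = TRUE)
)
# Each row is one seven-dimensional candidate; each is evaluated with the same
# 10-fold cross-validation splits.
```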

The resampled ROC values varied substantially across the candidate parameter combinations. The best setting, shown in the first row of Table 3.3, had a corresponding area under the ROC curve of 0.785.

Table 3.3: The settings and results for random search of the neural network parameter space.
Units Dropout Batch Size Learn Rate Grad. Scaling Decay Act. Fun. ROC
7 0.3368 11348 0.00385 0.55811 1.16e-04 sigmoid 0.785
5 0.4023 36070 0.04142 0.95232 3.84e-02 sigmoid 0.781
6 0.4720 38651 0.02634 0.80618 4.94e-05 sigmoid 0.779
12 0.6619 38206 0.25837 0.63325 3.09e-02 sigmoid 0.777
7 0.3918 17235 0.01681 0.36270 2.90e-04 tanh 0.773
10 0.4979 19103 0.04818 0.83560 1.92e-03 relu 0.772
3 0.1190 16369 0.22210 0.22683 4.02e-02 relu 0.769
2 0.3797 22255 0.10597 0.93841 4.27e-05 sigmoid 0.768
7 0.6139 38198 0.30864 0.68575 1.67e-03 sigmoid 0.756
4 0.0694 18167 0.45844 0.94679 2.97e-03 tanh 0.751
20 0.0497 29465 0.40072 0.49598 7.07e-02 sigmoid 0.747
10 0.1466 33139 0.44443 0.72107 6.22e-05 tanh 0.736
17 0.1068 17953 0.43256 0.87800 1.99e-04 tanh 0.721
13 0.5570 13558 0.13159 0.20389 5.96e-05 relu 0.717
14 0.6279 33800 0.18082 0.33286 1.08e-05 tanh 0.697
11 0.5909 32417 0.61446 0.63142 3.08e-04 tanh 0.665
14 0.3866 30514 0.92724 0.38651 3.36e-03 tanh 0.664
10 0.4234 34655 0.49455 0.25216 3.05e-04 relu 0.630
7 0.6183 30606 0.82481 0.71944 2.61e-03 sigmoid 0.616
2 0.0903 33439 0.63991 0.00398 1.92e-03 tanh 0.597

As previously mentioned, grid search and random search have their candidate tuning parameter values specified in advance, so the search does not adapt to look for novel values. There are other approaches that do. For example, nonlinear search methods such as the Nelder-Mead simplex, simulated annealing, and genetic algorithms can be employed (Chong and Żak 2008). These methods conduct very thorough searches of the tuning parameter space but tend to be computationally expensive. One reason for this is that each evaluation of a new candidate requires a good estimate of performance to guide the search, and resampling is one of the best ways to obtain such estimates. Another approach is Bayesian optimization (Mockus 1994). Here, an initial pool of candidates is evaluated using grid or random search. The optimization procedure then builds a separate model that predicts performance as a function of the tuning parameters and uses it to recommend the next candidate to evaluate. Once this new point is assessed, the surrogate model is updated and the process continues for a set number of iterations (Jones, Schonlau, and Welch 1998)32.
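
To convey the structure of this loop, the following is a deliberately simplified sketch for a single tuning parameter (the number of neighbors, \(K\)). A quadratic regression stands in for the Gaussian process surrogate that is typically used, the next candidate is chosen greedily from the surrogate's predictions rather than via an acquisition function such as expected improvement, and resample_roc() is a hypothetical function that returns the resampled area under the ROC curve for a given \(K\).

```r
# resample_roc() is hypothetical: it would fit and resample a K-NN model for a
# given K and return the cross-validated ROC AUC.
all_K <- seq(1, 201, by = 2)
set.seed(812)
results <- data.frame(K = sample(all_K, 5))          # initial pool (random search)
results$roc <- sapply(results$K, resample_roc)

for (i in 1:20) {
  surrogate <- lm(roc ~ poly(K, 2), data = results)  # stand-in for a Gaussian process
  grid <- data.frame(K = setdiff(all_K, results$K))  # candidates not yet evaluated
  grid$pred <- predict(surrogate, grid)
  next_K <- grid$K[which.max(grid$pred)]             # recommend the next candidate
  results <- rbind(results,
                   data.frame(K = next_K, roc = resample_roc(next_K)))
}
results[which.max(results$roc), ]                    # best candidate found
```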

One final point about interpreting resampling results: by choosing the best settings based on these results and then reporting the model's performance at those settings, there is a risk of optimization bias. Depending on the problem, this bias might overestimate the model's true performance. Nested resampling procedures can be used to mitigate this bias; see Boulesteix and Strobl (2009) for more information.
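
A structural sketch of nested resampling is shown below. Here, okc_train is the placeholder training set used earlier, and tune_knn() and assess_knn() are hypothetical helpers that, respectively, choose \(K\) with an inner cross-validation and compute the area under the ROC curve on held-out data.

```r
set.seed(101)
outer_fold <- sample(rep(1:10, length.out = nrow(okc_train)))
outer_auc <- numeric(10)

for (f in 1:10) {
  analysis   <- okc_train[outer_fold != f, ]  # used for tuning via inner resampling
  assessment <- okc_train[outer_fold == f, ]  # used only to measure performance
  best_K <- tune_knn(analysis)                # hypothetical: inner CV to pick K
  outer_auc[f] <- assess_knn(analysis, assessment, K = best_K)  # hypothetical
}
mean(outer_auc)  # performance estimate that is not biased by the tuning process
```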

3.7 Comparing Models Using the Training Set

When multiple models are in contention, there is often a need for formal comparisons between them to understand whether any differences in performance are beyond what one would expect at random. In our proposed workflows, resampling is relied on heavily to estimate model performance. It is good practice to use the same resamples across all models being evaluated. This enables apples-to-apples comparisons between models and also allows formal comparisons to be made before the test set is ever involved. Consider the logistic regression and neural network models created for the OkCupid data. How do they compare? Since the two models were fit and assessed with the same resamples, this leads to a set of 10 paired comparisons. Table 3.4 shows the specific ROC results per resample and their paired differences. The correlation between these two sets of values is 0.25, indicating that there is likely to be a resample-to-resample effect in the results.

Table 3.4: Matched resampling results for two models for predicting the OkCupid data. The ROC metric was used to tune each model. Because each model uses the same resampling sets, we can formally compare the performance between the models.
ROC Estimates
Logistic Regression Neural Network Difference
Fold 1 0.798 0.774 -0.024
Fold 2 0.778 0.777 -0.001
Fold 3 0.790 0.793 0.003
Fold 4 0.795 0.798 0.003
Fold 5 0.797 0.780 -0.017
Fold 6 0.780 0.790 0.009
Fold 7 0.790 0.778 -0.012
Fold 8 0.784 0.774 -0.010
Fold 9 0.795 0.793 -0.002
Fold 10 0.796 0.795 -0.001

Given this set of paired differences, formal statistical inference can be used to compare models. A simple approach is a paired t-test between the two models or, equivalently, a one-sample t-test on the differences. The estimated difference in the ROC values is -0.005 with a 95% confidence interval of (-0.013, 0.002). There does not appear to be any evidence of a real performance difference between these models. This approach was previously used in Section 2.3, where the potential variables for predicting stroke outcome were ranked by their improvement in the area under the ROC curve above and beyond the null model. The same comparative approach can be used to compare different models or different configurations of the same model (e.g., preprocessing choices or feature sets).
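
These calculations are straightforward to reproduce in base R; the values below are transcribed from Table 3.4.

```r
logistic <- c(0.798, 0.778, 0.790, 0.795, 0.797, 0.780, 0.790, 0.784, 0.795, 0.796)
neural   <- c(0.774, 0.777, 0.793, 0.798, 0.780, 0.790, 0.778, 0.774, 0.793, 0.795)

cor(logistic, neural)                # roughly 0.25

# Paired t-test (equivalently, a one-sample t-test on the differences):
t.test(neural, logistic, paired = TRUE)
# estimated difference of about -0.005 with a 95% CI of roughly (-0.013, 0.002)
```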

The value in this technique is two-fold:

  1. It prevents the test set from being used during the model development process and
  2. Many evaluations on data held out from the model fitting process are used to assess the differences.

The second point is most important. By using multiple differences, the variability in the performance statistics can be measured. While a single static test set has its advantages, it is a single realization of performance for a model and we have no idea of the precision of this value.

More than two models can be compared as well, although the analysis must account for the within-resample correlation, for example by using a Bayesian hierarchical model (McElreath 2016) or a repeated measures model (West, Welch, and Galecki 2014). The idea for this methodology originated with Hothorn et al. (2005), and Benavoli et al. (2016) provide a Bayesian approach to analyzing resampling results across models and data sets.
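
As an illustration, suppose the resampled ROC values for several models are stacked into a long-format data frame called resample_results (a placeholder name) with columns roc, model, and fold. A minimal sketch of a repeated measures analysis using the lme4 package, with a random intercept per resample to account for the within-resample correlation, is:

```r
library(lme4)

# The random intercept for each fold captures the resample-to-resample effect;
# the fixed effect for model estimates performance differences between models.
fit <- lmer(roc ~ model + (1 | fold), data = resample_results)
summary(fit)

# A Bayesian analogue can be fit with the same formula using rstanarm::stan_lmer().
```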

3.8 Computing

R programs for reproducing these analyses can be found at https://github.com/topepo/TBD.


Kim, Albert Y, and Adriana Escobedo-Land. 2015. “OkCupid Data for Introductory Statistics and Data Science Courses.” Journal of Statistics Education 23 (2):1–25.

Kuhn, Max, and Kjell Johnson. 2013. Applied Predictive Modeling. Vol. 26. Springer.

Kvalseth, T. 1985. “Cautionary Note About \(R^2\).” American Statistician 39 (4):279–85.

Lawrence, I, and Kuei Lin. 1989. “A Concordance Correlation Coefficient to Evaluate Reproducibility.” Biometrics, 255–68.

Hampel, D.F., P.J. Andrews, F.R. Bickel, P.J. Rogers, W.H. Huber, and J.W. Tukey. 1972. Robust Estimates of Location. Princeton, New Jersey: Princeton University Press.

Rousseeuw, Peter J, and Christophe Croux. 1993. “Alternatives to the Median Absolute Deviation.” Journal of the American Statistical Association 88 (424):1273–83.

Agresti, Alan. 2012. Categorical Data Analysis. Wiley-Interscience.

Friendly, Michael, and David Meyer. 2015. Discrete Data Analysis with R: Visualization and Modeling Techniques for Categorical and Count Data. CRC Press.

McElreath, R. 2016. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Boca Raton, FL: Chapman & Hall/CRC.

Altman, DG, and JM Bland. 1994b. “Statistics Notes: Diagnostic tests 2: predictive values.” British Medical Journal 309 (6947):102.

Breiman, L., J. Friedman, R. Olshen, and C. Stone. 1984. Classification and Regression Trees. New York: Chapman & Hall.

MacKay, David JC. 2003. Information Theory, Inference and Learning Algorithms. Cambridge University Press.

Altman, DG, and JM Bland. 1994a. “Diagnostic tests 3: receiver operating characteristic plots.” BMJ: British Medical Journal 309 (6948):188.

Christopher, D Manning, Raghavan Prabhakar, and Schutze Hinrich. 2008. Introduction to Information Retrieval. Cambridge University Press.

Shao, Jun. 1993. “Linear Model Selection by Cross-Validation.” Journal of the American Statistical Association 88 (422):486–94.

Davison, Anthony Christopher, and David Victor Hinkley. 1997. Bootstrap Methods and Their Application. Cambridge University Press.

Hyndman, R, and G Athanasopoulos. 2013. Forecasting: Principles and Practice. OTexts.

Efron, B. 1983. “Estimating the error rate of a prediction rule: improvement on cross-validation.” Journal of the American Statistical Association, 316–31.

Efron, B, and R Tibshirani. 1997. “Improvements on cross-validation: The 632+ bootstrap method.” Journal of the American Statistical Association, 548–60.

Bergstra, James, and Yoshua Bengio. 2012. “Random Search for Hyper-Parameter Optimization.” Journal of Machine Learning Research 13:281–305.

Goodfellow, I., Y. Bengio, and A. Courville. 2016. Deep Learning. MIT Press.

Srivastava, Nitish, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. “Dropout: A Simple Way to Prevent Neural Networks from Overfitting.” Journal of Machine Learning Research 15:1929–58.

Chong, Edwin K. P., and Stanislaw H. Żak. 2008. “Global Search Algorithms.” In An Introduction to Optimization, 267–95. John Wiley & Sons, Inc.

Mockus, Jonas. 1994. “Application of Bayesian Approach to Numerical Methods of Global and Stochastic Optimization.” Journal of Global Optimization 4 (4). Springer:347–65.

Jones, Donald R, Matthias Schonlau, and William J Welch. 1998. “Efficient Global Optimization of Expensive Black-Box Functions.” Journal of Global Optimization 13 (4). Springer:455–92.

Boulesteix, Anne-Laure, and Carolin Strobl. 2009. “Optimal Classifier Selection and Negative Bias in Error Rate Estimation: An Empirical Study on High-Dimensional Prediction.” BMC Medical Research Methodology 9 (1):85.

West, Brady T, Kathleen B Welch, and Andrzej T Galecki. 2014. Linear Mixed Models: A Practical Guide Using Statistical Software. CRC Press.

Hothorn, T, F Leisch, A Zeileis, and K Hornik. 2005. “The Design and analysis of benchmark experiments.” Journal of Computational and Graphical Statistics 14 (3):675–99.

Benavoli, Alessio, Giorgio Corani, Janez Demsar, and Marco Zaffalon. 2016. “Time for a Change: A Tutorial for Comparing Multiple Classifiers Through Bayesian Analysis.” arXiv.org.


  14. While there have been instances where online dating information has been obtained without authorization, these data were made available with permission from OkCupid president and co-founder Christian Rudder. For more information, see the original publication. In these data, no user names or images were made available.

  15. github.com/rudeboybert/JSE_OkCupid

  16. Note that these values were not obtained by simply re-predicting the data set. The values in this table were computed from the “assessment” sets generated during the cross-validation procedure defined in Section 3.4.1.

  17. Values close to -1 are rarely seen in predictive modeling since the models are seeking to find predicted values that are similar to the observed values. We have found that a predictive model that has difficulty finding a relationship between the predictors and the response has a kappa value slightly below or near 0.

  18. The formulas that follow can be naturally extended to more than two classes, where \(C\) represents the total number of classes.

  19. In fact, the Gini statistic is equivalent to the binomial variance when there are two classes.

  20. While not helpful for comparing models, these two statistics are widely used in the process of creating decision trees. See Breiman et al. (1984) and Quinlan (1993) for examples. In that context, these metrics enable tree-based algorithms to create effective models.

  21. See Chapter 16 of Kuhn and Johnson (2013).

  22. As with the confusion matrix in Table 3.1, these data were created during 10-fold cross-validation.

  23. There are other types of subsets, such as a validation set, but these are not explored here.

  24. In fact, many people use the terms “training” and “testing” to describe the splits of the data produced during resampling. We avoid that here because 1) resampling is only ever conducted on the training set samples and 2) the terminology can confuse people since the same term is being used for different versions of the original data.

  25. This is based on the idea that the standard error of the mean has \(\sqrt{R}\) in the denominator.

  26. This is a general trend; there are many factors that affect these comparisons.

  27. Not to be confused with the hyperparameters of a prior distribution in Bayesian analysis.

  28. This is generally not the case in other data sets; visualizing the individual profiles can often show an excessive amount of noise from resample-to-resample that is averaged out in the end.

  29. However, a regular grid may be more efficient. For some models, there are computational shortcuts that allow the results for a set of candidate parameter values to be obtained without refitting the model for each candidate. The nature of random search negates this benefit.

  30. RMSProp is a general optimization method that uses gradients. The details are beyond the scope of this book but more information can be found in Goodfellow, Bengio, and Courville (2016) and at https://en.wikipedia.org/wiki/Stochastic_gradient_descent#RMSProp.

  31. Many of these parameters are fairly arcane for those not well acquainted with modern derivative-based optimization. Chapter 8 of Goodfellow, Bengio, and Courville (2016) has a substantial discussion of these methods.

  32. The GitHub repository http://bit.ly/2yjzB5V has example code for nonlinear search methods and Bayesian optimization of tuning parameters.