3.7 Comparing Models Using the Training Set

When multiple models are in contention, there is often a need for formal comparisons between them to understand whether any differences in performance are above and beyond what one would expect at random. In our proposed workflows, resampling is heavily relied on to estimate model performance. It is good practice to use the same resamples across any models that are evaluated. That enables apples-to-apples comparisons between models. It also allows for formal comparisons to be made between models prior to the involvement of the test set.

Consider the logistic regression and neural network models created for the OkCupid data. How do they compare? Since the two models used the same resamples for fitting and assessment, this leads to a set of 10 paired comparisons. Table 3.4 shows the specific ROC results per resample and their paired differences. The correlation between these two sets of values is 0.97, indicating that there is likely to be a resample-to-resample effect in the results.
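The practice of reusing resamples can be sketched as follows. This is a minimal illustration using only the Python standard library, not the procedure used to produce the results in this section. The function names, the seed, and the row count are all hypothetical. The point is that the fold assignments are created once and then handed to every model, so each model is fit on the same analysis sets and evaluated on the same assessment sets.

```python
import random

def make_folds(n_rows, k, seed):
    """Assign every row index to exactly one of k assessment sets, reproducibly."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Taking every k-th shuffled index gives k folds of near-equal size.
    return [sorted(indices[i::k]) for i in range(k)]

def resample_splits(n_rows, folds):
    """Return (analysis, assessment) index lists, one pair per fold."""
    splits = []
    for assessment in folds:
        held_out = set(assessment)
        analysis = [i for i in range(n_rows) if i not in held_out]
        splits.append((analysis, assessment))
    return splits

# Hypothetical data set size, chosen only for illustration.
N_ROWS = 50

# Create the folds ONCE, then reuse the same object for every model.
folds = make_folds(N_ROWS, k=10, seed=2024)
splits_for_model_a = resample_splits(N_ROWS, folds)
splits_for_model_b = resample_splits(N_ROWS, folds)

print(splits_for_model_a == splits_for_model_b)
```

Because both models see identical splits, their per-fold metrics can be lined up row by row and treated as paired observations.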

Table 3.4: Matched resampling results for two models for predicting the OkCupid data. The ROC metric was used to tune each model. Because each model uses the same resampling sets, we can formally compare the performance between the models.

| Resample | Logistic Regression (ROC) | Neural Network (ROC) | Difference |
|----------|---------------------------|----------------------|------------|
| Fold 1   | 0.838                     | 0.838                | 0.000      |
| Fold 2   | 0.830                     | 0.821                | -0.008     |
| Fold 3   | 0.847                     | 0.846                | -0.002     |
| Fold 4   | 0.856                     | 0.852                | -0.003     |
| Fold 5   | 0.836                     | 0.830                | -0.006     |
| Fold 6   | 0.848                     | 0.846                | -0.002     |
| Fold 7   | 0.838                     | 0.836                | -0.003     |
| Fold 8   | 0.830                     | 0.829                | -0.002     |
| Fold 9   | 0.843                     | 0.839                | -0.004     |
| Fold 10  | 0.849                     | 0.844                | -0.005     |
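The resample-to-resample effect can be checked by correlating the two ROC columns of Table 3.4. The sketch below uses only the standard library and a hand-written Pearson correlation. The variable and function names are illustrative. Because the table values are rounded to three decimals, the result may differ slightly from the 0.97 reported in the text.

```python
import math

# ROC estimates per fold, copied from Table 3.4 (rounded to three decimals).
logistic = [0.838, 0.830, 0.847, 0.856, 0.836, 0.848, 0.838, 0.830, 0.843, 0.849]
neural_net = [0.838, 0.821, 0.846, 0.852, 0.830, 0.846, 0.836, 0.829, 0.839, 0.844]

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

fold_correlation = pearson(logistic, neural_net)
print(f"Correlation across folds: {fold_correlation:.3f}")
```

A correlation this close to one means that folds that are hard for one model tend to be hard for the other, which is why the comparison should be made on the paired differences rather than on two independent sets of values.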

Given this set of paired differences, formal statistical inference can be done to compare models. A simple approach would be to consider a paired t-test between the two models or, equivalently, an ordinary one-sample t-test on the differences. The estimated difference in the ROC values is -0.004 with 95% confidence interval (-0.005, -0.002). There does appear to be evidence of a real (but negligible) performance difference between these models. This approach was previously used in Section 2.3 when the potential variables for predicting stroke outcome were ranked by their improvement in the area under the ROC curve above and beyond the null model. This approach can be used to compare models or different approaches for the same model (e.g., preprocessing differences or feature sets).
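The one-sample t-test on the differences can be sketched with the standard library alone. The code below uses the Difference column of Table 3.4 as its input. The names are illustrative, and the critical value `T_CRIT_DF9` is hard-coded as the 0.975 quantile of the t distribution with 9 degrees of freedom (an assumption tied to having exactly 10 folds and a 95% interval). Since the inputs are rounded, the results should only be expected to agree with the reported values to about three decimals.

```python
import math
import statistics

# Difference column from Table 3.4 (neural network minus logistic regression).
differences = [0.000, -0.008, -0.002, -0.003, -0.006,
               -0.002, -0.003, -0.002, -0.004, -0.005]

# 0.975 quantile of Student's t with 9 degrees of freedom (10 folds - 1).
T_CRIT_DF9 = 2.262157

n = len(differences)
mean_diff = statistics.mean(differences)

# Standard error of the mean difference, using the sample standard deviation.
std_err = statistics.stdev(differences) / math.sqrt(n)

# Test statistic for the null hypothesis that the true mean difference is zero.
t_stat = mean_diff / std_err

# Two-sided 95% confidence interval for the mean difference.
ci_low = mean_diff - T_CRIT_DF9 * std_err
ci_high = mean_diff + T_CRIT_DF9 * std_err

print(f"mean difference: {mean_diff:.4f}")
print(f"t statistic:     {t_stat:.2f}")
print(f"95% CI:          ({ci_low:.4f}, {ci_high:.4f})")
```

An interval that excludes zero indicates a statistically detectable difference, while the size of the interval's endpoints shows whether that difference matters in practice.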

The value in this technique is two-fold:

  1. It prevents the test set from being used during the model development process and
  2. Many evaluations (via assessment sets) are used to gauge the differences.

The second point is more important. By using multiple differences, the variability in the performance statistics can be measured. While a single static test set has its advantages, it is a single realization of performance for a model and we have no idea of the precision of this value.

More than two models can also be compared, although the analysis must account for the within-resample correlation using a Bayesian hierarchical model (McElreath 2015) or a repeated measures model (West and Galecki 2014). The idea for this methodology originated with Hothorn et al. (2005). Benavoli et al. (2016) also provide a Bayesian approach to the analysis of resampling results between models and data sets.


McElreath, R. 2015. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. Chapman & Hall/CRC.

West, B, K Welch, and A Galecki. 2014. Linear Mixed Models: A Practical Guide Using Statistical Software. CRC Press.

Hothorn, T, F Leisch, A Zeileis, and K Hornik. 2005. “The design and analysis of benchmark experiments.” Journal of Computational and Graphical Statistics 14 (3):675–99.

Benavoli, A, G Corani, J Demsar, and M Zaffalon. 2016. “Time for a Change: A Tutorial for Comparing Multiple Classifiers Through Bayesian Analysis.” arXiv.org.