1.3 A More Complex Example

To illustrate the interplay between models and features, we present a more realistic example [1]. The case study discussed here involves predicting the ridership on Chicago “L” trains (i.e., the number of people entering a particular station on a daily basis). If a sufficiently predictive model can be built, the Chicago Transit Authority could use it to staff trains appropriately and to determine the number of cars required per line. This data set is discussed in more detail in Section 4.1; this section describes a series of models that were evaluated when the data were originally analyzed.

Figure 1.7: A series of model and feature set combinations for modeling train ridership in Chicago.

To begin, a simple set of four predictors was considered. These initial predictors, labeled as “Set 1”, were developed because they are simple to calculate and visualizations showed strong relationships with ridership (the outcome). A variety of different models were evaluated and the root mean squared error (RMSE) was estimated using resampling methods. Figure 1.7 shows the results for several different types of models (e.g., tree-based models, linear models, etc.). RMSE values for the initial feature set ranged between 2331 and 3248 daily rides [2]. With the same feature set, tree-based models had the best performance while linear models had the worst results. Additionally, there is very little variation in RMSE results within a model type (i.e., the linear model results tend to be similar to each other).
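As a rough illustration of what "RMSE estimated using resampling" means, the sketch below computes a held-out RMSE with k-fold cross-validation for a one-predictor linear model. Everything here is an assumption made for illustration: the data are synthetic (not the Chicago data), the helper names `rmse`, `fit_line`, and `kfold_rmse` are hypothetical, and k-fold is only one common resampling scheme; the text above does not say which scheme was used in the original analysis.

```python
import math
import random

def rmse(observed, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def fit_line(x, y):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def kfold_rmse(x, y, k=5, seed=0):
    """Average RMSE over k held-out folds (illustrative resampling scheme)."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        # Fit on the training rows only, then score on the held-out rows.
        intercept, slope = fit_line([x[i] for i in train], [y[i] for i in train])
        preds = [intercept + slope * x[i] for i in held_out]
        scores.append(rmse([y[i] for i in held_out], preds))
    return sum(scores) / k

# Synthetic data: a noisy linear relationship, purely for demonstration.
rng = random.Random(1)
x = [float(i) for i in range(100)]
y = [2000.0 + 30.0 * xi + rng.gauss(0, 200) for xi in x]
resampled_rmse = kfold_rmse(x, y)
```

Because each fold is scored on rows the model never saw, the averaged value estimates how the model would perform on new data rather than on the data used to fit it. For data ordered in time, a scheme that respects that ordering may be more appropriate than the random folds used in this sketch.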

In an effort to improve model performance, some time was spent deriving a second set of predictors that might be used to augment the original group of four. From this, 128 numeric predictors were identified that were lagged versions of the ridership at different stations. For example, to predict the ridership one week in the future, today’s ridership would be used as a predictor (i.e., a seven-day lag). This second set of predictors had a beneficial effect overall but was especially helpful to linear models (see the x-axis value of {1, 2} in Figure 1.7). However, the benefit varied between models and model types.
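The seven-day lag described above can be sketched as follows. This is a minimal illustration, not the procedure used in the original analysis: the function name `lagged_pairs` is hypothetical and the ridership numbers are made up.

```python
def lagged_pairs(series, lag):
    """Pair each value with the value observed `lag` steps earlier.

    Returns a list of (lagged_value, current_value) tuples. The first
    `lag` observations have no lagged counterpart and are dropped.
    """
    if lag < 1:
        raise ValueError("lag must be a positive integer")
    return [(series[t - lag], series[t]) for t in range(lag, len(series))]

# Hypothetical daily ridership for one station over three weeks (made-up values).
daily_rides = [
    15000, 15500, 15300, 15400, 14800, 6000, 4500,
    15100, 15600, 15200, 15500, 14900, 6100, 4400,
    15200, 15400, 15100, 15300, 14700, 5900, 4600,
]

# Each row uses the ridership from seven days earlier as the predictor.
pairs = lagged_pairs(daily_rides, lag=7)
predictor, outcome = pairs[0]
```

Note that the lag costs data: with 21 days and a seven-day lag, only 14 rows have both a predictor and an outcome.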

Since the lag variables were important for predicting the outcome, more lag variables were created using lags between 8 and 14 days. Many of these variables are strongly correlated with the other predictors. However, models with predictor sets 1, 2, and 3 did not show much meaningful improvement above and beyond the previous set of models and, for some, the results were worse. One particular linear model suffered because this expanded set had a high degree of between-variable correlation. This situation is generally known as multicollinearity and can be particularly troubling for some models. Because this expanded group of lagged variables did not show much benefit overall, it was not considered further.
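To see why additional lags can be nearly redundant, the sketch below measures the correlation between lagged columns of a series with a strong weekly pattern. The series is synthetic and the helpers `pearson` and `lag_column` are hypothetical names introduced for illustration; the actual correlations in the Chicago data are not reproduced here.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def lag_column(series, lag, max_lag):
    """Values lagged by `lag`, aligned so every lag column covers the same rows."""
    return [series[t - lag] for t in range(max_lag, len(series))]

# Synthetic series: a repeating weekly pattern plus a small deterministic wobble.
weekly_pattern = [15000, 15500, 15300, 15400, 14800, 6000, 4500]
series = [weekly_pattern[d % 7] + (d * 37) % 101 for d in range(56)]

lag7 = lag_column(series, 7, max_lag=14)
lag8 = lag_column(series, 8, max_lag=14)
lag14 = lag_column(series, 14, max_lag=14)

# Lags 7 and 14 both land on the same day of the week, so they carry
# almost the same information; lag 8 lands on a different day.
r_7_14 = pearson(lag7, lag14)
r_7_8 = pearson(lag7, lag8)
```

When two predictors are this close to being copies of each other, a linear model has little basis for dividing the effect between them, which is one way multicollinearity can destabilize its coefficient estimates.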

When brainstorming which predictors could be added next, it seemed reasonable to think that weather conditions might affect ridership. To evaluate this conjecture, a fourth set of 18 predictors was calculated and used in models with the first two sets (labeled as {1, 2, 4}). Like the third set, the weather predictors did not show any relevance to predicting train ridership.

After conducting exploratory data analysis of residual plots associated with models with sets 1 and 2, a fifth set of 49 binary predictors was developed to address days where the current best models did poorly. These predictors resulted in a substantial drop in model error and were retained (see Figures 3.12 and 4.19). Note that the improvement affected models differently and that, with feature sets 1, 2, and 5, the simple linear models yielded results that are on par with more complex modeling techniques.
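The general workflow of using residuals to motivate binary predictors might look like the sketch below. It is an assumption-laden illustration: the data, the day labels, and the helpers `largest_residual_days` and `indicator` are all hypothetical, and the text above does not say how the 49 predictors were actually defined. In this sketch the indicator is built from a hand-specified list of special days that the analyst settles on after inspecting the residuals, not from the residuals themselves.

```python
def largest_residual_days(dates, observed, predicted, top_n=3):
    """Return the dates with the largest absolute residuals, worst first."""
    ranked = sorted(
        zip(dates, observed, predicted),
        key=lambda row: abs(row[1] - row[2]),
        reverse=True,
    )
    return [date for date, _, _ in ranked[:top_n]]

def indicator(dates, flagged):
    """Binary predictor: 1 if the date is in `flagged`, otherwise 0."""
    flagged = set(flagged)
    return [1 if d in flagged else 0 for d in dates]

# Made-up week of observations and model predictions.
dates = ["day-01", "day-02", "day-03", "day-04", "day-05", "day-06", "day-07"]
observed = [15000, 15400, 5200, 15300, 14900, 6000, 4500]
predicted = [15100, 15300, 15200, 15250, 15000, 6100, 4400]

# Step 1: find where the current model does poorly.
worst = largest_residual_days(dates, observed, predicted, top_n=1)

# Step 2: after looking for an explanation for those days, encode it as a
# binary predictor. The list below is a hypothetical, hand-specified example.
special_days = ["day-03"]
is_special_day = indicator(dates, special_days)
```

Defining the indicator from an external explanation, rather than directly from the largest residuals, keeps the new predictor usable for future days whose outcomes are not yet known.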

The overall points that should be understood from this demonstration are:

  1. When modeling data, there is almost never a single model fit or feature set that will immediately solve the problem. The process is more likely to be a campaign of trial and error to achieve the best results.
  2. The effect of feature sets can be much larger than the effect of different models.
  3. The interplay between models and features is complex and somewhat unpredictable.
  4. With the right set of predictors, it is common that many different types of models can achieve the same level of performance. Initially, the linear models had the worst performance but, in the end, showed some of the best performance.

Techniques for discovering, representing, adding, and subtracting features are discussed in subsequent chapters.

  1. This example will be analyzed at length in later chapters.

  2. An RMSE value of 3000 can correspond to \(R^2\) values of between 0.80 and 0.90 in these data. However, as discussed in Section 3.2.1, \(R^2\) can be misleading here due to the nature of these data.