# 1 Introduction

Statistical models have gained importance as they have become ubiquitous in modern society. They help us by generating various types of predictions in our daily lives. For example, doctors rely on general rules derived from models that tell them which specific cohorts of patients have an increased risk of a particular ailment or event. A numeric prediction of a flight’s arrival time can help us understand if our airplane is likely to be delayed. In other cases, models are effective at telling us what is important or concrete. For example, a lawyer might utilize a statistical model to quantify the likelihood that potential hiring bias is occurring by chance or whether it is likely to be a systematic problem.

In each of these cases, models are created by taking existing data and finding a mathematical representation that has acceptable fidelity to the data. From such a model, important statistics can be estimated. In the case of airline delays, a prediction of the outcome (arrival time) is the quantity of interest, while the estimate of a possible hiring bias might be revealed through a specific model parameter. In the latter case, the hiring bias estimate is usually compared to the estimated uncertainty (i.e. noise) in the data and a determination is made based on how uncommon such a result would be relative to the noise - a concept usually referred to as “statistical significance.” This type of model is generally thought of as being *inferential*: a conclusion is reached for the purpose of understanding the state of nature. In contrast, the prediction of a particular value (such as arrival time) reflects an *estimation problem* where our goal is not necessarily to understand if a trend or fact is genuine but is focused on having the most accurate determination of that value. The uncertainty in the prediction is another important quantity, especially to gauge the trustworthiness of the value generated by the model.

Whether the model will be used for inference or estimation (or on rare occasions, both), there are important characteristics to consider. *Parsimony* (or simplicity) is a key consideration. Simple models are generally preferable to complex models, especially when inference is the goal. For example, it is easier to specify realistic distributional assumptions in models with fewer parameters. Parsimony also leads to a higher capacity to interpret a model. For example, an economist might be interested in quantifying the benefit of postgraduate education on salaries. A simple model might represent this relationship between years of education and job salary linearly. This parameterization would easily facilitate statistical inferences on the potential benefit of such education. But suppose that the relationship differs substantially between occupations and/or is not linear. A more complex model would do a better job at capturing the data patterns but would be much less interpretable.

The problem, however, is that **accuracy should not be seriously sacrificed for the sake of simplicity**. A simple model might be easy to interpret but would not succeed if it does not maintain an acceptable level of faithfulness to the data; if a model is only 50% accurate, should it be used to make inferences or predictions? Complexity is usually the solution to poor accuracy. By using additional parameters or by using a model that is inherently nonlinear, we might improve accuracy but interpretability will likely suffer greatly. This trade-off is a key consideration for model building.

Thus far the discussion has been focused on aspects of the model. However, the variables that go into the model (and how they are represented) are just as critical to success. It is impossible to talk about modeling without discussing models, but one goal of this book is to increase the emphasis on the predictors in a model.

In terms of nomenclature, the quantity that is being modeled or predicted is referred to as the *outcome*, response, or dependent variable. The variables that are used to model the outcome are called the *predictors*, *features*, or independent variables (depending on the context). For example, when modeling the sale price of a house (the outcome), the characteristics of a property (e.g. square footage, number of bedrooms and bathrooms) could be used as predictors (the term features would also be suitable). However, consider artificial model terms that are composites of one or more variables, such as the number of bedrooms *per* bathroom. This type of variable might be more appropriately called a feature. In any case, features and predictors are used to explain the outcome in a model^{4}.
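To make the distinction concrete, the following minimal sketch derives the bedrooms-per-bathroom feature from raw predictors. The records, field names, and the helper `add_beds_per_bath` are hypothetical and serve only as an illustration.

```python
# Hypothetical property records; field names and values are illustrative only.
houses = [
    {"sqft": 1500, "bedrooms": 3, "bathrooms": 2},
    {"sqft": 2400, "bedrooms": 4, "bathrooms": 2},
]


def add_beds_per_bath(record):
    """Return a copy of the record with a derived bedrooms-per-bathroom feature."""
    enriched = dict(record)  # copy so the raw predictors are left untouched
    enriched["beds_per_bath"] = record["bedrooms"] / record["bathrooms"]
    return enriched


features = [add_beds_per_bath(house) for house in houses]
```

The raw columns remain available as predictors, while the derived ratio is a composite term of the kind described above.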

As one might expect, there are good and bad ways of entering predictors into a model. In many cases, there are multiple ways that an underlying piece of information can be represented or encoded. Consider a model for the sale price of a property. The location is likely to be crucial and can be represented in different ways. Figure 1.1 shows locations for properties in and around Ames, Iowa, that were sold between 2006 and 2010. In this image, the colors represent the reported neighborhood of residence. There are 28 neighborhoods represented here and the number of properties per neighborhood ranges from a single property in Landmark to 443 in North Ames. A second representation of location in the data is longitude and latitude. A realtor might suggest using ZIP code as a predictor in the model as a proxy for school district since this can be an important consideration for buyers with children. But from an information theory point of view, longitude and latitude offer the most specificity for measuring physical location and one might make an argument that this representation has a higher degree of information content (assuming that this particular information is predictive).

The idea that there are different ways to represent predictors in a model, and that some of these representations are better than others, leads to the idea of **feature engineering** - the process of creating representations of data that increase the effectiveness of a model.

Note that model effectiveness is influenced by many things. Obviously, if the predictor has no relationship to the outcome then its representation is irrelevant. However, it is very important to realize that there are a multitude of types of models and that each has its own sensitivities and needs. For example:

- Some models cannot tolerate predictors that measure the same underlying quantity (i.e. multicollinearity or correlation between predictors).
- Many models cannot use samples with any missing values.
- Some models are severely compromised when irrelevant predictors are in the data.

Feature engineering and variable selection can help mitigate many of these issues. **The goal of this book is to help practitioners build better models by focusing on the predictors**. “Better” depends on the context of the problem but most likely involves the following factors: accuracy, simplicity, and robustness. To achieve these characteristics, or to make good trade-offs between them, it is critical to understand the interplay between predictors used in a model and the type of model. Accuracy and/or simplicity can sometimes be improved by representing data in ways that are more palatable to the model or by reducing the number of variables used. To demonstrate this point, a simple example with two predictors is shown in the next section. Additionally, a more substantial example is discussed in Section 1.3 that more closely resembles the modeling process in practice.

## 1.1 A Simple Example

As a simple example of how feature engineering can affect models, consider Figure 1.2a that shows a plot of two correlated predictor variables (labeled as *A* and *B*). The data points are colored by their outcome, a discrete variable with two possible values (“PS” and “WS”). These data originate from an experiment from Hill et al. (2007) which includes a larger predictor set. For their task, a model would require a high degree of accuracy but would not need to be used for inference. For this illustration, only these two predictors will be considered. In this figure, there is clearly a diagonal separation between the two classes. A simple logistic regression model (Hosmer and Lemeshow 2000) will be used here to create a prediction equation from these two variables. That model uses the following equation:

\[\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 A + \beta_2 B\]

where *p* is the probability that a sample is the “PS” class and the \(\beta\) values are the model parameters that need to be estimated from the data.

A standard procedure (maximum likelihood estimation) is used to estimate the three regression parameters from the data. The authors used 1009 data points to estimate the parameters (i.e. a *training set*) and reserved 1010 samples strictly for estimating performance (a *test set*)^{5}. Using the training set, the parameters were estimated to be \(\hat{\beta_0} = 1.73\), \(\hat{\beta_1} = 0.003\), and \(\hat{\beta_2} = -0.064\).
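As a small illustration of how the fitted equation is used, the sketch below converts the log-odds into a class probability with the three estimates reported above. The function name `prob_ps` and the predictor values passed to it are hypothetical; only the coefficients come from the text.

```python
import math

# Estimates reported in the text for the intercept, A, and B.
B0, B1, B2 = 1.73, 0.003, -0.064


def prob_ps(a, b):
    """Convert the linear predictor (the log-odds) into the probability of class PS."""
    log_odds = B0 + B1 * a + B2 * b
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical predictor values, chosen only for illustration.
p = prob_ps(a=100.0, b=30.0)
```

Because the estimate for *B* is negative, increasing *B* while holding *A* fixed lowers the predicted probability of the “PS” class.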

To evaluate the model, predictions are made on the test set. Logistic regression naturally produces class probabilities that give an indication of likelihood for each class. While it is common to use a 50% cutoff to make hard class predictions, the performance derived from this default might be misleading. To avoid applying a probability cutoff, a technique called the receiver operating characteristic (ROC) curve is used here^{6}. The ROC curve evaluates the results on all possible cutoffs and plots the true positive rate versus the false positive rate. The curve for this example is shown in Figure 1.2b. The best possible curve is one that is shifted as close as possible to the upper left corner while an ineffective model will stay along the diagonal line shown in red. A common summary value for this technique is to use the area under the ROC curve where a value of 1.0 corresponds to a perfect model while values near 0.5 are indicative of a model with no predictive ability. For the current logistic regression model, the area under the ROC curve is 0.794 which indicates moderate accuracy in classifying the response.
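The mechanics of the ROC curve can be sketched in a few lines of plain Python. This is a minimal illustration, not the software used for the analysis: the helper names, the toy scores, and the toy labels are all hypothetical, and the area is computed with the trapezoidal rule.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs over all cutoffs.

    labels are 1 for the event class and 0 otherwise; a sample is called
    an event when its score is greater than or equal to the cutoff.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical predicted probabilities and true classes (1 = event).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
area = auc(roc_points(scores, labels))
```

For these toy values, eight of the nine event/non-event pairs are ranked correctly, so the area is 8/9. No single cutoff had to be chosen to obtain this summary.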

Given these two predictor variables, it would make sense to try different transformations and encodings of these data in an attempt to increase the area under the ROC curve. Since the predictors are both greater than zero and appear to have right-skewed distributions, one might be inclined to take the ratio *A/B* and enter only this term in the model. Alternatively, we could also evaluate if simple transformations of each predictor would be helpful. One method is the Box-Cox transformation which uses a separate estimation procedure prior to the logistic regression model that can put the predictors on a new scale. Using this methodology, the Box-Cox estimation procedure recommended that both predictors should be used on the inverse scale (i.e. 1/*A* instead of *A*). This representation of the data is shown in Figure 1.3a. When these transformed values were entered into the logistic regression model in lieu of the original values, the area under the ROC curve changed from 0.794 to 0.848, which is a substantial increase. Figure 1.3b shows both curves. In this case the ROC curve corresponding to the transformed predictors is uniformly better than the original result.
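For reference, the Box-Cox family itself is simple to write down. The sketch below applies the standard form of the transformation for a fixed value of lambda; it does not implement the estimation procedure that chooses lambda from the data, and the predictor values are hypothetical.

```python
import math


def box_cox(x, lam):
    """Box-Cox transformation of a strictly positive value x for a given lambda."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam


# Hypothetical right-skewed, positive predictor values.
a_values = [2.0, 4.0, 10.0, 50.0]

# lambda = -1 gives 1 - 1/x, which is a linear function of the inverse 1/x.
inverse_scale = [box_cox(a, -1.0) for a in a_values]
```

Since the transformed value at lambda = -1 differs from 1/*A* only by a sign change and a constant shift, a regression model should fit equally well with either version; only the sign of the slope and the intercept change.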

This example demonstrates how an alteration of the predictors, in this case a simple transformation, can lead to improvements to the effectiveness of the model. When comparing the data in Figures 1.2a and 1.3a, it is easier to visually discriminate the two groups of data after the transformation. In transforming the data individually, we enabled the logistic regression model to do a better job of separating the classes. Depending on the nature of the predictors, using the inverse of the original data might make inferential analysis more difficult.

However, different models have different requirements of the data. If the skewness of the original predictors was the issue affecting the logistic regression model, other models exist that do not have the same sensitivity to this characteristic. For example, a neural network can also be used to fit these data without using the inverse transformation of the predictors^{7}. This model was able to achieve an area under the ROC curve of 0.844 which is roughly equivalent to the improved logistic model results. One might conclude that the neural network model is inherently always better than logistic regression since the neural network was not susceptible to the distributional aspects of the predictors. But we should not jump to a blanket conclusion like this due to the “No Free Lunch” theorem (see Section 1.2.3). Additionally, the neural network model has its own drawbacks: it is completely uninterpretable and requires extensive parameter tuning to achieve good results. Depending on how the model will be utilized, one of these models might be more favorable than the other for these data.

A more comprehensive summary of how different features can influence models in different ways is given shortly in Section 1.3. Before this example, the next section discusses several key concepts that will be used throughout this text.

## 1.2 Important Concepts

Before proceeding to specific strategies and methods, there are some key concepts that should be discussed. These concepts involve theoretical aspects of modeling as well as the practice of creating a model. A number of these aspects are discussed here; additional details are provided in later chapters, and references are given throughout this work.

### 1.2.1 Overfitting

Overfitting is the situation where a model fits very well to the current data but fails when predicting new samples. It typically occurs when the model has relied too heavily on patterns and trends in the current data set that do not occur otherwise. Since the model only has access to the current data set, it has no ability to understand that such patterns are anomalous. For example, in the housing data shown in Figure 1.1, one could determine that properties that had square footage between 1,267.5 and 1,277 and contained three bedrooms could have their sale prices predicted within $1,207 of the true values. However, the accuracy on other houses (not in this data set) that satisfy these conditions would be much worse. This is an example of a trend that does not *generalize* to new data.

Often, models that are very flexible (called “low bias models” in Section 1.2.5) have a higher likelihood of overfitting the data. It is not difficult for these models to do extremely well on the data set used to create the model and, without some preventative mechanism, they can easily fail to generalize to new data. As will be seen in the coming chapters, especially Section 3.5, overfitting is one of the primary risks in modeling and should be a concern for practitioners.
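A minimal simulation can make this behavior tangible. The sketch below uses a one-nearest-neighbor predictor as a stand-in for a very flexible model; the simulated data (a weak linear signal with substantial noise), the sample sizes, and the helper names are all assumptions made for illustration.

```python
import random

random.seed(1)


def simulate(n):
    """Hypothetical data: a weak linear signal plus substantial noise."""
    xs = [random.uniform(0.0, 1.0) for _ in range(n)]
    ys = [x + random.gauss(0.0, 1.0) for x in xs]
    return xs, ys


def predict_1nn(x_new, xs, ys):
    """Predict with the outcome of the single closest training point."""
    nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x_new))
    return ys[nearest]


def mse(xs_eval, ys_eval, xs_train, ys_train):
    """Mean squared error of the nearest-neighbor predictions."""
    errors = [(predict_1nn(x, xs_train, ys_train) - y) ** 2
              for x, y in zip(xs_eval, ys_eval)]
    return sum(errors) / len(errors)


train_x, train_y = simulate(200)
new_x, new_y = simulate(200)

# Each training point is its own nearest neighbor, so the fit looks perfect.
train_mse = mse(train_x, train_y, train_x, train_y)
# New samples reveal that the model memorized noise rather than signal.
new_mse = mse(new_x, new_y, train_x, train_y)
```

The error on the data used to build the model is zero, while the error on new samples is far from zero. Judging the model only on its training data would therefore be badly misleading.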

While models can overfit to the *data points*, such as with the housing data shown above, feature selection techniques can overfit to the *predictors*. This occurs when a variable appears relevant in the current data set but shows no real relationship with the outcome once new data are collected. The risk of this type of overfitting is especially dangerous when the number of data points is small and the number of potential predictors is very large. As with overfitting to the data points, this problem can be mitigated using a methodology that will show a warning when this is occurring.
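The following sketch illustrates overfitting to the predictors under deliberately extreme assumptions: 20 samples, 500 candidate predictors, and an outcome that is pure noise, so no predictor is truly relevant. The sizes, the seed, and the helper names are hypothetical.

```python
import random

random.seed(2)


def pearson(u, v):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    return cov / (su * sv)


n_samples, n_predictors = 20, 500


def noise_data():
    """Outcome and predictors are independent noise: nothing is truly relevant."""
    y = [random.gauss(0.0, 1.0) for _ in range(n_samples)]
    X = [[random.gauss(0.0, 1.0) for _ in range(n_samples)]
         for _ in range(n_predictors)]  # one list of values per predictor
    return X, y


X_old, y_old = noise_data()
X_new, y_new = noise_data()

# "Select" the predictor with the strongest apparent relationship in the current data.
best = max(range(n_predictors), key=lambda j: abs(pearson(X_old[j], y_old)))
apparent = abs(pearson(X_old[best], y_old))
on_new_data = abs(pearson(X_new[best], y_new))
```

With so many candidates and so few samples, the selected predictor shows a sizable apparent correlation even though none exists; the same column evaluated on freshly simulated data will typically show a much weaker one.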

### 1.2.2 Supervised and Unsupervised Procedures

Supervised data analysis involves identifying patterns between predictors and an identified *outcome* that is to be modeled or predicted, while unsupervised techniques are focused solely on identifying patterns among the predictors.

Both types of analyses would typically involve some amount of exploration. Exploratory data analysis (EDA) (Tukey 1977) is used to understand the major characteristics of the predictors and outcome so that any particular challenges associated with the data can be discovered prior to modeling. This can include investigations of correlation structures in the variables, patterns of missing data, and/or anomalous motifs in the data that might challenge the initial expectations of the modeler.

Obviously, predictive models are strictly supervised since there is a direct focus on finding relationships between the predictors and the outcome. Unsupervised analyses include methods such as cluster analysis, principal component analysis, and similar tools for discovering patterns in data.

Both supervised and unsupervised analyses are susceptible to *overfitting* but supervised analyses are particularly inclined to discovering erroneous patterns in the data for predicting the outcome. In short, we can use these techniques to create a *self-fulfilling predictive prophecy*. For example, it is not uncommon for an analyst to conduct a supervised analysis of data to detect which predictors are significantly associated with the outcome. These predictors are then used in a visualization (such as a heat map or cluster analysis) on the same data but with only the significant predictors. Not surprisingly, the visualization reliably demonstrates that there are clear patterns between the outcomes and predictors and appears to provide evidence of their importance. However, since the same data are shown, the visualization is essentially *cherry picking* the results that are only true for this data and which are unlikely to generalize to new data.

### 1.2.3 No Free Lunch

The “No Free Lunch” Theorem (Wolpert 1996) is the idea that, without any specific knowledge of the problem or data at hand, no one predictive model can be said to be the best. There are many models that are optimized for some data characteristics (such as missing values or collinear predictors). In these situations, it might be reasonable to assume that they would do better than other models (all other things being equal). In practice, things are not so simple. One model that is optimized for collinear predictors might be constrained to model linear trends in the data and is sensitive to missingness in the data. It is very difficult to predict the best model especially before the data are in hand.

There have been experiments to judge which models tend to do better than others *on average*, notably Demsar (2006) and Fernandez-Delgado et al. (2014). These analyses show that some models have a tendency to produce the most accurate models but the rate of “winning” is not high enough to enact a strategy of “always use model *X*.”

In practice, it is wise to try a number of disparate types of models to probe which ones will work well with your particular data set.
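A small simulation can illustrate why no single model should be assumed best. The sketch below compares a straight-line fit with a one-nearest-neighbor model on two simulated problems; the two generating functions, the noise level, the sample sizes, and all helper names are assumptions made for illustration.

```python
import math
import random

random.seed(3)


def make_data(truth, n=500, noise_sd=0.3):
    """Simulate n noisy observations of the function `truth` on [0, 1]."""
    xs = [random.uniform(0.0, 1.0) for _ in range(n)]
    ys = [truth(x) + random.gauss(0.0, noise_sd) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Least squares straight line; returns a prediction function."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return lambda x: my + slope * (x - mx)


def fit_1nn(xs, ys):
    """One-nearest-neighbor; returns a prediction function."""
    return lambda x: ys[min(range(len(xs)), key=lambda i: abs(xs[i] - x))]


def holdout_mse(fit, truth):
    """Fit on one simulated sample and score on a second, independent one."""
    train_x, train_y = make_data(truth)
    new_x, new_y = make_data(truth)
    model = fit(train_x, train_y)
    return sum((model(x) - y) ** 2 for x, y in zip(new_x, new_y)) / len(new_x)


def linear(x):
    return 2.0 * x


def wavy(x):
    return 2.0 * math.sin(2.0 * math.pi * x)


results = {
    ("linear", "line"): holdout_mse(fit_line, linear),
    ("linear", "1nn"): holdout_mse(fit_1nn, linear),
    ("wavy", "line"): holdout_mse(fit_line, wavy),
    ("wavy", "1nn"): holdout_mse(fit_1nn, wavy),
}
```

In this setup the straight line has the lower error on the linear problem and the nearest-neighbor model has the lower error on the nonlinear one, so the ranking of the two models depends on the data at hand.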

### 1.2.4 The Model versus the Modeling Process

The process of developing an effective model is both iterative and heuristic. It is difficult to know the needs of any data set prior to working with it and it is common for many approaches to be evaluated and modified before a model can be finalized. Many books and resources solely focus on the modeling technique but this activity is often a small part of the overall process. Figure 1.4 shows an illustration of the overall process for creating a model for a typical problem.

The initial activity begins^{8} at marker (*a*) where exploratory data analysis is used to investigate the data. After initial explorations, marker (*b*) indicates where early data analysis might take place. This could include evaluating simple summary measures or identifying predictors that have strong correlations with the outcome. The process might iterate between visualization and analysis until the modeler feels confident that the data are well understood. At milestone (*c*), the first draft for how the predictors will be represented in the models is created based on the previous analysis.

At this point, several different modeling methods might be evaluated with the initial feature set. However, many models can contain *hyperparameters* that require tuning^{9}. This is represented at marker (*d*) where four clusters of models are shown as thin red marks. This represents four distinct models that are being evaluated but each one is evaluated multiple times over a set of candidate hyperparameter values. This model tuning process is discussed in Section 3.6 and is illustrated several times in later chapters. Once the four models have been tuned, they are numerically evaluated on the data to understand their performance characteristics (*e*). Summary measures for each model, such as model accuracy, are used to understand the level of difficulty for the problem and to determine which models appear to best suit the data. Based on these results, more EDA can be conducted on the model results (*f*), such as residual analysis. For the previous example of predicting the sale prices of houses, the properties that are poorly predicted can be examined to understand if there are any systematic issues with the model. As an example, there may be particular ZIP codes that are difficult to accurately assess. Consequently, another round of feature engineering (*g*) might be used to compensate for these obstacles. By this point, it may be apparent which models tend to work best for the problem at hand and another, more extensive, round of model tuning can be conducted on fewer models (*h*). After more tuning and modification of the predictor representation, the two candidate models (#2 and #4) have been finalized. These models can be evaluated on an external test set as a final “bake off” between the models (*i*). The final model is then chosen (*j*) and this fitted model will be used going forward to predict new samples or to make inferences.

The point of this schematic is to illustrate there are far more activities in the process than simply fitting a single mathematical model. For most problems, it is common to have feedback loops that evaluate and reevaluate how well any model/feature set combination performs.

### 1.2.5 Model Bias and Variance

*Variance* is a well understood concept. When used in regard to data, it describes the degree to which the values can fluctuate. If the same object is measured multiple times, the observed measurements will be different to some degree. In statistics, *bias* is generally thought of as the degree to which something deviates from its true underlying value. For example, when trying to estimate public opinion on a topic, a poll could be systematically biased if the people surveyed over-represent a particular demographic. The bias would occur as a result of the poll incorrectly estimating the desired target.

Models can also be evaluated in terms of variance and bias. A model has high variance if small changes to the underlying data used to estimate the parameters cause a sizable change in those parameters (or in the structure of the model). For example, the sample mean of a set of data points has higher variance than the sample median. The latter uses only the values in the center of the data distribution and, for this reason, it is insensitive to moderate changes in the values. A few examples of models with *low variance* are linear regression, logistic regression, and partial least squares. High variance models include those that use individual data points to define their parameters such as classification or regression trees, nearest neighbor models, and neural networks. To contrast low variance and high variance models, consider linear regression and, alternatively, nearest neighbor models. Linear regression uses all of the data to estimate slope parameters and, while it can be sensitive to outliers, it is much less sensitive than a nearest neighbor model.

Model bias reflects the ability of a model to conform to the underlying theoretical structure of the data. A low bias model is one that can be highly flexible and has the capacity to fit a variety of different shapes and patterns. A high bias model would be unable to estimate values close to their true theoretical counterparts. Linear methods often have high bias since, without modification, they cannot describe nonlinear patterns in the predictor variables. Tree-based models, support vector machines, neural networks, and others can be very adaptable to the data and have low bias.

As one might expect, model bias and variance can often be in opposition to one another; in order to achieve low bias, models tend to demonstrate high variance (and vice versa). The *variance-bias trade-off* is a common theme in statistics. In many cases, models have parameters that control the flexibility of the model and thus affect the variance and bias properties of the results. Consider a simple sequence of data points such as a daily stock price. A moving average model would estimate the stock price on a given day by the average of the data points within a certain window of the day. The size of the window can modulate the variance and bias here. For a small window, the average is much more responsive to the data and has a high potential to match the underlying trend. However, it also inherits a high degree of sensitivity to those data in the window and this increases variance. Widening the window will average more points and will reduce the variance in the model but will also desensitize the model fit potential by risking over-smoothing the data (and thus increasing bias).
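The role of the window size can be seen in a short sketch. The price series below is hypothetical (a flat series with a single one-day jump), and this simple version of the moving average only returns values where a full window is available.

```python
def moving_average(values, window):
    """Average each run of `window` consecutive values (full windows only)."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


# Hypothetical daily prices with a single one-day jump.
prices = [10.0, 10.0, 10.0, 16.0, 10.0, 10.0, 10.0]

narrow = moving_average(prices, 3)  # responds strongly to the jump
wide = moving_average(prices, 5)    # spreads the jump over more days
```

With a window of three, the jump moves the fitted values from 10 to 12; with a window of five, it moves them only to 11.2. The narrow window is more responsive to individual points, which is helpful when the jump is real signal and harmful when it is noise.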

Consider the example in Figure 1.5a that contains a single predictor and outcome where their relationship is nonlinear. The right-hand panel (b) shows two model fits. First, a simple three-point moving average is used (in green). This trend line is bumpy but does a good job of tracking the nonlinear trend in the data. The purple line shows the results of a standard linear regression model that includes a term for the predictor value and a term for the square of the predictor value. Linear regression is linear *in the model parameters* and adding polynomial terms to the model can be an effective way of allowing the model to identify nonlinear patterns. Since the data points start low on the *y*-axis, reach an apex near a predictor value of 0.3, then decrease, a quadratic regression model would be a reasonable first attempt at modeling these data. This model is very smooth (showing low variance) but does not do a very good job of fitting the nonlinear trend seen in the data (i.e., high bias).

To accentuate this point further, the original data were “jittered” multiple times by adding small amounts of random noise to their values. This was done twenty times and, for each version of the data, the same two models were fit to the jittered data. The fitted curves are shown in Figure 1.6. The moving average shows a significant degree of noise in the regression predictions but, on average, manages to track the data patterns well. The quadratic model was not confused by the extra noise and generated very similar (although inaccurate) model fits.
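A jittering experiment of this kind can be sketched as follows. This is not the analysis behind the figures: the bump-shaped curve and its constants are assumptions chosen to give an apex near 0.3, the noise is added to the assumed true curve, and 200 jittered versions are used instead of twenty so that the summaries are stable.

```python
import math
import random

random.seed(4)


def truth(x):
    """Assumed bump-shaped curve with an apex near x = 0.3 (constants are hypothetical)."""
    return x ** 3 + math.exp(-50.0 * (x - 0.3) ** 2)


def solve3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            factor = m[r][col] / m[col][col]
            m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    coef = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        tail = sum(m[r][c] * coef[c] for c in range(r + 1, 3))
        coef[r] = (m[r][3] - tail) / m[r][r]
    return coef


def quadratic_fit(xs, ys):
    """Least squares fit of y = b0 + b1*x + b2*x^2; returns the fitted values."""
    powers = [[x ** k for x in xs] for k in range(3)]
    xtx = [[sum(p * q for p, q in zip(powers[i], powers[j])) for j in range(3)]
           for i in range(3)]
    xty = [sum(p * y for p, y in zip(powers[i], ys)) for i in range(3)]
    b0, b1, b2 = solve3(xtx, xty)
    return [b0 + b1 * x + b2 * x * x for x in xs]


def moving_average_fit(ys):
    """Three-point centered moving average (two points at each end of the series)."""
    fitted = []
    for i in range(len(ys)):
        window = ys[max(0, i - 1):i + 2]
        fitted.append(sum(window) / len(window))
    return fitted


xs = [i / 49 for i in range(50)]
clean = [truth(x) for x in xs]


def summarize(fitter, n_jitters=200, noise_sd=0.1):
    """Average squared bias and average variance of the fits across jittered data."""
    fits = []
    for _ in range(n_jitters):
        jittered = [y + random.gauss(0.0, noise_sd) for y in clean]
        fits.append(fitter(jittered))
    means = [sum(f[i] for f in fits) / n_jitters for i in range(len(xs))]
    bias_sq = sum((m - c) ** 2 for m, c in zip(means, clean)) / len(xs)
    variance = sum(sum((f[i] - means[i]) ** 2 for f in fits) / n_jitters
                   for i in range(len(xs))) / len(xs)
    return bias_sq, variance


ma_bias, ma_var = summarize(moving_average_fit)
quad_bias, quad_var = summarize(lambda ys: quadratic_fit(xs, ys))
```

Under these assumptions the moving average has the smaller squared bias and the larger variance, while the quadratic fit shows the reverse pattern, mirroring the qualitative behavior described above.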

The notions of model bias and variance are central to the ideas in this text. As previously described, simplicity is an important characteristic of a model. One method of creating a *low variance*, *low bias* model is to augment a low variance model with appropriate representations of the data to decrease the bias. The previous example in Section 1.1 is a simple example of this process; a logistic regression (high bias, low variance) was improved by modifying the predictor variables and was able to show results on par with a neural network model (low bias). As another example, the data in Figure 1.5a were generated using the following equation

\[y = x^3 + \left[\beta_1 \: \exp\left(\beta_2 \: (x-\beta_3)^2\right)\right] + \epsilon\]

Theoretically, if this functional form could be determined from the data, then the best possible model would be a nonlinear regression model (low variance, low bias). We revisit the variance-bias relationship in Section 3.4 in the context of measuring performance using resampling.

In a similar manner, models can have reduced performance due to irrelevant predictors causing excess model variation. Feature selection techniques improve models by reducing the unwanted noise of extra variables.

### 1.2.6 Experience Driven Modeling and Empirically Driven Modeling

Projects may arise where no modeling has previously been applied to the data. For example, suppose that a new customer database becomes available and this database contains a large number of fields that are potential predictors. Subject matter experts may have a good sense of what features should be in the model based on previous experience. This knowledge allows experts to be prescriptive about exactly which variables are to be used and how they are represented. Their reasoning should be strongly considered given their expertise. However, since the models estimate parameters from the data, there can be a strong desire to be *data driven* rather than *experience driven*.

Many types of models have the ability to empirically discern which predictors should be in the model and can derive the representation of the predictors that can maximize performance (based on the available data). The perceived (and often real) danger in this approach is twofold. First, as previously discussed, data driven approaches run the risk of overfitting to false patterns in the data. Second, they might yield models that are highly complex and may not have any obvious rational explanation. In the latter case a circular argument may arise where practitioners only accept models that quantify what they already know but expect better results than what a human’s manual assessment can provide. For example, if an unexpected, novel predictor is found that has a strong relationship with the outcome, this may challenge the current conventional wisdom and be viewed with suspicion.

It is common to have some conflict between experience driven modeling and empirically driven modeling. Each approach has its advantages and disadvantages. In practice, we have found that a combination of the two approaches works best as long as both sides see the value in the contrasting approaches. The subject matter expert may have more confidence in a novel model feature if they feel that the methodology used to discover the feature is rigorous enough to avoid spurious results. Also, an empirical modeler might find benefit in an expert’s recommendations to initially whittle down a large number of predictors or at least to help prioritize them in the modeling process. Also, the process of feature engineering requires some level of expertise related to what is being modeled. It is difficult to make recommendations on how predictors should be represented in a vacuum or without knowing the context of the project. For example, in the simple example in Section 1.1, the inverse transformation for the predictors might have seemed obvious to an experienced practitioner.

### 1.2.7 Big Data

The definition of Big Data is somewhat nebulous. Typically, this term implies a large number of data points (as opposed to variables) and it is worth noting that the *effective sample size* might be smaller than the actual data size. For example, if there is a severe class imbalance or rare event rate, the number of events in the data might be fairly pedestrian. Click-through rate on online ads is a good example of this. Another example is when one particular region of the predictor space is abundantly sampled. Suppose a data set had billions of records but most correspond to white males within a certain age range. The number of distinct samples might be low, resulting in a data set that is not diverse.

One situation where large data sets probably do not help is when samples are added within the mainstream of the data. This simply increases the granularity of the distribution of the variables and, after a certain point, may not help in the data analysis. More rows of data can be helpful when new areas of the population are being accrued. In other words, big data does not necessarily mean *better data*.

While the benefits of big data have been widely espoused, there are some potential drawbacks. First, it simply might not solve problems being encountered in the analysis. Big data cannot automatically induce a relationship between the predictors and outcome when none exists. Second, there are often computational ramifications to having large amounts of data. Many high variance/low bias models tend to be very complex and computationally demanding. The time to fit these models can increase with data size and, in some cases, the increase can be nonlinear. Adding more data allows these models to more accurately reflect the complexity of the data but would require specialized solutions to be feasible. This, in itself, is not problematic unless the solutions have the effect of restricting the types of models that can be utilized. It is better for the problem to dictate the type of model that is needed.

Additionally, not all models can exploit large data volumes. For high bias, low variance models, big data tends to simply drive down the standard errors of the parameter estimates. For example, in a linear regression created on a million data records, doubling or tripling the amount of training data is unlikely to move the parameter estimates to any practical degree (all other things being equal).
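As a rough sketch of why this happens, consider simple linear regression with a single predictor, assuming independent errors with constant variance \(\sigma^2\). The symbols here are introduced only for illustration and are not used elsewhere in this chapter. The standard error of the slope estimate is

\[
\operatorname{SE}(\hat{\beta}_1) = \frac{\sigma}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}} = \frac{\sigma}{s_x \sqrt{n - 1}},
\]

where \(s_x\) is the sample standard deviation of the predictor. Doubling \(n\) shrinks the standard error by a factor of about \(1/\sqrt{2} \approx 0.71\). With a million records the standard error is typically already very small, so the estimate itself has little room to move.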

However, there are models that can effectively leverage large data sets. In some domains, there can be large amounts of *unlabeled* data where the outcome is unknown but the predictors have been measured or computed. The classic examples are images and text but unlabeled data can occur in other situations. For example, pharmaceutical companies have large databases of chemical compounds that have been designed but their important characteristics have not been measured (which can be expensive). Other examples include public governmental databases where there is an abundance of data that have not been connected to a specific outcome.

Unlabeled data can be used to solve some specific modeling problems. For models that require formal probability specifications, determining multivariate distributions can be extremely difficult. Copious amounts of predictor data can help estimate or specify these distributions. Autoencoders, discussed in Section 6.3.2, are models that can denoise or smooth the predictor values. The outcome is not required to create an autoencoder so unlabeled data can potentially improve the situation.

Overall, when encountering (or being offered) large amounts of data, one might think to ask:

- What are you using it for? Does it solve some unmet need?
- Will it get in the way?

## 1.3 A More Complex Example

To illustrate the interplay between models and features we present a more realistic example^{10}. The case study discussed here involves predicting the ridership on Chicago “L” trains (i.e., the number of people entering a particular station on a daily basis). If a sufficiently predictive model can be built, the Chicago Transit Authority could use this model to appropriately staff trains and determine the number of cars required per line. This data set is discussed in more detail in Section 4.1 but this section describes a series of models that were evaluated when the data were originally analyzed.

To begin, a simple set of four predictors was considered. These initial predictors, labelled as “Set 1”, were developed because they are simple to calculate and visualizations showed strong relationships with ridership (the outcome). A variety of different models were evaluated and the root mean squared error (RMSE) was estimated using resampling methods. Figure 1.7 shows the results for several different types of models (e.g., tree-based models, linear models, etc.). RMSE values for the initial feature set ranged between 2331 and 3248 daily rides^{11}. With the same feature set, tree-based models had the best performance while linear models had the worst results. Additionally, there is very little variation in RMSE results within a model type (i.e., the linear model results tend to be similar to each other).
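For readers less familiar with the metric, the following is a minimal sketch of how RMSE is computed for one set of predictions. The ridership values are made up for illustration and are not taken from the Chicago data. In the analysis described above, this calculation would be repeated on each resample and the results averaged.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical daily ridership values and model predictions.
observed = [15000, 16200, 3100, 2900, 15800]
predicted = [14000, 16200, 5100, 2900, 15800]

error = rmse(observed, predicted)
print(error)  # in the same units as the outcome (daily rides)
```

Because the errors are squared before averaging, a few badly predicted days can dominate the statistic.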

In an effort to improve model performance, some time was spent deriving a second set of predictors that might be used to augment the original group of four. From this, 128 numeric predictors were identified that were lagged versions of the ridership at different stations. For example, to predict the ridership one week in the future, today’s ridership would be used as a predictor (i.e., a seven-day lag). This second set of predictors had a beneficial effect overall but was especially helpful to linear models (see the *x*-axis value of `{1, 2}` in Figure 1.7). However, the benefit varied between models and model types.
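The idea of a lagged predictor can be sketched in a few lines. The helper name `add_lag` and the ridership values below are hypothetical and only illustrate the mechanics. The actual predictors were built from many stations.

```python
def add_lag(series, lag):
    """Return a list where position t holds the value from `lag` steps earlier.

    Positions without enough history are filled with None.
    """
    if lag < 1:
        raise ValueError("lag must be a positive integer")
    n = len(series)
    return [None] * min(lag, n) + list(series[: max(n - lag, 0)])


# Hypothetical daily ridership for one station over 10 days.
ridership = [100, 110, 120, 130, 140, 60, 50, 105, 115, 125]
lag7 = add_lag(ridership, 7)

# Pair each day's outcome with its seven-day-lagged predictor, dropping the
# days that do not yet have a full week of history.
rows = [(y, x) for y, x in zip(ridership, lag7) if x is not None]
print(rows)
```

Note that the first week of data cannot be used for modeling once a seven-day lag is required, which is a small cost of this kind of feature.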

Since the lag variables were important for predicting the outcome, more lag variables were created using lags between 8 and 14 days. Many of these variables show a strong correlation to the other predictors. However, models with predictor sets 1, 2, and 3 did not show much meaningful improvement above and beyond the previous set of models and, for some, the results were worse. One particular linear model suffered since this expanded set had a high degree of between-variable correlation. This situation is generally known as *multicollinearity* and can be particularly troubling for some models. Because this expanded group of lagged variables didn’t show much benefit overall, it was not considered further.
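To see why additional lags can be nearly redundant, the sketch below simulates a ridership series with a strong weekly pattern and computes the correlation between its 7-day and 14-day lags. The series, the noise level, and the `pearson` helper are assumptions made for illustration. They are not the Chicago data or the analysis used for Figure 1.7.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Simulated series: a fixed weekly pattern plus a small amount of noise.
rng = random.Random(1)
weekly_pattern = [100, 110, 120, 130, 140, 60, 50]
ridership = [weekly_pattern[t % 7] + rng.uniform(-1, 1) for t in range(42)]

# Align both lagged predictors on days with at least 14 days of history.
lag7 = [ridership[t - 7] for t in range(14, 42)]
lag14 = [ridership[t - 14] for t in range(14, 42)]

r = pearson(lag7, lag14)
print(round(r, 3))
```

When the weekly pattern dominates, the two lags carry almost the same information, so the second one adds little while making coefficient estimates in a linear model less stable.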

When brainstorming which predictors could be added next, it seemed reasonable to think that weather conditions might affect ridership. To evaluate this conjecture, a fourth set of 18 predictors was calculated and used in models with the first two sets (labeled as `{1, 2, 4}`). Like the third set, the weather did not show any relevance to predicting train ridership.

After conducting exploratory data analysis of residual plots associated with models with sets 1 and 2, a fifth set of 49 binary predictors was developed to address days where the current best models did poorly. These predictors resulted in a substantial drop in model error and were retained. Note that the improvement affected models differently and that, with feature sets 1, 2, and 5, the simple linear models yielded results that are on par with more complex modeling techniques.

The overall points that should be understood from this demonstration are:

- When modeling data, there is almost never a single model fit or feature set that will immediately solve the problem. The process is more likely to be a *campaign* of trial and error to achieve the best results.
- The effect of feature sets *can* be much larger than the effect of different models.
- The interplay between models and features is complex and somewhat unpredictable.
- With the right set of predictors, it is common that many different types of models can achieve the same level of performance. Initially, the linear models had the worst performance but, in the end, showed some of the best performance.

Techniques for discovering, representing, adding, and subtracting features are discussed in subsequent chapters.

## 1.4 Feature Selection

In the previous example, new sets of features were derived sequentially to improve the performance of the model. These sets were developed, added to the model, and then resampling was used to evaluate their utility. The new predictors were not prospectively filtered for statistical significance prior to adding them to the model. This would be a *supervised* procedure and care must be taken to make sure that overfitting is not occurring.

In that example, it was demonstrated that some of the predictors have enough underlying information to adequately predict the outcome (such as sets 1, 2, and 5). However, this collection of predictors might very well contain non-informative variables and this might impact performance to some extent. To whittle the predictor set to a smaller set that contains only the informative predictors, a supervised feature selection technique might be used. Additionally, there is the possibility that there are a small number of important predictors in sets 3 and 4 whose utility was not discovered because of all of the non-informative variables in these sets.

In other cases, all of the raw predictors are known and available at the beginning of the modeling process. In this case, a less sequential approach might be used by simply using a feature selection routine to attempt to sort out the best and worst predictors.

There are a number of different strategies for supervised feature selection that can be applied and these are discussed in Chapter 11. The main distinction between the methods is how subsets are derived:

- *Wrapper* methods use an *external* search procedure to choose different subsets of the whole predictor set to evaluate in a model. This approach separates the feature search process from the model fitting process. Examples of this approach would be backwards or stepwise selection as well as genetic algorithms.
- *Embedded* methods are models where the feature selection procedure occurs naturally in the course of the model fitting process. Here an example would be a simple decision tree where variables are selected when the model uses them in a split. If a predictor is never used in a split, the prediction equation is functionally independent of this variable and it has been selected out.
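The separation between search and model fitting in a wrapper method can be sketched as a greedy backward elimination. This is only an illustration: the function names are hypothetical, and `toy_error` is a stand-in for what would normally be a resampled error estimate from a real model.

```python
def backward_selection(features, score_fn):
    """Greedy backward elimination over a list of feature names.

    `score_fn(subset)` returns an error estimate (lower is better). The search
    only calls `score_fn` and never looks inside the model being fit.
    """
    current = list(features)
    best_score = score_fn(current)
    while len(current) > 1:
        # Score every subset obtained by removing one feature.
        candidates = [
            (score_fn([f for f in current if f != drop]), drop)
            for drop in current
        ]
        score, drop = min(candidates)
        if score > best_score:
            break  # every possible removal makes the error worse
        best_score = score
        current.remove(drop)
    return current, best_score


def toy_error(subset):
    """Hypothetical error: 'a' and 'b' are informative, extra terms cost a little."""
    error = 10.0
    if "a" in subset:
        error -= 3.0
    if "b" in subset:
        error -= 2.0
    return error + 0.1 * len(subset)


selected, final_error = backward_selection(["a", "b", "noise1", "noise2"], toy_error)
print(selected, final_error)
```

Because the same data are typically used to score many candidate subsets, the search itself can overfit, so the selection procedure should be evaluated with data that were not used to guide it.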

As with model fitting, the main concern during feature selection is overfitting. This is especially true when wrapper methods are used and/or if the number of data points in the training set is small relative to the number of predictors.

Finally, *unsupervised* selection methods can have a very positive effect on model performance. Recall the California housing data discussed in Section 1. A property’s ZIP code might be a useful predictor in the model. While the raw value is coded as a number, the data truly reflect a qualitative value. Since most models require numbers for predictors, it is common to encode such data as *dummy variables*. In this case, the single ZIP code predictor, with \(k\) possible values, is converted to a set of \(k - 1\) binary variables that have a value of one when a property is in that ZIP code and zero otherwise. While this is a well-known and common approach, here it leads to cases where some of the ZIP codes have only one or two properties in these data, which is less than 1% of the overall set. With such a low frequency, such predictors might have a detrimental effect on some models (such as linear regression) and removing them prior to building the model might be advisable.
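A minimal sketch of this kind of unsupervised filter is shown below. The ZIP codes, the `min_count` threshold, and the function names are hypothetical choices for illustration. The outcome is never consulted, which is what makes the filter unsupervised.

```python
from collections import Counter


def frequent_levels(values, min_count):
    """Return the sorted levels that occur at least `min_count` times."""
    counts = Counter(values)
    return sorted(level for level, n in counts.items() if n >= min_count)


def dummy_encode(values, levels):
    """Encode `values` as len(levels) - 1 indicator columns.

    The first level is the reference and gets no column. Values outside
    `levels` (such as filtered rare codes) are encoded as all zeros, so they
    are grouped with the reference level.
    """
    columns = levels[1:]
    return [[1 if v == level else 0 for level in columns] for v in values]


# Hypothetical ZIP codes for nine properties; one code appears only once.
zip_codes = ["90001"] * 3 + ["90002"] * 3 + ["90003"] * 2 + ["90004"]

kept = frequent_levels(zip_codes, min_count=2)
encoded = dummy_encode(zip_codes, kept)

print(kept)
for code, row in zip(zip_codes, encoded):
    print(code, row)
```

Grouping rare codes with the reference level is only one option. Pooling them into an explicit "other" category is another common choice, and which is preferable depends on the model and the data.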

When conducting a search for a subset of variables, it is important to realize that there may not be a unique set of predictors that will produce the best performance. There is often a compensatory effect where, when one seemingly important variable is removed, the model adjusts using the remaining variables. This is especially true when there is some degree of correlation between the explanatory variables or when a low bias model is used. For this reason, feature selection should not be used as a formal method of determining feature *significance*. More traditional inferential statistical approaches are a better solution for appraising the contribution of a predictor to the underlying model or to the data set.

## 1.5 An Outline of the Book

The goal of this text is to provide effective tools for engineering new predictors that are relevant and predictively useful. These tools will be the bookends of the predictive modeling process. At the beginning of the process we will explore techniques for augmenting the predictor set. Then at the end of the process we will provide methods for filtering the enhanced predictor set to ultimately produce better models. These concepts will be detailed in Chapters 2-11 as follows.

We begin by providing a short illustration of the interplay between the modeling and feature engineering process (Chapter 2). In this example, we use feature engineering and feature selection methods to improve the ability of a model to predict the risk of ischemic stroke.

In Chapter 3 we will provide a review of the process for developing predictive models, which will include an illustration of the steps of data splitting, validation approach selection, model tuning, and performance estimation for future predictions. This chapter will also include guidance on how to use feedback loops when cycling through the model building process across multiple models.

Exploratory visualizations of the data are crucial for understanding relationships among predictors and between predictors and the response, especially for high-dimensional data. In addition, visualizations can be used to assist in understanding the nature of individual predictors including predictors’ skewness and missing data patterns. Chapter 4 will illustrate useful visualization techniques to explore relationships within and between predictors. Graphical methods for evaluating model lack-of-fit will also be presented.

Chapter 5 will focus on approaches for encoding discrete, or categorical, predictors. Here we will summarize standard techniques for representing categorical or ordinal (ordered categorical) predictors. Feature engineering methods for categorical predictors such as feature hashing are introduced as a method for using existing information to create new predictors that better uncover meaningful relationships. This chapter will also provide guidance on practical issues such as how to handle rare levels within a categorical predictor and the impact of creating dummy variables for tree and rule based models. Date-based predictors are present in many data sets and can be viewed as categorical predictors. Methods for encoding dates will also be demonstrated.

Engineering numeric predictors will be discussed in Chapter 6. As mentioned above, numeric predictors as collected in the original data may not be optimal for predicting the response. Univariate and multivariate transformations are a first step to finding better forms of numeric predictors. A more advanced approach is to use basis expansions (i.e. splines) to create better representations of the original predictors. In certain situations, transforming continuous predictors to categorical or ordinal bins reduces variation and helps to improve predictive model performance. Caveats to binning numerical predictors will also be provided.

Up to this point in the book, a feature has been considered as one of the observed predictors in the data. In Chapter 7 we will illustrate that important features for a predictive model could also be the interaction between two or more of the original predictors. Quantitative tools for determining which predictors interact with one another will be explored along with graphical methods to evaluate the importance of these types of effects. This chapter will also discuss the concept of estimability of interactions.

Working with profile data, such as time series (longitudinal), cellular-to-wellular, and image data will be addressed in Chapter 8. These kinds of data are normally collected in the fields of finance, pharmaceuticals, intelligence, transportation, and weather forecasting, and this particular data structure generates unique challenges to many models. Some modern predictive modeling tools such as partial least squares can naturally handle data in this format. But many other powerful modeling techniques do not have direct ways of working with this kind of data. These models require that profile data be summarized or collapsed prior to modeling. This chapter will illustrate techniques for using this kind of information in ways that strive to preserve the predictive information while creating a format that can be used across predictive models.

Every practitioner working with real-world data will encounter missing data at some point. While some predictive models (e.g. trees) have novel ways of handling missing data, other models do not and require complete data. Chapter 9 explores mechanisms that cause missing data and provides visualization tools for investigating missing data patterns. Traditional and modern tools for removing or imputing missing data are provided. In addition, the imputation methods are evaluated for continuous and categorical predictors.

It is tempting to take the tools provided in the previous chapters, apply them to the existing data, then build predictive models on the newly created features. However, a naive approach to these steps would lead to overfit models. Chapter 10 will describe the required steps to guard against overfitting when creating new features. This chapter will also provide strategies for determining the best representation of model terms that minimize the risk of overfitting.

The feature engineering process as described in Chapters 5-8 can lead to many more predictors than what was contained in the original data. While some of the additional predictors will likely enhance model performance, not all of the original and new predictors will likely be useful for prediction. The final chapter will discuss feature selection tactics as an overall strategy for improving model predictive performance. Important aspects include: the goals of feature selection, consequences of irrelevant predictors, comparisons with selection via regularization, and how to avoid overfitting (in the feature selection process). The ineffectiveness of traditional stepwise methods is also discussed.

Also, to some extent, the choice of these terms is driven by whether a person is more computer science-centric or statistics-centric.↩

These types of data sets are discussed more in Section 3.3.↩

The predictor values were normalized to have mean zero and a standard deviation of one, as is needed for this model. However, this does not affect the skewness of the data.↩

This assumes that the data have been sufficiently *cleaned* and that no erroneous values are present. Data cleaning can easily take an extended amount of time depending on the source of the data.↩

This example will be analyzed at length in later chapters.↩

An RMSE value of 3000 can correspond to \(R^2\) values of between 0.80 and 0.90 in these data. However, as discussed in Section 3.2.1, \(R^2\) can be misleading here due to the nature of these data.↩