1 Introduction

Statistical models have gained importance as they have become ubiquitous in modern society. They enable us by generating various types of predictions in our daily lives. For example, doctors rely on general rules derived from models that tell them which specific cohorts of patients have an increased risk of a particular ailment or event. A numeric prediction of a flight’s arrival time can help understand if our airplane is likely to be delayed. In other cases, models are effective at telling us what is important or concrete. For example, a lawyer might utilize a statistical model to quantify the likelihood that potential hiring bias is occurring by chance or whether it is likely to be a systematic problem.

In each of these cases, models are created by taking existing data and finding a mathematical representation that has acceptable fidelity to the data. From such a model, important statistics can be estimated. In the case of airline delays, a prediction of the outcome (arrival time) is the quantity of interest while the estimate of a possible hiring bias might be revealed through a specific model parameter. In the latter case, the hiring bias estimate is usually compared to the estimated uncertainty (i.e., noise) in the data and a determination is made based on how uncommon such a result would be relative to the noise–a concept usually referred to as “statistical significance.” This type of model is generally thought of as being inferential: a conclusion is reached for the purpose of understanding the state of nature. In contrast, the prediction of a particular value (such as arrival time) reflects an estimation problem where our goal is not necessarily to understand if a trend or fact is genuine but is focused on having the most accurate determination of that value. The uncertainty in the prediction is another important quantity, especially to gauge the trustworthiness of the value generated by the model.

Whether the model will be used for inference or estimation (or in rare occasions, both), there are important characteristics to consider. Parsimony (or simplicity) is a key consideration. Simple models are generally preferable to complex models, especially when inference is the goal. For example, it is easier to specify realistic distributional assumptions in models with fewer parameters. Parsimony also leads to a higher capacity to interpret a model. For example, an economist might be interested in quantifying the benefit of postgraduate education on salaries. A simple model might represent this relationship between years of education and job salary linearly. This parameterization would easily facilitate statistical inferences on the potential benefit of such education. But suppose that the relationship differs substantially between occupations and/or is not linear. A more complex model would do a better job at capturing the data patterns but would be much less interpretable.

The problem, however, is that accuracy should not be seriously sacrificed for the sake of simplicity. A simple model might be easy to interpret but would not succeed if it does not maintain acceptable level of faithfulness to the data; if a model is only 50% accurate, should it be used to make inferences or predictions? Complexity is usually the solution to poor accuracy. By using additional parameters or by using a model that is inherently nonlinear, we might improve accuracy but interpretability will likely suffer greatly. This trade-off is a key consideration for model building.

Thus far the discussion has been focused on aspects of the model. However, the variables that go into the model (and how they are represented) are just as critical to success. It is impossible to talk about modeling without discussing models, but one of goal this book is to increase the emphasis on the predictors in a model.

In terms of nomenclature, the quantity that is being modeled or predicted is referred to as either: the outcome, response, or dependent variable. The variables that are used to model the outcome are called the predictors, features, or independent variables (depending on the context). For example, when modeling the sale price of a house (the outcome), the characteristics of a property (e.g., square footage, number of bed rooms and bath rooms) could be used as predictors (the term features would also be suitable). However, consider artificial model terms that are composites of one or more variables, such as the number of bedrooms per bathroom. This type of variable might be more appropriately called a feature (or a derived feature). In any case, features and predictors are used to explain the outcome in a model⁴.

Figure 1.1: Houses for sale in Ames, Iowa, colored by neighborhood.

As one might expect, there are good and bad ways of entering predictors into a model. In many cases, there are multiple ways that an underlying piece of information can be represented or encoded. Consider a model for the sale price of a property. The location is likely to be crucial and can be represented in different ways. Figure 1.1 shows locations for properties in and around Ames Iowa, that were sold between 2006 and 2010. In this image, the colors represent the reported neighborhood of residence. There are 28 neighborhoods represented here and the number of properties per neighborhoods range from a single property in Landmark, to 443 in North Ames. A second representation of location in the data is longitude and latitude. A realtor might suggest using ZIP code as a predictor in the model as a proxy for school district since this can be an important consideration for buyers with children. But from an information theory point of view, longitude and latitude offer the most specificity for measuring physical location and one might make an argument that this representation has higher information content (assuming that this particular information is predictive).

The idea that there are different ways to represent predictors in a model, and that some of these representations are better than others, leads to the idea of feature engineering - the process of creating representations of data that increase the effectiveness of a model.

Note that model effectiveness is influenced by many things. Obviously, if the predictor has no relationship to the outcome then its representation is irrelevant. However, it is very important to realize that there are a multitude of types of models and that each has its own sensitivities and needs. For example:

Some models cannot tolerate predictors that measure the same underlying quantity (i.e., multicollinearity or correlation between predictors).
Many models cannot use samples with any missing values.
Some models are severely compromised when irrelevant predictors are in the data.

Feature engineering and variable selection can help mitigate many of these issues. The goal of this book is to help practitioners build better models by focusing on the predictors. “Better” depends on the context of the problem but most likely involves the following factors: accuracy, simplicity, and robustness. To achieve these characteristics, or to make good trade-offs between them, it is critical to understand the interplay between predictors used in a model and the type of model. Accuracy and/or simplicity can sometimes be improved by representing data in ways that are more palatable to the model or by reducing the number of variables used. To demonstrate this point, a simple example with two predictors is shown in the next section. Additionally, a more substantial example is discussed in Section 1.3 that more closely resembles the modeling process in practice.

Also, to some extent, the choice of these terms is driven by whether a person is more computer science-centric or statistics-centric.↩