4 Exploratory Visualizations
The driving goal in everything that we do in the modeling process is to find reproducible ways to explain the variation that we see in the response. As discussed in the previous chapter, discovering patterns among the predictors that are related to the response involves selecting a resampling scheme to protect against overfitting, choosing a performance metric, tuning and training multiple models, and comparing model performance to identify which models have the best performance. When presented with a new data set, it is tempting to jump directly into the predictive modeling process to see if we can quickly develop a model that meets the performance expectations. Or, in the case where we have many predictors, the initial goal may be to use the modeling results to identify the most important predictors related to the response. But as illustrated in Figure 1.4, we should first spend a sufficient amount of time exploring the data. The focus of this chapter will be to present approaches for visually exploring data and to demonstrate how this approach can be used to help guide feature engineering.
One of the first steps of the exploratory data process, when the ultimate purpose is to predict a response, is to create visualizations that help elucidate knowledge of the response and then to uncover relationships between the predictors and the response. Therefore, our visualizations should start with the response, to understand the characteristics of its distribution, and then build outward from that with the additional information provided in the predictors. Knowledge about the response can be gained by creating a histogram or box plot. This simple visualization will reveal the amount of variation in the response and whether the response was generated by a process that has unusual characteristics that we must investigate further. Next, we can move on to exploring relationships among the predictors and between predictors and the response. Important characteristics can be identified by examining
- scatter plots of individual predictors and the response,
- a pairwise correlation plot among the predictors,
- a projection of high dimensional predictors into a lower dimensional space,
- line plots for time-based predictors,
- the first few levels of a regression or classification tree,
- a heat map across the samples and predictors, or
- mosaic plots for examining associations among categorical variables.
These visualizations provide insights that should be used to inform the initial models. It is important to note that some of the most useful visualizations for exploring our data are not necessarily complex or difficult to create. In fact, a simple scatter plot can elicit insights that a model may not be able to uncover, and can lead to the creation of a new predictor or to a transformation of a predictor or the response that improves model performance. The challenge here lies in developing intuition for knowing how to visually explore data to extract information for improvement. As illustrated in Figure 1.4, exploratory data analysis should not stop at this point, but should continue after initial models have been built. Post model building, visual tools can be used to assess model lack-of-fit and to evaluate the potential effectiveness of new predictors that were not in the original model.
In this chapter, we will delve into a variety of useful visualization tools for exploring data prior to constructing the initial model. Some of these tools can then be used after the model is built to identify features that can improve model performance. Following the outline of Figure 1.4, we will look at visualizations prior to modeling, then during the modeling process. We also refer the reader to Tufte (1990) and Cleveland (1993) which are excellent resources for visualizing data.
To illustrate these tools, we will use the Chicago Train Ridership data for numeric visualizations and the OkCupid data for categorical visualizations.
4.1 Introduction to the Chicago Train Ridership Data
To illustrate how exploratory visualizations are a critical part of understanding a data set and how visualizations can be used to identify and uncover representations of the predictors that aid in improving the predictive ability of a model, we will be using data collected on ridership on the Chicago Transit Authority (CTA) “L” train system (Figure 4.1). In the short term, understanding ridership over the next one or two weeks would allow the CTA to ensure that an optimal number of cars were available to service the Chicagoland population. As a simple example of demand fluctuation, we would expect that ridership in a metropolitan area would be stronger on weekdays and weaker on weekends. Misunderstanding demand could lead to two common mistakes. At one extreme, having too few cars on a line to meet weekday demand would delay riders from reaching their destination and would lead to overcrowding and tension. At the other extreme, having too many cars on the weekend would be inefficient, leading to higher operational costs and lower profitability. Good forecasts of demand would help the CTA to get closer to optimally meeting demand.
In the long term, forecasting could be used to project when line service may be necessary, to change the frequency of stops to optimally service ridership, or to add or eliminate stops as populations shift and demand strengthens or weakens.
For illustrating the importance of exploratory visualizations, we will focus on short term forecasting of daily ridership. Daily ridership data were obtained for 126 stations between January 22, 2001 and September 11, 2016. Ridership is measured by the number of entries into a station across all turnstiles, and the number of daily riders across stations during this time period varied considerably, ranging between 0 and 36,323 per day. For ease of presentation, ridership will be shown and analyzed in units of thousands of riders.
Our illustration will narrow to predicting daily ridership at the Clark/Lake stop. This station is an important stop to understand; it is located in the downtown loop, has one of the highest riderships across the train system, and services four different lines covering northern, western, and southern Chicago.
For these data, the ridership numbers for the other stations might be important to the models. However, since we are interested in predicting future ridership volume, only historical data would be available at the time of prediction. For time series data, predictors are often formed by lagging the data. For this application, when predicting day \(D\), predictors were created using the lag-14 data from every station (e.g. ridership at day \(D-14\)). Other lags can also be added to the model, if necessary.
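The lagging step can be sketched as follows. This is a minimal illustration rather than the code used for the analysis: the function name `add_lag`, the dictionary layout, and the ridership values are all hypothetical.

```python
from datetime import date, timedelta


def add_lag(series, lag_days=14):
    """Pair each day with the value observed `lag_days` earlier.

    `series` maps a date to ridership (in thousands of riders).
    Days whose lagged date is not in the data are skipped.
    Returns a list of (day, lagged_value, current_value) tuples.
    """
    rows = []
    for day in sorted(series):
        lagged_day = day - timedelta(days=lag_days)
        if lagged_day in series:
            rows.append((day, series[lagged_day], series[day]))
    return rows


# Hypothetical ridership for 21 consecutive days starting on a Monday:
# weekdays trend upward, weekends are flat and low.
start = date(2016, 8, 1)
ridership = {}
for i in range(21):
    day = start + timedelta(days=i)
    ridership[day] = 20.0 + i if day.weekday() < 5 else 5.0

lagged = add_lag(ridership)
```

Because 14 is a multiple of 7, each lagged value falls on the same day of the week as the day being predicted, so a weekday is always paired with a weekday.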
Other potential regional factors that may affect public transportation ridership are the weather, gasoline prices, and the employment rate. For example, it would be reasonable to expect that more people would use public transportation as the cost of using personal transportation (i.e. gas prices) increases for extended periods of time. Likewise, the use of public transportation would likely increase as the unemployment rate decreases. Weather data were obtained for the same time period as the ridership data. The weather information was recorded hourly and included many conditions such as overcast, freezing rain, snow, etc. A set of predictors was created to reflect conditions related to rain, snow/ice, clouds, storms, and clear skies. Each of these categories is determined by pooling more granular recorded conditions. For example, the rain predictor reflects recorded conditions that included “rain,” “drizzle,” and “mist”. New predictors were encoded by summarizing the hourly data across a day by calculating the percentage of observations within a day where the conditions were observed. As an illustration, on December 20, 2012, conditions were recorded as snowy (12.8% of the day), cloudy (15.4%), as well as stormy (12.8%), and rainy (71.8%). Clearly, there is some overlap in the conditions that make up these categories, such as cloudy and stormy.
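The pooling and daily summarization might look like the sketch below. The mapping `POOLS` is an assumption made for illustration (only the rain entries come from the text), and a recorded condition is allowed to count toward more than one category, which is how overlapping percentages can arise.

```python
from collections import defaultdict

# Assumed mapping from a recorded condition to its pooled categories.
# Only "rain", "drizzle", and "mist" -> rain is taken from the text.
POOLS = {
    "rain": ("rain",),
    "drizzle": ("rain",),
    "mist": ("rain",),
    "freezing rain": ("rain", "snow_ice"),
    "snow": ("snow_ice",),
    "overcast": ("clouds",),
    "thunderstorm": ("storms", "clouds"),
    "clear": ("clear",),
}


def daily_condition_pct(hourly_conditions):
    """Percentage of a day's observations that fall in each pooled category."""
    n = len(hourly_conditions)
    if n == 0:
        return {}
    counts = defaultdict(int)
    for condition in hourly_conditions:
        for category in POOLS.get(condition, ()):
            counts[category] += 1
    return {category: 100.0 * k / n for category, k in counts.items()}


# Hypothetical day with 24 hourly observations.
day = ["rain"] * 12 + ["overcast"] * 6 + ["freezing rain"] * 3 + ["clear"] * 3
pct = daily_condition_pct(day)
```

In this made-up day the percentages sum to more than 100 because "freezing rain" contributes to both the rain and the snow/ice categories.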
Other hourly weather data were also available related to the temperature, dew point, humidity, air pressure, precipitation and wind speeds. Temperature was summarized by the daily minimum, median, and maximum as well as the daily change. Pressure change was similarly calculated. In most other cases, the median daily value was used to summarize the hourly records. As with the ridership data, future weather data are not available so lagged versions of these data were used in the models. In total, there were 18 weather related predictors available.
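A small sketch of the daily temperature summaries is shown here. The text does not define "daily change", so this example assumes it is the daily maximum minus the daily minimum; the function name and the readings are hypothetical.

```python
from statistics import median


def summarize_temperature(hourly_temps):
    """Collapse one day's hourly temperatures into daily summaries.

    "change" is ASSUMED here to mean max minus min; the source text
    does not give the exact definition.
    """
    low, high = min(hourly_temps), max(hourly_temps)
    return {
        "min": low,
        "median": median(hourly_temps),
        "max": high,
        "change": high - low,
    }


# Hypothetical hourly readings (degrees Fahrenheit) for part of a day.
hourly_temps = [30.0, 28.0, 27.0, 33.0, 41.0, 38.0]
summary = summarize_temperature(hourly_temps)
```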
In addition to daily weather data, average weekly gasoline prices were obtained for the Chicago region (U.S. Energy Information Administration 2017a) from 2001 through 2016. For the same time period, monthly unemployment rates were pulled from the United States Census Bureau (2017). The potential usefulness of these predictors will be discussed below.
4.2 Visualizations for Numeric Data: Exploring the Chicago Train Ridership Data
4.2.1 Box Plots, Violin Plots, and Histograms
Univariate visualizations are used to understand the distribution of a single variable. A few common univariate visualizations are box-and-whisker plots (i.e. box plot), violin plots, or histograms. While these are simple graphical tools, they provide great value in comprehending characteristics of the quantity of interest.
Because the foremost goal of modeling is to understand variation in the response, our first step should be to understand the distribution of the response. For a continuous response such as the ridership at the Clark/Lake station, it is important to understand if the response has a symmetric distribution, if the distribution has a decreasing frequency of larger observations (i.e. the distribution is skewed), if the distribution appears to be made up of two or more individual distributions (i.e. the distribution has multiple peaks or modes), or if there appear to be unusually low or high observations (i.e. outliers).
Understanding the distribution of the response as well as its variation provides a lower bound of our expectations of model performance. That is, if a model contains meaningful predictors, then the residuals from a model that contains these predictors should have less variation than what we see in the response. Furthermore, the distribution of the response may indicate that the response should be transformed prior to analysis. For example, responses that have a distribution where the frequency of response proportionally decreases with larger values may indicate that the response follows a log-normal distribution. In this case, log-transforming the response would induce a normal (bell-shaped, symmetric) distribution and often will enable a model to have better predictive performance. A third reason why we should work to understand the response is because the distribution may provide clues for including or creating features that help explain the response.
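As a rough numerical companion to the log-transformation argument, the sketch below computes the moment-based sample skewness before and after taking logs. The data are made up: they were chosen so that their logarithms are evenly spaced, which makes the effect easy to see. It is only an illustration, not a test for log-normality.

```python
import math


def skewness(values):
    """Moment-based sample skewness: m3 / m2^(3/2)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed values whose logs are equally spaced.
raw = [1, 2, 4, 8, 16, 32, 64]
logged = [math.log(v) for v in raw]

skew_raw = skewness(raw)        # clearly positive: long right tail
skew_logged = skewness(logged)  # essentially zero: symmetric on the log scale
```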
As a simple example of the importance of understanding the response distribution, consider Figure 4.2 which displays a box plot of the response for the ridership at the Clark/Lake station. The box plot was originally developed by John Tukey as a quick way to assess a variable’s distribution (Tukey 1977), and consists of the minimum, lower quartile, median, upper quartile and maximum of the data. Alternative versions of the box plot extend the whiskers to a value beyond which samples would be considered unusually high (or low) (Frigge, Hoaglin, and Iglewicz 1989). A variable that has a symmetric distribution has equal spacing across the quartiles making the box and whiskers also appear symmetric. Alternatively, a variable that has fewer values in a wider range of space will not appear symmetric.
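The quantities behind a box plot can be computed directly. The sketch below uses Python's `statistics.quantiles` with the inclusive method and assumes the common 1.5 times the interquartile range rule for the whisker limits; the text does not say which rule Figure 4.2 uses, and the data are hypothetical.

```python
from statistics import quantiles


def box_plot_stats(values, whisker=1.5):
    """Quartiles, whisker ends, and flagged points for a box plot.

    Points beyond `whisker` * IQR from the box are flagged as unusual
    (an ASSUMED rule; other conventions exist).
    """
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - whisker * iqr, q3 + whisker * iqr
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": [v for v in values if v < lower_fence or v > upper_fence],
    }


# Hypothetical values with one unusually high observation.
stats = box_plot_stats([2, 4, 4, 5, 6, 7, 8, 9, 30])
```

Here the upper whisker stops at 9 and the value 30 is drawn as a separate point, whereas the original Tukey version described above would extend the whisker all the way to the maximum.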
A drawback of the box plot is that it is not effective at identifying distributions that have multiple peaks or modes. As an example, let’s now look at the distribution of ridership at the Clark/Lake station (Figure 4.3). Part (a) of this figure is a histogram of the data. To create a histogram, the data are binned into equal regions of the variable’s value. The number of samples is counted in each region, and a bar is created with the height of the frequency (or percentage) of samples in that region. Like box plots, histograms are simple to create, and these figures offer the ability to see additional distributional characteristics. In the ridership distribution, there are two peaks, which could represent two different mechanisms that affect ridership. The box plot (b) is unable to capture this important nuance. To achieve a compact visualization of the distribution that retains histogram-like characteristics, Hintze and Nelson (1998) developed the violin plot. This plot is created by generating a density or distribution of the data and its mirror image. Figure 4.3 (c) is the violin plot, where we can now see the two distinct peaks in ridership distribution. The lower quartile, median, and upper quartile can be added to a violin plot to also consider this information in the overall assessment of the distribution.
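The binning step that underlies a histogram can be sketched in a few lines. The ridership values below are invented to mimic a two-peaked distribution and are not the Clark/Lake data; the function name is hypothetical.

```python
def histogram(values, n_bins):
    """Count values in `n_bins` equal-width bins spanning min..max.

    Returns (edges, counts). The maximum value is placed in the last bin.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # Degenerate case: every value is identical.
        return [lo, hi], [len(values)]
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts


# Hypothetical daily ridership (thousands): a low cluster and a high cluster.
ridership = [3, 4, 4, 5, 5, 6, 18, 19, 20, 20, 21, 21, 22, 23]
edges, counts = histogram(ridership, n_bins=5)
```

The counts are high in the first and last bins and empty in between, which is the two-peak pattern that a box plot of the same values would hide.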
These data will be analyzed in several chapters. Given the range of the daily ridership numbers, there was some question as to whether the outcome should be modeled in the natural units or on the log scale. On one hand, the natural units make interpretation of the results easier since the RMSE would be in terms of riders. However, if the outcome were transformed prior to modeling, it would ensure that negative ridership could not be predicted. The bimodal nature of these data, as well as distributions of ridership for each year that have a longer tail on the right, made this decision difficult. In the end, a handful of models were fit both ways to make the determination. The models computed in the natural units appeared to have slightly better performance and, for this reason, all models were analyzed without a transformation of the response (as seen in Figure 1.7).
Examining the distribution of each predictor can help to guide our decisions about the need to engineer the features through transformation prior to analysis. When we have a moderate number of predictors (< ~100) and when the predictors are on the same order of magnitude, we can visualize the distributions simultaneously using side-by-side box or violin plots. Consider again the ridership data with the two-week lag in ridership as predictors. The distributions across these stations for weekday ridership for 2016 are provided in Figure 4.4. To help see patterns more clearly, ridership is ordered from the largest median (left) to the smallest median (right). Several characteristics stand out: variability in ridership increases with the median ridership, there are a number of unusually low and unusually high values for each station, and a few stations have distinctly large variation. One station particularly stands out, which is about one-quarter of the way from the left. This happens to be the Addison station which is the nearest stop to Wrigley Field. The wider distribution is due to ridership associated with the weekday home games for the Chicago Cubs, with attendance at its peak reaching close to the most frequently traveled stations. If we were trying to predict ridership at the Addison station, then the Cubs’ home game schedule would be important information for any model. The unusually low values for the majority of the stations will be discussed next.
As the number of predictors grows, our ability to visualize the individual distributions lessens and may be practically impossible. In this situation, a subset of predictors that are thought to be important can be examined using these techniques.
4.2.2 Augmenting Visualizations through Faceting, Colors, and Shapes
Additional dimensions can be added to almost any figure by using faceting, colors, and shapes. Faceting refers to creating the same type of plot (e.g. a scatterplot) and splitting the plot into different panels based on some variable. Figure 3.2 is a good example. While this is a simple approach, these types of augmentation can be powerful tools for seeing important patterns that can be used to direct the engineering of new features. The Clark/Lake station ridership distribution is a prime candidate for adding another dimension. As we saw above, Figure 4.3 has two distinct peaks. A reasonable explanation for this would be that ridership is different for weekdays than for weekends. Figure 4.5 partitions the ridership distribution by part of the week through color and faceting (for ease of visualization). Part of the week was not a predictor in the original data set; by using intuition and carefully examining the distribution, we have found a feature that is important for explaining the response and is therefore necessary for modeling ridership.
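Deriving the part-of-the-week feature and splitting the response by it might be done as sketched below. The ridership values are hypothetical, and the grouping shown here is only the data preparation that a faceted or colored plot would be built on.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import median


def part_of_week(day):
    """Label a date as 'Weekday' (Mon-Fri) or 'Weekend' (Sat-Sun)."""
    return "Weekend" if day.weekday() >= 5 else "Weekday"


# Hypothetical ridership (thousands) for two weeks starting on a Monday.
start = date(2016, 8, 1)
values = [20.1, 21.3, 21.0, 20.7, 19.8, 5.2, 4.1,
          20.4, 21.1, 20.9, 20.6, 19.5, 5.6, 4.3]
ridership = {start + timedelta(days=i): v for i, v in enumerate(values)}

# One group per facet; each group would be drawn in its own panel or color.
groups = defaultdict(list)
for day, riders in ridership.items():
    groups[part_of_week(day)].append(riders)

medians = {part: median(vals) for part, vals in groups.items()}
```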
Figure 4.5 invites us to pursue understanding of these data further. Careful viewing of the weekday ridership distribution should draw our eye to a long tail on the left which is a result of a number of days with lower ridership similar to the range of ridership on weekends. What would cause weekday ridership to be low? If we can uncover the cause, then we can engineer a feature. A model that has the ability to explain these lower values will have better predictive performance than a model that does not.
The use of colors and shapes for elucidating predictive information will be illustrated in several of the following sections.
4.2.3 Scatter Plots
Augmenting visualizations through the use of faceting, color, or shapes is one way to incorporate an additional dimension in a figure. Another approach is to directly add another dimension to a graph. When working with two numeric variables, this type of graph is called a scatter plot. A scatter plot arranges one variable on the x-axis and another on the y-axis. Each sample is then plotted in this coordinate space. We can use this type of figure to assess the relationship between a predictor and the response, to uncover relationships between pairs of predictors, and to understand if a new predictor may be useful to include in a model. These simple relationships are the first to provide clues to assist in engineering characteristics that may not be directly available in the data.
If we are trying to predict ridership at the Clark/Lake station then we could anticipate that recent past ridership information should be related to current ridership. That is to say another potential predictor to consider would be the previous day’s or previous week’s ridership information. Because we know that weekday and weekend have different distributions, a one-day lag would be less useful for predicting ridership on Monday or Saturday. A week-based lag would not have this difficulty (although it would be further apart in time) since the information occurs on the same day of the week. Because we are interested in predicting ridership two weeks in advance, we will create the 14-day lag in ridership for the Clark/Lake station.
In this case, we can directly understand the relationship between these variables by creating a scatter plot (Figure 4.6). This figure tells us several characteristics that we need to know: there is a strong linear relationship between the 14-day lag and current day ridership, there are two distinct groups of points (due to the part of the week), and there are many 14-day lag/current day pairs of days that lie far off from the overall scatter of points. These results indicate that the 14-day lag will be a crucial predictor for explaining current day ridership. Moreover, uncovering the explanation of samples that are far off from the overall pattern visualized here will lead us to a new feature that will be useful as an input to models.
4.2.4 Heatmaps
Low weekday ridership as illustrated in Figure 4.5 might be due to annual occurrences; to investigate this hypothesis, we will need to augment the data. The first step is to create an indicator variable for weekdays with ridership less than 10,000 or greater than or equal to 10,000. We then need a visualization that allows us to see when these unusual values occur. A visualization that would elucidate annual patterns in this context is a heatmap. A heatmap is a versatile plot that can be created utilizing almost any type of predictor and displays one predictor on the x-axis and another predictor on the y-axis. In this figure the x- and y-axis predictors must be able to be categorized. The categorized predictors then form a grid, and the grid is filled by another variable. The filling variable can be either continuous or categorical. If continuous, then the boxes in the grid are colored on a continuous scale from the lowest value of the filling predictor to the highest value. If the filling variable is categorical, then the boxes have distinct colors for each category.
For the ridership data, we will create a month and day predictor, a year predictor, and an indicator of weekday ridership less than 10,000 rides.
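These three derived features could be assembled as in the sketch below. The function name, the dictionary layout, and the ridership counts are hypothetical; the 10,000 threshold and the restriction to weekdays follow the text.

```python
from datetime import date


def heatmap_cells(ridership, threshold=10_000):
    """Build heatmap cells keyed by (month-day, year).

    `ridership` maps a date to the raw number of station entries.
    Weekends are dropped; each remaining cell is True when the day's
    ridership is below `threshold`.
    """
    cells = {}
    for day, riders in ridership.items():
        if day.weekday() >= 5:  # skip Saturday and Sunday
            continue
        cells[(day.strftime("%m-%d"), day.year)] = riders < threshold
    return cells


# Hypothetical ridership counts around early July.
ridership = {
    date(2015, 7, 3): 5_000,    # Friday
    date(2016, 7, 4): 4_000,    # Monday
    date(2016, 7, 5): 21_000,   # Tuesday
    date(2016, 7, 9): 6_000,    # Saturday, excluded
}
cells = heatmap_cells(ridership)
```

Each key gives the row (month and day) and column (year) of one box in the grid, and the boolean value determines its fill color.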
These new features are the inputs to the heatmap (Figure 4.7). In this figure, the x-axis represents the year and the y-axis represents the month and day. Red boxes indicate weekdays that have ridership less than 10,000 for the Clark/Lake station. The heat map of the data in this form brings out some clear trends. Low ridership occurs on or around the beginning of the year, mid-January, mid-February until 2007, late-May, early-July, early-September, late-November, and late-December. Readers in the US would recognize these patterns as regularly observed holidays. Because holidays are known in advance, adding a feature for common weekday holidays will be beneficial for models to explain ridership.
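A holiday indicator can be computed from the calendar alone, which is what makes it usable for forecasting. The sketch below covers only a partial, assumed list of US holidays (New Year's Day, Memorial Day, Independence Day, Labor Day, Thanksgiving, and Christmas Day) and ignores observed-date shifts when a holiday falls on a weekend; it is not the holiday definition used for the figures.

```python
from datetime import date, timedelta


def nth_weekday(year, month, weekday, n):
    """Date of the n-th given weekday (Monday=0) in a month."""
    first = date(year, month, 1)
    offset = (weekday - first.weekday()) % 7
    return first + timedelta(days=offset + 7 * (n - 1))


def last_weekday(year, month, weekday):
    """Date of the last given weekday (Monday=0) in a month."""
    # First day of the following month, then step back one day.
    next_month = date(year + (month == 12), month % 12 + 1, 1)
    last = next_month - timedelta(days=1)
    return last - timedelta(days=(last.weekday() - weekday) % 7)


def holidays(year):
    """A partial, ASSUMED set of US holidays for one year."""
    return {
        date(year, 1, 1),              # New Year's Day
        last_weekday(year, 5, 0),      # Memorial Day: last Monday in May
        date(year, 7, 4),              # Independence Day
        nth_weekday(year, 9, 0, 1),    # Labor Day: first Monday in September
        nth_weekday(year, 11, 3, 4),   # Thanksgiving: fourth Thursday in November
        date(year, 12, 25),            # Christmas Day
    }


def is_holiday(day):
    """Indicator feature: True when `day` is in the assumed holiday set."""
    return day in holidays(day.year)
```

For example, `is_holiday(date(2016, 11, 24))` is true because that date is the fourth Thursday of November 2016.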
Carefully observing the heatmap points to two days that do not follow the annual patterns: February 2, 2011 and January 6, 2014. These anomalies were due to extreme weather. On February 2, 2011, there was a blizzard that dumped 21.2 inches of snow on the region. Then on January 6, 2014, Chicago set a record low temperature of -16F. Extreme weather instances are infrequent, so adding this predictor will have limited usefulness in a model. If the frequency of extreme weather increases in the future, then using forecast data could become a valuable predictor for explaining ridership.
Now that we understand the effect of major US holidays, we will exclude these values from the scatterplot of 14-day lag versus current day ridership (Figure 4.8). Most of the points that fell off the diagonal of Figure 4.6 are now gone. However, a couple of points remain. The day associated with these points was June 11, 2010, which was the city’s celebration for the Chicago Blackhawks winning the Stanley Cup. While these types of celebrations are infrequent, engineering a feature to anticipate these unusual events will aid in reducing the prediction error for a model.
4.2.5 Correlation Matrix Plots
An extension of the scatter plot is the correlation matrix plot. In this plot, the correlation between each pair of variables is plotted in the form of a matrix. Every variable is represented on the outer x-axis and outer y-axis of the matrix, and the strength of the correlation is represented by the color in the respective location in the matrix. We first saw this visualization in Figure 2.3. Here we will construct a similar image for the 14-day lag in ridership across stations for non-holiday weekdays in 2016 for the Chicago data.
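The numbers that such a plot colors are pairwise correlations. The sketch below computes a Pearson correlation matrix with the standard library only; the station names and lagged ridership values are hypothetical and were chosen so that the expected correlations are obvious.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(columns):
    """Nested dict of pairwise correlations: matrix[row][col]."""
    names = list(columns)
    return {r: {c: pearson(columns[r], columns[c]) for c in names} for r in names}


# Hypothetical 14-day lagged ridership (thousands) at three stations.
lagged = {
    "station_a": [10.0, 12.0, 11.0, 13.0, 12.5],
    "station_b": [21.0, 25.0, 23.0, 27.0, 26.0],   # 2 * station_a + 1
    "station_c": [9.0, 7.0, 8.0, 6.0, 6.5],        # 19 - station_a
}
corr = correlation_matrix(lagged)
```

In a correlation matrix plot, each entry `corr[row][col]` would set the color of one cell, with the diagonal always equal to one.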