4.3 Visualizations for Categorical Data: Exploring the OkCupid Data

To illustrate different visualization techniques for qualitative data, the OkCupid data are used. These data were first introduced in Section 3.1. Recall that the training set consisted of 38,809 profiles and the goal was to predict whether the profile’s author worked in a STEM field. The event rate is 18.5% and most predictors were categorical in nature. These data are discussed more in the next chapter.

4.3.1 Visualizing Relationships between Outcomes and Predictors

Traditionally, bar charts are used to represent counts of categorical values. For example, Figure 4.14(a) shows the frequency of the stated religion, partitioned and colored by outcome category. The virtue of this plot is that it is easy to see the most and least frequent categories. However, it is otherwise problematic for several reasons:

  1. To understand if any religions are associated with the outcome, the reader’s task is to visually judge the ratio of each dark blue bar to the corresponding light blue bar across all religions and to then determine if any of the ratios are different from random chance. This figure is ordered from greatest ratio (left) to least ratio (right), which, in this form, may be difficult for the reader to see.

  2. The plot is indirectly illustrating the characteristic of the data that we are interested in, specifically, the ratio of frequencies between the STEM and non-STEM profiles. We don’t care how many Hindus are in STEM fields; instead, the ratio of fields within the Hindu religion is the focus. In other words, the plot obscures the statistical hypothesis that we are interested in: is the rate of Hindu STEM profiles different from what we would expect by chance?

  3. If the rate of STEM profiles within a religion is the focus, bar charts give no sense of uncertainty in that quantity. In this case, the uncertainty comes from two sources. First, the number of profiles in each religion can obviously affect the variation in the proportion of STEM profiles. This is shown by the height of the bars, but bar height is not a precise way to convey the noise. Second, since the statistic of interest is a proportion, the variability in the statistic becomes larger as the rate of STEM profiles approaches 50% (all other things being equal).

To solve the first two issues, we might show the within-religion percentages of the STEM profiles. Figure 4.14(b) shows this alternative version of the bar chart, which is an improvement since the proportion of STEM profiles is now the focus. It is more obvious how the religions were ordered, and we can see how much each deviates from the baseline rate of 18.4%. However, we still haven’t illustrated the uncertainty; as a result, we cannot assess whether the rate for profiles with no stated religion is truly different from the rate for agnostics. Also, while Figure 4.14(b) directly compares proportions across religions, it does not give any sense of the frequency of each religion. For these data there are very few Islamic profiles; this important information cannot be seen in this display.

Figure 4.14: Three different visualizations of the relationship between religion and the outcome.

Figure 4.14(c) solves all three of the problems listed above. For each religion, the proportion of STEM profiles is calculated and a 95% confidence interval is shown to help understand the noise around this value. We can clearly see which religions deviate from randomness, and the width of the error bars helps the reader understand how much each estimate should be trusted. This image is the best of the three since it directly shows the magnitude of the difference as well as the uncertainty. In situations where the number of categories on the x-axis is large, a volcano plot can be used to show the results (see Figures 2.6 and 5.4).
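To make the uncertainty calculation concrete, here is a minimal sketch (in Python, with invented counts; the actual per-religion tallies are not reproduced here) of the normal-approximation 95% confidence interval for a within-group proportion:

```python
import numpy as np

def prop_ci(successes, n, z=1.96):
    """Normal-approximation 95% confidence interval for a proportion."""
    p = successes / n
    se = np.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - z * se), min(1.0, p + z * se)

# Hypothetical (STEM count, total count) pairs per religion.
counts = {
    "agnosticism": (1200, 5000),
    "atheism":     (900, 3500),
    "hinduism":    (150, 450),
}

for religion, (stem, n) in sorted(counts.items()):
    p, lo, hi = prop_ci(stem, n)
    print(f"{religion:12s} {p:.3f} [{lo:.3f}, {hi:.3f}]")
```

The interval narrows as the group size grows and widens as the proportion approaches 50%, matching the two sources of uncertainty discussed above.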

The point of this discussion is not that summary statistics with confidence intervals are always the solution to a visualization problem. The takeaway message is that each graph should have a clearly defined hypothesis and that this hypothesis is shown concisely in a way that allows the reader to make quick and informative judgments based on the data.

Finally, does religion appear to be related to the outcome? Since there is a gradation of rates of STEM professions between the groups, it would appear so. If there were no relationship, all of the rates would be approximately the same.

How would one visualize the relationship between a categorical outcome and a numeric predictor? As an example, the total length of all the profile essays will be used to illustrate a possible solution. In this analysis, all nine essay answers were concatenated into a single text string; there were 1,197 profiles in the training set where the user did not fill out any of the open text fields. The distribution of the text length was very right-skewed, with a median value of 1,853 characters. The maximum length was approximately 59K characters, although 10% of the profiles contained fewer than 433 characters. To investigate this, the distribution of the total number of essay characters is shown in Figure 4.15(a). The x-axis is the log of the character count, and profiles without essay text are shown here as a zero. The distributions appear to be extremely similar between classes, so this predictor in isolation is unlikely to be important. However, as discussed above, it would be better to try to directly answer the question.

To do this, another smoother is used to model the data. In this case, a regression spline smoother (Wood 2006) is used to model the probability of a STEM profile as a function of the (log) essay length. This involves fitting a logistic regression model using a basis expansion, meaning that our original predictor, log essay length, is used to create a set of artificial features that will go into a logistic regression model. The nature of these features allows a flexible, local representation of the class probability across values of essay length; they are discussed more in Section 6.2.1.[1]

Figure 4.15: The effect of essay length on the outcome.

The results are shown in Figure 4.15(b). The black line represents the class probability from the logistic regression model and the bands denote 95% confidence intervals around the fit. The horizontal red line indicates the baseline probability of STEM profiles from the training set. Prior to a length of about \(10^{1.5}\), the profile is slightly less likely than chance to be STEM; larger profiles show an increase in the probability. This might seem like a worthwhile predictor, but consider the scale of the y-axis: if this were plotted on the full probability scale of [0, 1], the trend would appear virtually flat. The increase in the likelihood of being a STEM profile is at most about 3.5%. Also, note that the confidence bands rapidly expand around \(10^{1.75}\), mostly due to a decreasing number of data points in this range, so the potential increase in probability has a high degree of uncertainty. This predictor might be worth including in a model but is unlikely to show a strong effect on its own.

4.3.2 Exploring Relationships Between Categorical Predictors

Before deciding how to use predictors that contain non-numeric data, it is critical to understand their characteristics and relationships with other predictors. An unfortunate practice that is often seen relies on a large number of basic statistical analyses of two-way tables that boil relationships down to numerical summaries such as the Chi-squared (\(\chi^2\)) test of association. Often, the best approach is to visualize the data. When considering relationships between categorical data, there are several options. Once a cross-tabulation between variables is created, mosaic plots can once again be used to understand the relationship between variables. For the OkCupid data, it is conceivable that the questionnaire responses for drug and alcohol use might be related; for these variables, Figure 4.16 shows the mosaic plot. For alcohol, the majority of the data indicate social drinking, while the vast majority of the drug responses were “never” or missing. Is there any relationship between these variables? Do any of the responses “cluster” with others?
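The cross-tabulation step can be sketched with pandas; this is a minimal hypothetical example (the field names and responses imitate the OkCupid `drinks`/`drugs` questions but the data are invented), with missing answers kept as their own category as in the figure:

```python
import pandas as pd

# Hypothetical responses mimicking the OkCupid drinks/drugs fields.
profiles = pd.DataFrame({
    "drinks": ["socially", "socially", "often", "rarely", "socially", None],
    "drugs":  ["never", None, "sometimes", "never", "never", None],
})

# Treat missing answers as their own category before tabulating,
# since non-response carries information here.
tab = pd.crosstab(
    profiles["drinks"].fillna("missing"),
    profiles["drugs"].fillna("missing"),
)
print(tab)
```

A mosaic plot is then just an area-proportional rendering of this table.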

Figure 4.16: A mosaic plot of the drug and alcohol data in the OkCupid data.

These questions can be answered using correspondence analysis (Greenacre 2017), in which the cross-tabulation is analyzed. In a contingency table, the frequency distributions of the variables can be used to determine the expected cell counts, which mimic what would occur if the two variables had no relationship. The traditional \(\chi^2\) test assesses the association between the variables by summing functions of the deviations of the observed counts from these expected values (the cell residuals). If the two variables in the table are strongly associated, the overall \(\chi^2\) statistic is large. Instead of summing these residual functions, correspondence analysis decomposes them to determine new variables that account for the largest fraction of the \(\chi^2\) statistic.[2] These new variables, called the principal coordinates, can be computed for both variables in the table and shown in the same plot. These plots can contain several features:

  • The data on each axis should be evaluated in terms of how much of the information in the original table the principal coordinate accounts for. If a coordinate only captures a small percentage of the overall \(\chi^2\) statistic, the patterns shown in that direction should not be over-interpreted.

  • Categories that fall near the origin represent the “average” value of the data. From the mosaic plot, it is clear that there are categories for each variable that have the largest cell frequencies (e.g., “never” drugs and “social” alcohol consumption). More unusual categories are located on the outskirts of the principal coordinate scatter plot.

  • Categories for a single variable whose principal coordinates are close to one another are indicative of redundancy, meaning that there may be the possibility of pooling these groups.

  • Categories in different variables that fall near each other in the principal coordinate space indicate an association between these categories.
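The decomposition described above can be sketched directly with NumPy: a singular value decomposition of the standardized (Pearson) residuals yields the principal coordinates, and the squared singular values partition \(\chi^2/n\) across the axes. This is a minimal illustrative implementation on an invented table, not the code behind Figure 4.17:

```python
import numpy as np

def correspondence_analysis(N):
    """Principal coordinates of a two-way table via SVD of the
    standardized residuals (simple correspondence analysis)."""
    N = np.asarray(N, dtype=float)
    n = N.sum()
    P = N / n
    r = P.sum(axis=1)                                  # row masses
    c = P.sum(axis=0)                                  # column masses
    # Standardized residuals: the quantities the chi-squared test sums up.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    row_coords = (U * sv) / np.sqrt(r)[:, None]        # row principal coordinates
    col_coords = (Vt.T * sv) / np.sqrt(c)[:, None]     # column principal coordinates
    inertia = sv**2                                    # each axis's share of chi2 / n
    return row_coords, col_coords, inertia

# Hypothetical 3x3 alcohol-by-drugs table (counts invented).
tab = np.array([[400,  30,  5],
                [900, 120, 40],
                [ 60,  25, 20]])
rows, cols, inertia = correspondence_analysis(tab)
```

Plotting the first two columns of `rows` and `cols` on one set of axes gives a biplot like Figure 4.17, with each axis labeled by its share of the total inertia.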

In the cross-tabulation between alcohol and drug use, the \(\chi^2\) statistic is very large (4043.8) for its degrees of freedom (18) and is associated with a p-value that is effectively zero. This indicates that there is a strong association between these two variables.
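For reference, this test is a one-liner with SciPy. The table below is invented, but its shape (7 alcohol categories by 4 drug categories, counting the missing-answer levels) is chosen to reproduce the 18 degrees of freedom reported above:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical alcohol-by-drugs counts: 7 row categories x 4 column
# categories gives (7 - 1) * (4 - 1) = 18 degrees of freedom.
tab = np.array([
    [ 30,  10,  5,  12],
    [120,  40, 15,  60],
    [900, 150, 40, 300],
    [ 60,  25, 20,  30],
    [ 15,  10,  8,   9],
    [ 40,  12,  6,  20],
    [200,  30, 10, 400],
])
chi2, pvalue, dof, expected = chi2_contingency(tab)
print(f"chi2 = {chi2:.1f}, df = {dof}, p = {pvalue:.3g}")
```

The `expected` array returned here is exactly the no-association baseline that correspondence analysis works from.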

Figure 4.17 shows the principal coordinates from the correspondence analysis. The component on the x-axis accounts for more than half of the \(\chi^2\) statistic. Values near zero on this dimension tend to correspond to profiles where no choice was made or where substance use was sporadic. To the right, away from zero, are less frequently occurring values. A small cluster indicates that occasional drug use and frequent drinking tend to have a specific association in the data. Another, more extreme, cluster shows an association between very frequent alcohol use and drug use. The y-axis of the plot is mostly driven by the missing data (understandably, since 26% of the table has at least one missing response) and accounts for another third of the \(\chi^2\) statistic. These results indicate that responses to these two questions track each other and might be measuring the same underlying characteristic.

Figure 4.17: The correspondence analysis principal coordinates for the drug and alcohol data in the OkCupid data.


  1. There are other types of smoothers that can be used to discover potential nonlinear patterns in the data. One, called loess, is very effective and uses a series of moving regression lines across the predictor values to make predictions at a particular point (Cleveland 1979).

  2. The mechanisms are very similar to principal component analysis (PCA), discussed in Chapter 6. For example, both use a singular value decomposition to compute the new variables.