5.6 Creating Features from Text Data
Often, data contain textual fields that are gathered from questionnaires, articles, reviews, tweets, and other sources. For example, the OkCupid data contain the responses to nine open text questions, such as “my self-summary” and “six things I could never do without”. The open responses are likely to contain information relevant to the outcome. Individuals in a STEM field are likely to have a different vocabulary than, say, someone in the humanities, so the words or phrases used in these open text fields could be very important to predicting the outcome. This kind of data is qualitative and requires more effort to put into a form that models can consume. How, then, can these data be explored and represented for a model?
For example, some profile text answers contained links to external websites. A reasonable hypothesis is that the presence of a hyperlink could be related to the person’s profession. Table 5.4 shows the results for the random subset of profiles. The rate of hyperlinks in the STEM profiles was 21%, while the rate in the non-STEM profiles was 11.6%. One way to evaluate a difference in two proportions is the odds-ratio (Agresti 2012). First, the odds of an event that occurs with rate \(p\) are defined as \(p/(1-p)\). For the STEM profiles, the odds of containing a hyperlink are relatively small, with a value of \(0.21/0.79 = 0.266\). For the non-STEM profiles, the odds are even smaller (0.132). The ratio of these two quantities can be used to understand how the effect of having a hyperlink differs between the two professions. In this case, the odds of a profile being STEM are nearly 2.0-times higher when the profile contains a link. Basic statistics can be used to assign a lower 95% confidence bound on this quantity. For these data the lower bound is 1.8; since the interval does not include a value of 1.0, the increase in the odds is unlikely to be due to random noise. Given these results, an indicator for the presence of a hyperlink will likely benefit a model and should be included.
           STEM   other
Link       1051     582
No Link    3949    4418
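The odds-ratio and its lower confidence bound can be reproduced directly from the counts in Table 5.4. The sketch below uses a Wald-style interval on the log-odds scale, which is one common choice and not necessarily the exact method used for these data:

```python
import math

# Counts from Table 5.4 (rows: link / no link; columns: STEM / other).
stem_link, stem_no_link = 1051, 3949
other_link, other_no_link = 582, 4418

# Odds of a hyperlink within each group: p / (1 - p), computed from the counts.
odds_stem = stem_link / stem_no_link      # ~0.266
odds_other = other_link / other_no_link   # ~0.132
odds_ratio = odds_stem / odds_other       # ~2.02

# Wald lower 95% confidence bound, computed on the log scale.
se_log_or = math.sqrt(1 / stem_link + 1 / stem_no_link +
                      1 / other_link + 1 / other_no_link)
lower_95 = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)  # ~1.81
```

The lower bound of about 1.8 matches the value quoted above, confirming that the interval excludes 1.0.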
Are there words or phrases that would make good predictors of the outcome? To determine this, the text data must first be processed and cleaned. At this point, there were 63,630 distinct words in the subsample of 10,000 profiles. A variety of features were computed on the data, such as the number of commas, hashtags, mentions, exclamation points, and so on. This resulted in a set of 13 new “text-related” features. In these data, there is an abundance of HTML markup in the text, such as `<br />` and `<a href="link">` tags. These were removed from the data, along with punctuation, line breaks, and other symbols. Additionally, some words that were nonsensical repeated characters, such as `***` or `aaaaaaaaaa`, were removed. Additional types of text preprocessing are described below; for more information, Christopher, Prabhakar, and Hinrich (2008) is a good introduction to the analysis of text data, while Silge and Robinson (2017) is an excellent reference on the computational aspects of analyzing such data.
Given this set of 63,630 words and their associated outcome classes, odds-ratios can be computed. First, the words were filtered so that only terms with at least 50 occurrences in the 10,000 profiles would be analyzed. This cutoff was chosen so that there was a sufficient frequency for modeling. With this constraint, the number of potential keywords was reduced to 4,940 terms. For each of these, the odds-ratio and an associated p-value were computed. The p-value tests the hypothesis that the odds of the keyword occurring are equal in the two professional groups (i.e., that the odds-ratio is 1.0). However, the p-values can easily provide misleading results, for two reasons:
In isolation, the p-value only addresses the question “Is there a difference in the odds between the two groups?” The more appropriate question is “How much of a difference is there between the groups?” The p-value does not measure the magnitude of the difference.
When testing a single (predefined) hypothesis, the false positive rate for a hypothesis test using \(\alpha = 0.05\) is 5%. However, here there are 4,940 tests on the same data set, which inflates the overall false positive rate dramatically. The false-discovery rate (FDR) p-value correction (Efron and Hastie 2016) was developed for just this type of scenario. This procedure uses the entire distribution of p-values and attenuates the false positive rate. If a particular keyword has an FDR value of 0.30, this implies that the collection of keywords with FDR values of 0.30 or less has a collective 30% false discovery rate. For this reason, the focus here will be on the FDR values generated using the Benjamini-Hochberg correction (Benjamini and Hochberg 1995).
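The Benjamini-Hochberg step-up adjustment itself is short enough to sketch. The function below is a minimal illustration in plain Python, not the production implementation used for these data:

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg FDR adjustment for a list of p-values."""
    m = len(p_values)
    # Sort p-values ascending, remembering their original positions.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, taking a running minimum of p * m / rank.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

For example, `bh_adjust([0.005, 0.009, 0.05, 0.2])` yields approximately `[0.018, 0.018, 0.067, 0.2]`: the second-smallest p-value’s adjusted value (0.018) propagates down to the smallest because of the running minimum.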
To characterize these results, a volcano plot is shown in Figure 5.4, where the estimated odds-ratio is shown on the x-axis and the negative log of the FDR values on the y-axis (larger values indicate greater statistical significance). The size of the points reflects the number of events found in the sampled data set. Keywords falling in the upper left- or upper right-hand regions show a strong difference between the classes that is unlikely to be a random result. The plot shows far more keywords with a higher likelihood of being found in the STEM profiles than in the non-STEM professions, as more points fall on the upper right-hand side of one on the x-axis. Several of these have extremely high levels of statistical significance, with FDR values that are vanishingly small.
As a rough criterion for “importance”, keywords with odds-ratios of at least 2 (in either direction) and an FDR value less than \(10^{-5}\) will be considered for modeling. This results in 50 keywords:
`biotech`, `climbing`, `code`, `coding`, `company`, `computer`, `computers`, `data`, `developer`, `ender`, `engineer`, `engineering`, `feynman`, `firefly`, `fixing`, `geek`, `geeky`, `im`, `internet`, `lab`, `law`, `lol`, `marketing`, `math`, `matrix`, `mechanical`, `neal`, `nerd`, `problems`, `programmer`, `programming`, `robots`, `science`, `scientist`, `silicon`, `software`, `solve`, `solving`, `startup`, `stephenson`, `student`, `systems`, `teacher`, `tech`, `techie`, `technical`, `technology`, `therapist`, `web`, `websites`
Of these keywords, only 7 were enriched in the non-STEM profiles: `im`, `lol`, `law`, `teacher`, `student`, `therapist`, and `marketing`. Of the STEM-enriched keywords, many make sense (and play to stereotypes). The majority are related to occupation (e.g., `engin`, `startup`, and `scienc`), while others are clearly related to popular geek culture and sci-fi, such as `firefli`^{47}, `neal`, and `stephenson`^{48}.
One final set of 9 features was computed related to the sentiment and language of the essays. There are curated collections of words with assigned sentiment values. For example, “horrible” has a fairly negative connotation while “wonderful” is associated with positivity. Words can be assigned qualitative assessments (e.g., “positive”, “neutral”, etc.) or numeric scores where neutrality is given a value of zero. In some problems, sentiment might be a good predictor of the outcome. This feature set included sentiment-related measures, as well as measures of the point of view (i.e., first-, second-, or third-person text) and other language elements.
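As a sketch of how lexicon-based sentiment scoring works, the toy function below averages numeric word scores. The mini-lexicon and its values are invented for illustration; real analyses use curated lexicons with thousands of scored entries:

```python
# Hypothetical mini-lexicon: word -> sentiment score (0 would mean neutral).
LEXICON = {"horrible": -3, "wonderful": 3, "good": 1, "bad": -1}

def sentiment_score(text):
    """Mean lexicon score over the words found in the lexicon (0.0 if none)."""
    words = text.lower().split()
    scores = [LEXICON[w] for w in words if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0
```

For example, a profile saying “what a wonderful day” scores 3.0, while “horrible bad traffic” scores -2.0; text with no lexicon words defaults to neutral.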
Do these features have an impact on a model? A series of logistic regression models were computed using different feature sets:
A basic set of profile characteristics unrelated to the essays (e.g. age, religion, etc.) consisting of 159 predictors (after generating dummy variables). This resulted in a baseline area under the ROC curve of 0.768. The individual resamples for this model are shown in Figure 5.5.
The addition of the simple text features increased the number of predictors to 172. Performance was not appreciably affected as the AUC value was 0.775. These features were discarded from further use.
The keywords were added to the basic profile features. In this case, performance jumped to 0.842. Recall that the keywords were derived from a smaller portion of these data, so this estimate may be slightly optimistic. However, the entire training set was cross-validated and, if overfitting were severe, the resampled performance estimates would not be very good.
The sentiment and language features were added to this model, producing an AUC of 0.844. This indicates that these aspects of the essays did not provide any predictive value above and beyond what was already captured by the previous features.
Overall, the keyword and basic feature model would be the best version found here^{49}.
The strategy shown here for computing and evaluating features from text is fairly simplistic and is not the only approach that could be taken. Other methods for preprocessing text data include:
 removing commonly used stop words, such as “is”, “the”, “and”, etc.
 stemming the words so that similar words, such as the singular and plural versions, are represented as a single entity.
The SMART lexicon of stop words (Lewis et al. 2004) was filtered out of the current set of words, leaving 63,118 unique results. The words were then “stemmed” using the Porter algorithm (Willett 2006) to convert similar words to a common root. For example, these 7 words are fairly similar: `teach`, `teacher`, `teachers`, `teaches`, `teachable`, `teaching`, and `teachings`. Stemming reduces these to 3 unique values: `teach`, `teacher`, and `teachabl`. Once the values were stemmed, there were 45,704 unique words remaining. While this does reduce the potential number of terms to analyze, there are potential drawbacks related to a reduced specificity of the text. Schofield and Mimno (2016) and Schofield, Magnusson, and Mimno (2017) demonstrate that these preprocessing steps can do harm.
One other method for determining relevant features uses the term frequency-inverse document frequency (tf-idf) statistic (Amati and Van Rijsbergen 2002). Here the goal is to find words or terms that are important to individual documents in the collection of documents at hand. For example, if the words in this book were processed, some would have high frequency (e.g., `predictor`, `encode`, or `resample`) but are unusual in most other contexts.
For a word W in document D, the term frequency, tf, is the number of times that W is contained in D (usually adjusted for the length of D). The inverse document frequency, idf, is a weight that normalizes by how often the word occurs in the current collection of documents. As an example, suppose the term frequency of the word `feynman` in a specific profile was 2. In the sample of profiles used to derive the features, this word occurs in 55 of the 10,000 profiles. The inverse document frequency is usually represented as the log of the ratio, or \(\log_2(10000/55) = 7.5\). The tf-idf value for `feynman` in this profile would then be \(2\times\log_2(10000/55)\), or about 15. However, suppose that in the same profile the word `internet` also occurs twice. This word is more prevalent across profiles: it is contained at least once in 1,529 profiles. Its tf-idf value is 5.4; in this way the same raw count is down-weighted according to the word’s abundance in the overall data set. The word `feynman` occurs the same number of times as `internet` in this hypothetical profile, but `feynman` is more distinctive or important (as measured by tf-idf) because it is rarer overall, and thus possibly more effective as a feature for prediction.
One potential issue with tf-idf in predictive models is the notion of the “current collection of documents”. This quantity can be computed for the training set, but what should be done when a single new sample is being predicted (i.e., there is no collection)? One approach is to use the overall idf values from the training set to weight the term frequencies found in new samples.
Additionally, all of the previous approaches consider a single word at a time. Sequences of consecutive words can also be considered and might provide additional information. n-grams are terms that are sequences of n consecutive words. One might imagine that the sequence “I hate computers” would be scored differently than the simple term “computer” when predicting whether a profile corresponds to a STEM profession.
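Extracting n-grams from tokenized text is straightforward; a minimal sketch:

```python
def ngrams(text, n):
    """Return the list of n-grams (tuples of lowercased words) from a string."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
```

For example, `ngrams("I hate computers", 2)` gives the bigrams `[("i", "hate"), ("hate", "computers")]`, and `n=3` yields the single trigram `("i", "hate", "computers")`, which could then be screened with the same odds-ratio approach used for single keywords.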
References
Agresti, A. 2012. Categorical Data Analysis. Wiley-Interscience.
Christopher, D, R Prabhakar, and S Hinrich. 2008. Introduction to Information Retrieval. Cambridge University Press.
Silge, J, and D Robinson. 2017. Text Mining with R: A Tidy Approach. O’Reilly.
Efron, B, and T Hastie. 2016. Computer Age Statistical Inference. Cambridge University Press.
Benjamini, Y, and Y Hochberg. 1995. “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” Journal of the Royal Statistical Society, Series B (Methodological), 289–300.
Lewis, D, Y Yang, T Rose, and F Li. 2004. “RCV1: A New Benchmark Collection for Text Categorization Research.” Journal of Machine Learning Research 5:361–97.
Willett, P. 2006. “The Porter Stemming Algorithm: Then and Now.” Program 40 (3):219–23.
Schofield, A, and D Mimno. 2016. “Comparing Apples to Apple: The Effects of Stemmers on Topic Models.” Transactions of the Association for Computational Linguistics 4:287–300.
Schofield, A, M Magnusson, and D Mimno. 2017. “Understanding Text Pre-Processing for Latent Dirichlet Allocation.” In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, 2:432–36.
Amati, G, and Cornelis J Van Rijsbergen. 2002. “Probabilistic Models of Information Retrieval Based on Measuring the Divergence from Randomness.” ACM Transactions on Information Systems 20 (4):357–89.