5.6 Creating Features from Text Data
Often, data contain textual fields that are gathered from questionnaires, articles, reviews, tweets, and other sources. For example, the OkCupid data contain the responses to nine open text questions, such as “my self-summary” and “six things I could never do without”. The open responses are likely to contain information relevant to the outcome. Individuals in a STEM field are likely to have a different vocabulary than, say, someone in the humanities, so the words or phrases used in these open text fields could be very important to predicting the outcome. This kind of data is qualitative and requires more effort to put into a form that models can consume. How, then, can these data be explored and represented for a model?
For example, some profile text answers contained links to external websites. A reasonable hypothesis is that the presence of a hyperlink could be related to the person’s profession. Table 5.4 shows the results for the random subset of profiles. The rate of hyperlinks in the STEM profiles was 21%, while the rate in the non-STEM profiles was 11.6%. One way to evaluate a difference in two proportions is the odds-ratio (Agresti 2012). First, the odds of an event that occurs with rate \(p\) are defined as \(p/(1-p)\). For the STEM profiles, the odds of containing a hyperlink are relatively small, with a value of \(0.21/0.79 = 0.266\). For the non-STEM profiles, the odds are even smaller (0.132). The ratio of these two quantities can be used to understand how the effect of having a hyperlink differs between the two professions. In this case, the odds of a profile being STEM are nearly 2.0-times higher when the profile contains a link. Basic statistics can be used to assign a lower 95% confidence bound on this quantity. For these data the lower bound is 1.8; since the interval does not include a value of 1.0, the increase in the odds is unlikely to be due to random noise. Given these results, an indicator for the presence of a hyperlink will likely benefit a model and should be included.
           STEM   other
Link       1051     582
No Link    3949    4418
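The odds-ratio and its lower confidence bound can be reproduced directly from the counts in Table 5.4. The sketch below uses a Wald-style interval on the log-odds scale, which is one common choice and not necessarily the exact method used for these data:

```python
import math

# Counts from Table 5.4 (rows: link / no link; columns: STEM / other).
stem_link, stem_no_link = 1051, 3949
other_link, other_no_link = 582, 4418

# Odds of a hyperlink within each group: p / (1 - p), computed from the counts.
odds_stem = stem_link / stem_no_link      # ~0.266
odds_other = other_link / other_no_link   # ~0.132
odds_ratio = odds_stem / odds_other       # ~2.02

# Wald lower 95% confidence bound, computed on the log scale.
se_log_or = math.sqrt(1 / stem_link + 1 / stem_no_link +
                      1 / other_link + 1 / other_no_link)
lower_95 = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)  # ~1.81
```

The lower bound of about 1.8 matches the value quoted above, confirming that the interval excludes 1.0.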
Are there words or phrases that would make good predictors of the outcome? To determine this, the text data must first be processed and cleaned. At this point, there were 63,630 distinct words in the subsample of 10,000 profiles. A variety of features were computed on the data, such as the number of commas, hashtags, mentions, exclamation points, and so on. This resulted in a set of 13 new “text-related” features. In these data, there is an abundance of HTML markup in the text, such as `<br />` and `<a href="link">` tags. These were removed from the data, along with punctuation, line breaks, and other symbols. Additionally, some words that were nonsensical repeated characters, such as `***` or `aaaaaaaaaa`, were removed. Additional types of text preprocessing are described below; for more information, Christopher, Prabhakar, and Hinrich (2008) is a good introduction to the analysis of text data, while Silge and Robinson (2017) is an excellent reference on the computational aspects of analyzing such data.
Given this set of 63,630 words and their associated outcome classes, odds-ratios can be computed. First, the words were filtered so that only terms with at least 50 occurrences in the 10,000 profiles would be analyzed. This cutoff was chosen so that there was a sufficient frequency for modeling. With this constraint, the number of potential keywords was reduced to 4,940 terms. For each of these, the odds-ratio and an associated p-value were computed. The p-value tests the hypothesis that the odds of the keyword occurring are equal in the two professional groups (i.e., that the odds-ratio is 1.0). However, the p-values can easily provide misleading results, for two reasons:
In isolation, the p-value only addresses the question “Is there a difference in the odds between the two groups?” The more appropriate question is “How much of a difference is there between the groups?” The p-value does not measure the magnitude of the difference.
When testing a single (predefined) hypothesis, the false positive rate for a hypothesis test using \(\alpha = 0.05\) is 5%. However, here there are 4,940 tests on the same data set, which inflates the overall false positive rate dramatically. The false-discovery rate (FDR) p-value correction (Efron and Hastie 2016) was developed for just this type of scenario. This procedure uses the entire distribution of p-values and attenuates the false positive rate. If a particular keyword has an FDR value of 0.30, this implies that the collection of keywords with FDR values of 0.30 or less has a collective 30% false discovery rate. For this reason, the focus here will be on the FDR values generated using the Benjamini-Hochberg correction (Benjamini and Hochberg 1995).
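The Benjamini-Hochberg step-up adjustment itself is short enough to sketch. The function below is a minimal illustration in plain Python, not the production implementation used for these data:

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg FDR adjustment for a list of p-values."""
    m = len(p_values)
    # Sort p-values ascending, remembering their original positions.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, taking a running minimum of p * m / rank.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

For example, `bh_adjust([0.005, 0.009, 0.05, 0.2])` yields approximately `[0.018, 0.018, 0.067, 0.2]`: the second-smallest p-value’s adjusted value (0.018) propagates down to the smallest because of the running minimum.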
To characterize these results, a volcano plot is shown in Figure 5.4, where the estimated odds-ratio is shown on the x-axis and the negative log of the FDR values on the y-axis (larger values indicate greater statistical significance). The size of the points reflects the number of events found in the sampled data set. Keywords falling in the upper left- or upper right-hand regions show a strong difference between the classes that is unlikely to be a random result. The plot shows far more keywords with a higher likelihood of being found in the STEM profiles than in the non-STEM professions, as more points fall on the upper right-hand side of one on the x-axis. Several of these have extremely high levels of statistical significance, with FDR values that are vanishingly small.
As a rough criterion for “importance”, keywords with odds-ratios of at least 2 (in either direction) and an FDR value less than \(10^{-5}\) will be considered for modeling. This results in 50 keywords:
`biotech`, `climbing`, `code`, `coding`, `company`, `computer`, `computers`, `data`, `developer`, `ender`, `engineer`, `engineering`, `feynman`, `firefly`, `fixing`, `geek`, `geeky`, `im`, `internet`, `lab`, `law`, `lol`, `marketing`, `math`, `matrix`, `mechanical`, `neal`, `nerd`, `problems`, `programmer`, `programming`, `robots`, `science`, `scientist`, `silicon`, `software`, `solve`, `solving`, `startup`, `stephenson`, `student`, `systems`, `teacher`, `tech`, `techie`, `technical`, `technology`, `therapist`, `web`, `websites`
Of these keywords, only 7 were enriched in the non-STEM profiles: `im`, `lol`, `law`, `teacher`, `student`, `therapist`, and `marketing`. Of the STEM-enriched keywords, many make sense (and play to stereotypes). The majority are related to occupation (e.g., `engin`, `startup`, and `scienc`), while others are clearly related to popular geek culture and sci-fi, such as `firefli`^{47}, `neal`, and `stephenson`^{48}.
One final set of 9 features was computed related to the sentiment and language of the essays. There are curated collections of words with assigned sentiment values. For example, “horrible” has a fairly negative connotation while “wonderful” is associated with positivity. Words can be assigned qualitative assessments (e.g., “positive”, “neutral”, etc.) or numeric scores where neutrality is given a value of zero. In some problems, sentiment might be a good predictor of the outcome. This feature set included sentiment-related measures, as well as measures of the point of view (i.e., first-, second-, or third-person text) and other language elements.
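As a sketch of how lexicon-based sentiment scoring works, the toy function below averages numeric word scores. The mini-lexicon and its values are invented for illustration; real analyses use curated lexicons with thousands of scored entries:

```python
# Hypothetical mini-lexicon: word -> sentiment score (0 would mean neutral).
LEXICON = {"horrible": -3, "wonderful": 3, "good": 1, "bad": -1}

def sentiment_score(text):
    """Mean lexicon score over the words found in the lexicon (0.0 if none)."""
    words = text.lower().split()
    scores = [LEXICON[w] for w in words if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0
```

For example, a profile saying “what a wonderful day” scores 3.0, while “horrible bad traffic” scores -2.0; text with no lexicon words defaults to neutral.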
Do these features have an impact on a model? A series of logistic regression models were computed using different feature sets:
A basic set of profile characteristics unrelated to the essays (e.g. age, religion, etc.) consisting of 159 predictors (after generating dummy variables). This resulted in a baseline area under the ROC curve of 0.768. The individual resamples for this model are shown in Figure 5.5.
The addition of the simple text features increased the number of predictors to 172. Performance was not appreciably affected as the AUC value was 0.775. These features were discarded from further use.
The keywords were added to the basic profile features. In this case, performance jumped to 0.842. Recall that the keywords were derived from a smaller portion of these data, so this estimate may be slightly optimistic. However, the entire training set was cross-validated and, if overfitting were severe, the resampled performance estimates would not be very good.
The sentiment and language features were added to this model, producing an AUC of 0.844. This indicates that these aspects of the essays did not provide any predictive value above and beyond what was already captured by the previous features.
Overall, the keyword and basic feature model would be the best version found here^{49}.
The strategy shown here for computing and evaluating features from text is fairly simplistic and is not the only approach that could be taken. Other methods for preprocessing text data include:
 removing commonly used stop words, such as “is”, “the”, “and”, etc.
 stemming the words so that similar words, such as the singular and plural versions, are represented as a single entity.
The SMART lexicon of stop words (Lewis et al. 2004) was filtered out of the current set of words, leaving 63,118 unique results. The words were then “stemmed” using the Porter algorithm (Willett 2006) to convert similar words to a common root. For example, these 7 words are fairly similar: `teach`, `teacher`, `teachers`, `teaches`, `teachable`, `teaching`, and `teachings`. Stemming reduces these to 3 unique values: `teach`, `teacher`, and `teachabl`. Once the values were stemmed, there were 45,704 unique words remaining. While this does reduce the potential number of terms to analyze, there are potential drawbacks related to a reduced specificity of the text. Schofield and Mimno (2016) and Schofield, Magnusson, and Mimno (2017) demonstrate that these preprocessing steps can do harm.
One other method for determining relevant features uses the term frequency-inverse document frequency (tf-idf) statistic (Amati and Van Rijsbergen 2002). Here the goal is to find words or terms that are important to individual documents in the collection of documents at hand. For example, if the words in this book were processed, some would have high frequency (e.g., `predictor`, `encode`, or `resample`) but are unusual in most other contexts.
For a word W in document D, the term frequency, tf, is the number of times that W is contained in D (usually adjusted for the length of D). The inverse document frequency, idf, is a weight that normalizes by how often the word occurs in the current collection of documents. As an example, suppose the term frequency of the word `feynman` in a specific profile was 2. In the sample of profiles used to derive the features, this word occurs in 55 of the 10,000 profiles. The inverse document frequency is usually represented as the log of the ratio, or \(\log_2(10000/55) = 7.5\). The tf-idf value for `feynman` in this profile would then be \(2\times\log_2(10000/55)\), or about 15. However, suppose that in the same profile the word `internet` also occurs twice. This word is more prevalent across profiles: it is contained at least once in 1,529 profiles. Its tf-idf value is 5.4; in this way the same raw count is down-weighted according to the word’s abundance in the overall data set. The word `feynman` occurs the same number of times as `internet` in this hypothetical profile, but `feynman` is more distinctive or important (as measured by tf-idf) because it is rarer overall, and thus possibly more effective as a feature for prediction.
One potential issue with tf-idf in predictive models is the notion of the “current collection of documents”. This quantity can be computed for the training set, but what should be done when a single new sample is being predicted (i.e., there is no collection)? One approach is to use the overall idf values from the training set to weight the term frequencies found in new samples.
Additionally, all of the previous approaches consider a single word at a time. Sequences of consecutive words can also be considered and might provide additional information. n-grams are terms that are sequences of n consecutive words. One might imagine that the sequence “I hate computers” would be scored differently than the simple term “computer” when predicting whether a profile corresponds to a STEM profession.
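Extracting n-grams from tokenized text is straightforward; a minimal sketch:

```python
def ngrams(text, n):
    """Return the list of n-grams (tuples of lowercased words) from a string."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
```

For example, `ngrams("I hate computers", 2)` gives the bigrams `[("i", "hate"), ("hate", "computers")]`, and `n=3` yields the single trigram `("i", "hate", "computers")`, which could then be screened with the same odds-ratio approach used for single keywords.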
References
Agresti, A. 2012. Categorical Data Analysis. Wiley-Interscience.
Christopher, D, R Prabhakar, and S Hinrich. 2008. Introduction to Information Retrieval. Cambridge University Press.
Silge, J, and D Robinson. 2017. Text Mining with R: A Tidy Approach. O’Reilly.
Efron, B, and T Hastie. 2016. Computer Age Statistical Inference. Cambridge University Press.
Benjamini, Y, and Y Hochberg. 1995. “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” Journal of the Royal Statistical Society, Series B (Methodological), 289–300.
Lewis, D, Y Yang, T Rose, and F Li. 2004. “RCV1: A New Benchmark Collection for Text Categorization Research.” Journal of Machine Learning Research 5:361–97.
Willett, P. 2006. “The Porter Stemming Algorithm: Then and Now.” Program 40 (3):219–23.
Schofield, A, and D Mimno. 2016. “Comparing Apples to Apple: The Effects of Stemmers on Topic Models.” Transactions of the Association for Computational Linguistics 4:287–300.
Schofield, A, M Magnusson, and D Mimno. 2017. “Understanding Text Pre-Processing for Latent Dirichlet Allocation.” In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, 2:432–36.
Amati, G, and Cornelis J Van Rijsbergen. 2002. “Probabilistic Models of Information Retrieval Based on Measuring the Divergence from Randomness.” ACM Transactions on Information Systems 20 (4):357–89.