3.1 Illustrative Example: OkCupid Profile Data
OkCupid is an online dating site that serves international users. Kim and Escobedo-Land (2015) describe a data set in which over 50,000 profiles from the San Francisco area were made available [15]; the data can be found in a GitHub repository [16]. The data contain several types of variables:
- open text essays related to an individual’s interests and personal descriptions,
- single choice type fields such as profession, diet, and education, and
- multiple choice fields such as languages spoken and fluency in programming languages.
In their original form, almost all of the raw data fields are discrete in nature; only age and the time since last login were numeric. The categorical predictors were converted to dummy variables (Chapter 5) prior to fitting models. For the analyses of these data in this chapter, the open text data will be ignored but will be probed later (Section 5.6). Of the 209 predictors that were used, there were clusters of variables for geographic location (i.e., town, \(p = 3\)), religious affiliation (\(p = 13\)), astrological sign (\(p = 15\)), children (\(p = 15\)), pets (\(p = 15\)), income (\(p = 12\)), education (\(p = 31\)), diet (\(p = 17\)), and over 50 variables related to spoken languages. For more information on this data set and how it was processed, see the book’s GitHub repository.
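To make the dummy variable conversion concrete, the following is a minimal Python sketch, not the code used for the analyses in this book. The toy profiles, field names, and the helper functions `observed_levels` and `encode` are all hypothetical. The sketch assumes one 0/1 indicator per observed level (no reference level is dropped) and treats a multiple choice field as a list in which several indicators can be 1 at once.

```python
# Toy profiles; field names and values are hypothetical, not taken from the real data.
profiles = [
    {"diet": "vegetarian", "languages": ["english", "spanish"]},
    {"diet": "anything", "languages": ["english"]},
    {"diet": "vegan", "languages": ["english", "c++"]},
]


def observed_levels(rows, field):
    """Return the sorted levels observed for a field across all rows."""
    levels = set()
    for row in rows:
        value = row[field]
        # A multiple choice field holds a list; a single choice field holds one string.
        levels.update(value if isinstance(value, list) else [value])
    return sorted(levels)


def encode(rows, fields):
    """Expand each categorical field into 0/1 indicator columns."""
    levels = {field: observed_levels(rows, field) for field in fields}
    encoded = []
    for row in rows:
        out = {}
        for field in fields:
            value = row[field]
            chosen = set(value if isinstance(value, list) else [value])
            for level in levels[field]:
                out[f"{field}_{level}"] = int(level in chosen)
        encoded.append(out)
    return encoded


encoded = encode(profiles, ["diet", "languages"])
for row in encoded:
    print(row)
```

Each single choice field contributes exactly one nonzero indicator per profile, while a multiple choice field can contribute several. This is one reason a field such as spoken languages can expand into dozens of predictors.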
For this demonstration, the goal will be to predict whether a person’s profession is in the STEM fields (science, technology, engineering, and math). There is a moderate class imbalance in these data; only 18.5% of profiles report a profession in these areas. Although the imbalance has a significant impact on the analysis, the illustration presented here will mostly side-step this issue by down-sampling the instances so that the number of profiles in each class is equal. See Chapter 16 of Kuhn and Johnson (2013) for a detailed description of techniques for dealing with infrequent classes.
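As a rough illustration of down-sampling, the sketch below randomly discards profiles from the larger class until both classes are the same size. It is written in Python under stated assumptions and is not the procedure used to produce the results in this chapter: the function `down_sample`, the `stem` label, and the simulated rows are hypothetical, and the toy data only mimic the 18.5% rate quoted above.

```python
import random


def down_sample(rows, label_key, seed=0):
    """Randomly reduce every class to the size of the smallest class."""
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible

    # Group the rows by class label.
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)

    # Sample, without replacement, as many rows as the smallest class has.
    n_min = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, n_min))

    rng.shuffle(balanced)  # avoid leaving the rows ordered by class
    return balanced


# Simulated labels: 185 of 1,000 toy profiles (18.5%) are flagged as STEM.
rows = [{"id": i, "stem": i < 185} for i in range(1000)]
balanced = down_sample(rows, "stem")

n_stem = sum(row["stem"] for row in balanced)
print(len(balanced), n_stem, len(balanced) - n_stem)
```

Down-sampling discards information from the majority class, so it trades some precision for a balanced training set. Any such subsampling would normally be applied only to the data used for fitting, not to the data used to measure performance.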
Kim, A, and A Escobedo-Land. 2015. “OkCupid Data for Introductory Statistics and Data Science Courses.” Journal of Statistics Education 23 (2):1–25.
Kuhn, M, and K Johnson. 2013. Applied Predictive Modeling. Springer.
[15] While there have been instances where online dating information has been obtained without authorization, these data were made available with permission from OkCupid president and co-founder Christian Rudder. For more information, see the original publication. In these data, no user names or images were made available.