The goal of our previous work, Applied Predictive Modeling, was to elucidate a framework for constructing models that generate accurate predictions for future, yet-to-be-seen data. This framework includes pre-processing the data, splitting the data into training and testing sets, selecting an approach for identifying optimal tuning parameters, building models, and estimating predictive performance. This approach protects from overfitting to the training data and helps models to identify truly predictive patterns that are generalizable to future data, thus enabling good predictions for that data. Authors and modelers have successfully used this framework to derive models that have won Kaggle competitions (Raimondi 2010), have been implemented in diagnostic tools (Jahani and Mahdavi 2016; Luo 2016), are being used as the backbone of investment algorithms (Stanković, Marković, and Stojanović 2015), and are being used as a screening tool to assess the safety of new pharmaceutical products (Thomson et al. 2011).
In addition to having a good approach to the modeling process, building an effective predictive model requires other good practices. These practices include garnering expert knowledge about the process being modeled, collecting the appropriate data to answer the desired question, understanding the inherent variation in the response and taking steps, if possible, to minimize this variation, ensuring that the predictors collected are relevant for the problem, and utilizing a range of model types to have the best chance of uncovering relationships between the predictors and the response.
Despite our attempts to follow these good practices, we are sometimes frustrated to find that the best models have less-than-anticipated, less-than-useful predictive performance. This lack of performance may be due to a simple-to-explain but difficult-to-pinpoint cause: the relevant, expert-guided predictors that we collected are not in a form that allows models to adequately uncover the essential predictive information for the response. Key relationships that are not directly available as predictors may be between the response and:
- a transformation of a predictor,
- an interaction of two or more predictors such as a product or ratio,
- a functional relationship among predictors, or
- an equivalent re-representation of a predictor.
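To make the first two items concrete, here is a minimal sketch in R (using simulated, hypothetical data) of what such re-representations might look like in practice: a log transformation of a skewed predictor and interactions expressed as a product and a ratio.

```r
# Simulated data for illustration only; the column names are hypothetical.
set.seed(1)
df <- data.frame(
  x1 = rlnorm(100),        # a right-skewed predictor
  x2 = runif(100),
  x3 = runif(100) + 0.5    # shifted away from zero so a ratio is stable
)

df$log_x1   <- log(df$x1)     # a transformation of a predictor
df$x2_x3    <- df$x2 * df$x3  # an interaction of two predictors (product)
df$ratio_23 <- df$x2 / df$x3  # an interaction of two predictors (ratio)
```

A linear model fit to `x1` alone cannot recover a relationship that is actually linear in `log(x1)`; supplying the transformed column lets even a simple model find it.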
Adjusting and reworking the predictors to enable models to better uncover predictor-response relationships has been termed feature engineering. The engineering connotation implies that we know the steps to take to fix poor performance and to guide predictive improvement. However, we often do not know the best re-representation of the predictors to improve model performance. Instead, the reworking of predictors is more of an art, requiring the right tools and experience to find better predictor representations. Moreover, we may need to search many alternative predictor representations to improve model performance. This search, too, can lead to overfitting because of the vast number of alternatives, so appropriate care must be taken during the predictor creation process.
The goals of Feature Engineering and Selection are to provide tools for re-representing predictors, to place these tools in the context of a good predictive modeling framework, and to convey our experience of utilizing these tools in practice. In the end, we hope that these tools and our experience will help you generate better models. When we started writing this book, we could not find any comprehensive references that described and illustrated the types of tactics and strategies that can be used to improve models by focusing on the predictor representations (that were not solely focused on images and text).
As in Applied Predictive Modeling, we have used R as the computational engine for this text. There are a few reasons for that. First, while not the only good option, R has been shown to be popular and effective in modern data analysis. Second, R is free and open-source. You can install it anywhere, modify the code, and see exactly how computations are performed. Third, it has excellent community support via the canonical R mailing lists and, more importantly, Stack Overflow and RStudio Community. Anyone who asks a reasonable, reproducible question has a pretty good chance of getting an answer.
Also, as in our previous effort, it is critically important to us that all the software and data are freely available. This allows everyone to reproduce our work, find bugs or errors, and even extend our approaches. The data sets and R code are available in the book's GitHub repository.
Raimondi, Chris. 2010. “How I Won the Predict HIV Progression Data Mining Competition.” http://blog.kaggle.com/2010/08/09/how-i-won-the-hiv-progression-prediction-data-mining-competition/.
Jahani, Meysam, and Mahdi Mahdavi. 2016. “Comparison of Predictive Models for the Early Diagnosis of Diabetes.” Healthcare Informatics Research 22 (2):95–100.
Luo, Gang. 2016. “Automatically Explaining Machine Learning Prediction Results: A Demonstration on Type 2 Diabetes Risk Prediction.” Health Information Science and Systems 4 (1):2.
Stanković, Jelena, Ivana Marković, and Miloš Stojanović. 2015. “Investment Strategy Optimization Using Technical Analysis and Predictive Modeling in Emerging Markets.” Procedia Economics and Finance 19:51–62.
Thomson, Jason, Kjell Johnson, Robert Chapin, Donald Stedman, Steven Kumpf, and Terence RS Ozolinš. 2011. “Not a Walk in the Park: The ECVAM Whole Embryo Culture Model Challenged with Pharmaceuticals and Attempted Improvements with Random Forest Design.” Birth Defects Research Part B: Developmental and Reproductive Toxicology 92 (2):111–21.