Feature Engineering and Selection: A Practical Approach for Predictive Models
Notes to readers:
This text is a work in progress. It will eventually be published in this format as well as in a more traditional physical medium by Chapman & Hall/CRC.
We’ve released this initial version to get more feedback beyond what our excellent reviewers and editor have already provided. Feedback can be given at the GitHub repository: https://github.com/topepo/FES/issues. Copyediting has not been done yet, so read at your own risk. Right now, we are primarily interested in the quality and organization of the content, but we are open to all of your thoughts.
Code and data will be provided, but not until everything has been finalized. That might be frustrating, but we’d rather wait.
Thanks for taking the time to read this.
Changes since the 2018-09-09 Release
- All chapters are in the current release.
- The OkCupid data were reprocessed, and there were slight changes in the results that affected Section 3.7.
- Chapter 2 has been reworked to include more material on feature selection.
- The results for the linear projection methods in Section 6.3.1 have changed. New NNMF software was used, and a better data set on the L stations enabled more components in the plots.
- Chapter 7 has also been substantially changed so that the Ames data are used instead of the stroke data.
- New figures were added to Chapter 8 to show how PCA can be used to find missing data patterns across rows and columns.
- The chapter “Feature Engineering Without Overfitting” was removed. The points that we intended to make in this chapter had already been added to various sections throughout the book. To reduce redundancy, the chapter was removed and a small section was added to Chapter 3.
- In the HTML version, pages are sections now (rather than chapters). This decreases the page load time.
- The data sets are now available at http://bit.ly/FES-data. There are links at the end of each chapter to the locations of the R scripts. We will start releasing code once the content has been finalized, so these links will not work until those files are released.
The goal of our previous work, Applied Predictive Modeling, was to elucidate a framework for constructing models that generate accurate predictions for future, yet-to-be-seen data. This framework includes pre-processing the data, splitting the data into training and testing sets, selecting an approach for identifying optimal tuning parameters, building models, and estimating predictive performance. This approach protects against overfitting to the training data and helps models identify truly predictive patterns that generalize to future data, thus enabling good predictions for that data. Authors and modelers have successfully used this framework to derive models that have won Kaggle competitions (Raimondi 2010), have been implemented in diagnostic tools (Jahani and Mahdavi 2016; Luo 2016), are being used as the backbone of investment algorithms (Stanković, Marković, and Stojanović 2015), and are being used as a screening tool to assess the safety of new pharmaceutical products (Thomson et al. 2011).
In addition to having a good approach to the modeling process, building an effective predictive model requires other good practices. These practices include:
- garnering expert knowledge about the process being modeled,
- collecting the appropriate data to answer the desired question,
- understanding the inherent variation in the response and taking steps, if possible, to minimize this variation,
- ensuring that the predictors collected are relevant for the problem, and
- utilizing a range of model types to have the best chance of uncovering relationships among the predictors and the response.
Despite our attempts to follow these good practices, we are sometimes frustrated to find that the best models have less-than-anticipated predictive performance. This lack of performance may be due to a cause that is simple to explain but difficult to pinpoint: the relevant predictors were collected, but they are represented in a way that makes it hard for models to achieve good performance. Key relationships that are not directly available as predictors may exist between the response and:
- a transformation of a predictor,
- an interaction of two or more predictors such as a product or ratio,
- a functional relationship among predictors, or
- an equivalent re-representation of a predictor.
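To make these re-representations concrete, here is a small hypothetical sketch in R (the predictors `x1` and `x2` and the derived columns are illustrative toy data, not taken from the book):

```r
# Toy data with two hypothetical predictors, x1 and x2
set.seed(101)
dat <- data.frame(x1 = runif(50, 1, 10), x2 = runif(50, 1, 10))

dat$log_x1  <- log(dat$x1)      # a transformation of a predictor
dat$prod    <- dat$x1 * dat$x2  # an interaction: product of two predictors
dat$ratio   <- dat$x1 / dat$x2  # an interaction: ratio of two predictors
dat$rank_x1 <- rank(dat$x1)     # an equivalent re-representation (ranks)
```

Any of these derived columns could then be offered to a model alongside, or instead of, the original predictors.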
Adjusting and reworking the predictors to enable models to better uncover predictor-response relationships has been termed feature engineering. The engineering connotation implies that we know the steps to take to fix poor performance and to guide predictive improvement. However, we often do not know the best re-representation of the predictors to improve model performance. Instead, the reworking of predictors is more of an art, requiring the right tools and experience to find better predictor representations. Moreover, we may need to search many alternative predictor representations to improve model performance. This process, too, can lead to overfitting due to the vast number of alternative representations considered, so appropriate care must be taken during the predictor creation process.
The goals of Feature Engineering and Selection are to provide tools for re-representing predictors, to place these tools in the context of a good predictive modeling framework, and to convey our experience of utilizing these tools in practice. In the end, we hope that these tools and our experience will help you generate better models. When we started writing this book, we could not find any comprehensive references, beyond those focused solely on image and text data, that described and illustrated the tactics and strategies that can be used to improve models by focusing on the predictor representations.
As in Applied Predictive Modeling, we have used R as the computational engine for this text. There are a few reasons for that. First, while not the only good option, R has been shown to be popular and effective in modern data analysis. Second, R is free and open source: you can install it anywhere, modify the code, and see exactly how computations are performed. Third, it has excellent community support via the canonical R mailing lists and, more importantly, via Stack Overflow and RStudio Community. Anyone who asks a reasonable, reproducible question has a pretty good chance of getting an answer.
Also, as in our previous effort, it is critically important to us that all of the software and data are freely available. This allows everyone to reproduce our work, find bugs or errors, and even extend our approaches. The data sets and R code are available in the GitHub repository.
We’d like to thank everyone who contributed feedback, typos, or discussions while the book was being written. As of February 22, 2019, GitHub contributors included: @alexpghayes, @AllardJM, @AndrewKostandy, @danielwo, @draben, @eddelbuettel, @endore, @feinmann, @gtesei, @ifellows, @JohnMount, @jonimatix, @juliasilge, @jwillage, @kaliszp, @KevinBretonnelCohen, @KnightAdz, @kransom14, @LG-1, @LluisRamon, @LoweCoryr, @monogenea, @mpettis, @Nathan-Furnal, @nazareno, @PedramNavid, @r0f1, @ronencozen, @shinhongwu, @stecaron, @StefanZaaiman, and @uwesterr.
© 2018 by Taylor & Francis Group, LLC. Except as permitted under U.S. copyright law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by an electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.
Raimondi, C. 2010. “How I Won the Predict HIV Progression Data Mining Competition.” http://blog.kaggle.com/2010/08/09/how-i-won-the-hiv-progression-prediction-data-mining-competition/.
Jahani, M, and M Mahdavi. 2016. “Comparison of Predictive Models for the Early Diagnosis of Diabetes.” Healthcare Informatics Research 22 (2):95–100.
Luo, G. 2016. “Automatically Explaining Machine Learning Prediction Results: A Demonstration on Type 2 Diabetes Risk Prediction.” Health Information Science and Systems 4 (1):2.
Stanković, J, I Marković, and M Stojanović. 2015. “Investment Strategy Optimization Using Technical Analysis and Predictive Modeling in Emerging Markets.” Procedia Economics and Finance 19:51–62.
Thomson, J, K Johnson, R Chapin, D Stedman, S Kumpf, and T Ozolinš. 2011. “Not a Walk in the Park: The ECVAM Whole Embryo Culture Model Challenged with Pharmaceuticals and Attempted Improvements with Random Forest Design.” Birth Defects Research Part B: Developmental and Reproductive Toxicology 92 (2):111–21.