A note about this on-line text:

This book is sold by Taylor & Francis Group, which owns the copyright. We will be updating this version as we find errors or typos (see the Errata). The physical copies are sold by Amazon and Taylor & Francis.

The goal of our previous work, Applied Predictive Modeling, was to elucidate a framework for constructing models that generate accurate predictions for future, yet-to-be-seen data. This framework includes pre-processing the data, splitting the data into training and testing sets, selecting an approach for identifying optimal tuning parameters, building models, and estimating predictive performance. This approach protects from overfitting to the training data and helps models to identify truly predictive patterns that are generalizable to future data, thus enabling good predictions for that data. Authors and modelers have successfully used this framework to derive models that have won Kaggle competitions (Raimondi 2010), have been implemented in diagnostic tools (Jahani and Mahdavi 2016; Luo 2016), are being used as the backbone of investment algorithms (Stanković, Marković, and Stojanović 2015), and are being used as a screening tool to assess the safety of new pharmaceutical products (Thomson et al. 2011).

In addition to having a good approach to the modeling process, building an effective predictive model requires other good practices. These practices include garnering expert knowledge about the process being modeled, collecting the appropriate data to answer the desired question, understanding the inherent variation in the response and taking steps, if possible, to minimize this variation, ensuring that the predictors collected are relevant for the problem, and utilizing a range of model types to have the best chance of uncovering relationships among the predictors and the response.

Despite our attempts to follow these good practices, we are sometimes frustrated to find that the best models have less-than-anticipated, less-than-useful predictive performance. This lack of performance may be due to a simple-to-explain, but difficult-to-pinpoint, cause: relevant predictors that were collected are represented in a way that makes it difficult for models to achieve good performance. Key relationships that are not directly available as predictors may be between the response and:

  • a transformation of a predictor,
  • an interaction of two or more predictors such as a product or ratio,
  • a functional relationship among predictors, or
  • an equivalent re-representation of a predictor.
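
As a small illustration of the list above, the following sketch compares how strongly several candidate representations of two predictors track a response. It is written in plain Python purely for illustration and is not taken from the book's R code; the predictors `a` and `b`, the response `y`, and the helper `pearson` are all hypothetical, and the response is assumed to depend only on the ratio of the two predictors.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical predictors (made-up values, not from the book).
a = [2.0, 9.0, 4.0, 12.0, 6.0, 3.0, 10.0, 8.0]
b = [1.0, 9.0, 2.0, 12.0, 6.0, 1.0, 5.0, 2.0]

# Assumed response: it depends only on the ratio a / b.
y = [3.0 * ai / bi for ai, bi in zip(a, b)]

# Candidate re-representations of the original predictors.
features = {
    "a": a,
    "b": b,
    "log(a)": [math.log(ai) for ai in a],        # transformation
    "a * b": [ai * bi for ai, bi in zip(a, b)],  # interaction (product)
    "a / b": [ai / bi for ai, bi in zip(a, b)],  # interaction (ratio)
}

# Correlation of each candidate representation with the response.
correlations = {name: pearson(values, y) for name, values in features.items()}
for name, r in correlations.items():
    print(f"{name:>6}: correlation with y = {r:+.3f}")
```

Under these assumptions, neither raw predictor is perfectly linearly related to the response, while the ratio is. A model given only `a` and `b` would have to discover that relationship on its own, whereas supplying the ratio makes it directly available.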

Adjusting and reworking the predictors to enable models to better uncover predictor-response relationships has been termed feature engineering. The engineering connotation implies that we know the steps to take to fix poor performance and to guide predictive improvement. However, we often do not know the best re-representation of the predictors to improve model performance. Instead, the re-working of predictors is more of an art, requiring the right tools and experience to find better predictor representations. Moreover, we may need to search many alternative predictor representations to improve model performance. This process, too, can lead to overfitting due to the vast number of alternative predictor representations. So appropriate care must be taken to avoid overfitting during the predictor creation process.

The goals of Feature Engineering and Selection are to provide tools for re-representing predictors, to place these tools in the context of a good predictive modeling framework, and to convey our experience of utilizing these tools in practice. In the end, we hope that these tools and our experience will help you generate better models. When we started writing this book, we could not find any comprehensive references, other than those focused solely on images and text, that described and illustrated the types of tactics and strategies that can be used to improve models by focusing on the predictor representations.

As in Applied Predictive Modeling, we have used R as the computational engine for this text. There are a few reasons for that. First, while not the only good option, R has been shown to be popular and effective in modern data analysis. Second, R is free and open-source. You can install it anywhere, modify the code, and see exactly how computations are performed. Third, it has excellent community support via the canonical R mailing lists and, more importantly, via Twitter, StackOverflow, and RStudio Community. Anyone who asks a reasonable, reproducible question has a pretty good chance of getting an answer.

Also, as in our previous effort, it is critically important to us that all the software and data are freely available. This allows everyone to reproduce our work, find bugs and errors, and even extend our approaches. The data sets and R code are available in the GitHub repository An HTML version of this text can be found at

We’d like to thank everyone who contributed feedback, typo corrections, or discussions while the book was being written. GitHub contributors included, as of June 21, 2019, the following: @alexpghayes, @AllardJM, @AndrewKostandy, @bashhwu, @btlois, @cdr6934, @danielwo, @davft, @draben, @eddelbuettel, @endore, @feinmann, @gtesei, @ifellows, @JohnMount, @jonimatix, @jrfiedler, @juliasilge, @jwillage, @kaliszp, @KevinBretonnelCohen, @kieroneil, @KnightAdz, @kransom14, @LG-1, @LluisRamon, @LoweCoryr, @lpatruno, @mlduarte, @monogenea, @mpettis, @Nathan-Furnal, @nazareno, @PedramNavid, @r0f1, @Ronen4321, @shinhongwu, @stecaron, @StefanZaaiman, @treysp, @uwesterr, and @van1991. Hadley Wickham also provided excellent feedback and suggestions. Max would also like to thank RStudio for providing the time and support for this activity as well as the wonderful Twitter community. We would also like to thank the reviewers for their work, which helped improve the text. Our excellent editor, John Kimmel, has been a breath of fresh air for us. Most importantly, we would like to thank our families for their support: Louie, Truman, Stefan, Dan, and Valerie.

© 2018 by Taylor & Francis Group, LLC. Except as permitted under U.S. copyright law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by an electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.