# 11 Feature Selection

## Goals of Feature Selection
