11 Feature Selection

Goals of feature selection
