Missing values are common occurrences in data. Unfortunately, most predictive modeling techniques cannot handle any missing values. Therefore, this problem must be addressed prior to modeling. Missing data may occur due to random chance or due to a systematic cause. Understanding the nature of the missing values can help to guide the decision process about how best to remove or impute the data.
One of the best ways to understand the amount and nature of missing values is through an appropriate visualization. For smaller data sets, a heatmap or co-occurrence plot is helpful. Larger data sets can be visualized by plotting the first two scores from a PCA model of the missing data indicator matrix.
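As a minimal sketch of the second approach, the missing data indicator matrix can be built with NumPy and its first two principal component scores computed via the singular value decomposition. The data here are synthetic and purely illustrative; samples with similar missingness patterns will fall near one another in the score plot.

```python
import numpy as np

# Synthetic data matrix with injected missing values (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))
X[rng.random(X.shape) < 0.2] = np.nan  # make roughly 20% of entries missing

# Missing data indicator matrix: 1 where a value is absent, 0 otherwise
M = np.isnan(X).astype(float)

# PCA of the indicator matrix via SVD of the column-centered matrix
Mc = M - M.mean(axis=0)
U, s, Vt = np.linalg.svd(Mc, full_matrices=False)

# First two principal component scores, one row per sample;
# these are the coordinates that would be plotted
scores = U[:, :2] * s[:2]
```

Plotting `scores[:, 0]` against `scores[:, 1]` (e.g., with matplotlib) then reveals clusters of samples that share a systematic missingness pattern.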
Once the severity of missing values is known, a decision must be made about how to address them. When a predictor or a sample has a severe amount of missing data, it may be prudent to remove it from the analysis. Alternatively, if a predictor is qualitative, missing values can be encoded as their own “missing” category.
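The "missing" category encoding for a qualitative predictor can be sketched in a few lines with pandas; the column name and values below are hypothetical.

```python
import pandas as pd

# Hypothetical qualitative predictor containing missing entries
color = pd.Series(["red", None, "blue", None, "red"], dtype="object")

# Treat missingness as informative: encode it as its own category
color_encoded = color.fillna("missing")
```

After this step, `color_encoded` contains no missing values and the "missing" level can be dummy-encoded like any other category, letting the model learn from the fact that the value was absent.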
For small to moderate amounts of missingness, the values can be imputed. There are many types of imputation techniques. A few that are particularly useful when the ultimate goal is to build a predictive model are k-nearest neighbors, bagged trees, and linear models. Each of these methods has a relatively small computational footprint and can be computed quickly, which enables it to be included in the resampling process of predictive modeling.
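As one concrete example of these imputation methods, k-nearest neighbors imputation is available in scikit-learn via `KNNImputer`. The tiny matrix below is made up for illustration: each missing entry is replaced by the average of that column across the sample's nearest neighbors, where distances are computed on the observed values.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Small illustrative matrix with one missing value
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [5.0, 6.0],
              [7.0, 8.0]])

# Impute each missing entry from the two nearest neighbors,
# measured with a distance that ignores missing coordinates
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
```

Because the imputer is fit with `fit_transform` (and later applied with `transform`), it can be embedded inside each resample of a modeling pipeline so that imputation is estimated only from the analysis set, avoiding information leakage.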