8.6 Special Cases

There are situations where a data point isn’t missing but is also not complete. For example, when measuring the duration of time until an event, it might be known that the duration is at least some time \(T\) (since the event has not occurred). These types of values are referred to as censored⁷⁵. A variety of statistical methods have been developed to analyze this type of data.

Durations are often right censored since the terminating value is not known. In other cases, left censoring can occur. For example, laboratory measurements may have a lower limit of detection, which means that the measuring instrument cannot reliably quantify values below a threshold \(X\). This threshold is usually determined experimentally in the course of developing the measurement system. When a predictor has values below the lower limit of detection, these values are usually reported as “\(<X\)”. When these data are to be included in a predictive model, the question is how to handle the censored values. A widely accepted practice is to use the lower limit value of \(X\) as the result. While this would not adversely affect some partitioning models, such as trees or rules, it may have a detrimental impact on other models since it assumes that these are the true values. Substituting a single value also affects metrics that measure variability: because every censored observation is assigned the same value, the variability will be underestimated. This effect is similar to that of binning (Section 6.2.2), and the model may overfit to a cluster of data points with the same value.

To mitigate the variability issue, left censored values can be imputed using random uniform values between zero and \(X\). In cases where there is good information about the distribution below \(X\), other random value assignment schemes can be used that better represent the distribution. For example, there may be some scientific or physiological reason that the smallest possible value is greater than zero (but less than \(X\)). While imputing in this fashion adds random noise to the data, it is likely to be preferable to the potential overfitting issues that can occur by assigning a value of \(X\) to the data.
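As a minimal sketch of the uniform imputation described above, the following Python snippet replaces censored readings with random draws between a lower bound and \(X\). The function name `impute_left_censored`, the use of `None` to mark a value reported as “\(<X\)”, and the example readings are all assumptions made for illustration; they are not taken from any particular library or data set.

```python
import random
import statistics

def impute_left_censored(values, limit, lower=0.0, seed=None):
    """Replace censored entries (None) with uniform draws between lower and limit."""
    rng = random.Random(seed)
    return [rng.uniform(lower, limit) if v is None else v for v in values]

# Hypothetical assay readings; None marks a value reported as "<X" with X = 0.5.
readings = [None, 1.2, None, 0.9, None, 2.4, None, 0.7]
limit = 0.5

# Common practice: substitute the limit itself, which stacks points at X.
substituted = [limit if v is None else v for v in readings]

# Alternative: spread the censored values uniformly over [0, X].
imputed = impute_left_censored(readings, limit, seed=42)

# Compare the spread of the predictor under each scheme.
print("substituted:", statistics.pstdev(substituted))
print("imputed:    ", statistics.pstdev(imputed))
```

If there is a scientific reason that the smallest possible value is greater than zero, the `lower` argument can be raised accordingly. A different sampling distribution could be swapped in for `rng.uniform` when more is known about the values below \(X\).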

Another atypical situation occurs when the data have a strong time component. In this case, simple moving average smoothers can be used to impute the data so that any temporal effects are preserved rather than disrupted. As previously mentioned in Section 6.1, care must be taken at the ends of the series so that the test (or other) data are not used to impute the training set values.
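One way such an imputation might look is sketched below: a centered moving average fills each gap using only observed neighbors, and a cutoff index keeps later data out of the calculation. The function `moving_average_impute`, its `window` and `train_end` parameters, and the example series are hypothetical and are used only for illustration.

```python
def moving_average_impute(series, window=3, train_end=None):
    """Fill None entries with the mean of observed neighbors within `window` steps.

    Only positions before `train_end` are read or filled, so later (e.g., test)
    observations never contribute to the imputed training values.
    """
    n = len(series) if train_end is None else train_end
    filled = list(series)
    for i in range(n):
        if series[i] is not None:
            continue
        lo, hi = max(0, i - window), min(n, i + window + 1)
        neighbors = [series[j] for j in range(lo, hi) if series[j] is not None]
        if neighbors:  # leave the gap as-is if no observed neighbor is in range
            filled[i] = sum(neighbors) / len(neighbors)
    return filled

# Hypothetical daily series: the first 6 points are training data, the rest are test data.
series = [10.0, 12.0, None, 13.0, 15.0, None, 40.0, 42.0]

train_filled = moving_average_impute(series, window=1, train_end=6)
print(train_filled[2])  # 12.5, the mean of the neighbors 12.0 and 13.0
print(train_filled[5])  # 15.0, since the test value 40.0 is excluded
```

Without the `train_end` cutoff, the gap at position 5 would be filled with the mean of 15.0 and 40.0, letting a test set value influence the training data.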

⁷⁵ In cases where the data are not defined outside of lower or upper bounds, the data would be considered truncated.