11.1 Illustrative Data: Predicting Parkinson’s Disease

Sakar et al. (2019) describe an experiment in which a group of 252 patients, 188 of whom had a previous diagnosis of Parkinson’s disease, were recorded speaking a particular sound three separate times. Several signal processing techniques were then applied to each replicate to create 750 numerical features. The objective was to use the features to classify patients’ Parkinson’s disease status. Groups of features within these data could be considered as following a profile (Chapter 9), since many of the features consisted of related sets of fields produced by each type of signal processor (e.g., across different sound wavelengths or sub-bands). Not surprisingly, the resulting data have an extreme amount of multicollinearity; about 10,750 pairs of predictors have absolute rank correlations greater than 0.75.
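To make the multicollinearity figure concrete, the following minimal sketch shows one way such a count could be computed: rank each predictor, compute the Pearson correlation of the ranks (which is the Spearman rank correlation), and count the unordered pairs whose absolute value exceeds the cutoff. The function names and the five-column toy data are hypothetical stand-ins, not the Parkinson's data or the original analysis code. For scale, 750 predictors form 280,875 distinct pairs, so about 10,750 pairs above the cutoff is roughly 3.8% of all pairs.

```python
import math
import random
from itertools import combinations

def ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def count_correlated_pairs(columns, threshold=0.75):
    """Count unordered column pairs with |Spearman correlation| > threshold."""
    ranked = {name: ranks(col) for name, col in columns.items()}
    return sum(
        1
        for a, b in combinations(ranked, 2)
        if abs(pearson(ranked[a], ranked[b])) > threshold
    )

# Hypothetical toy data: three near-duplicate columns plus two unrelated ones.
rng = random.Random(0)
signal = [rng.gauss(0, 1) for _ in range(200)]
toy = {
    "band_1": signal,
    "band_2": [s + 0.05 * rng.gauss(0, 1) for s in signal],
    "band_3": [s + 0.05 * rng.gauss(0, 1) for s in signal],
    "other_1": [rng.gauss(0, 1) for _ in range(200)],
    "other_2": [rng.gauss(0, 1) for _ in range(200)],
}

n_high = count_correlated_pairs(toy, threshold=0.75)
total_pairs_750 = math.comb(750, 2)  # number of distinct pairs among 750 predictors
```

In the toy data, only the three near-duplicate columns are strongly related to one another, so they contribute the pairs that exceed the cutoff.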

To illustrate the tools in this chapter, we used a stratified random sample based on patient disease status to allocate 25% of the data to the test set. The resulting training set consisted of 189 patients, 138 of whom had the disease. Models were evaluated using the area under the ROC curve.
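The two ideas in this paragraph, a split stratified on the outcome and the area under the ROC curve, can be sketched as follows. This is an illustrative implementation under simple assumptions, not the procedure used to produce the split described here: `stratified_split` and `roc_auc` are hypothetical helper names, the labels are toy values, and exact class counts in any real split depend on the sampling procedure and its rounding. The AUC is computed from its pairwise interpretation: the proportion of (diseased, non-diseased) pairs in which the diseased patient receives the higher score, with ties counted as one half.

```python
import random

def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx), sampling test_fraction within each class."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        n_test = round(test_fraction * len(members))
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)

def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels (1 = disease, 0 = no disease); not the study data.
labels = [1] * 12 + [0] * 8
train_idx, test_idx = stratified_split(labels, test_fraction=0.25)

# The class proportion is preserved in both partitions.
train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)

# Toy scores: one of the four (positive, negative) pairs is ordered incorrectly.
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

Because the pairwise AUC depends only on how the scores are ordered, it is unaffected by any monotone rescaling of the predicted class probabilities.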