2.1 Splitting

Before building these models, we split the data into two sets. The training set is used to develop models, preprocess the predictors, and explore relationships among the predictors and the response. The test set is the final arbiter of how well a given combination of predictor set and model performs. The original data set is partitioned in a stratified manner: a random split is made within each outcome class, which keeps the proportion of stroke patients approximately the same in both sets (Table 2.2). About 70% of the data were allocated to the training set.

Table 2.2: Distribution of stroke outcome by training and test split.
Data Set    Stroke = Yes, % (n)    Stroke = No, % (n)
Train       51% (45)               49% (44)
Test        51% (19)               49% (18)
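
The stratified split described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code used to produce Table 2.2: the function name stratified_split, its parameters, the seed, and the "yes"/"no" labels are all assumptions made for the example. The class totals (64 and 62) are the sums of the training and test counts in the table.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Return (train_idx, test_idx), sampling separately within each class.

    Hypothetical helper for illustration; not taken from the original analysis.
    """
    rng = random.Random(seed)

    # Group row indices by outcome class.
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        idx = by_class[label]
        rng.shuffle(idx)  # random order within this class only
        n_train = round(train_fraction * len(idx))
        train_idx.extend(idx[:n_train])
        test_idx.extend(idx[n_train:])
    return sorted(train_idx), sorted(test_idx)


# Outcome vector with the class totals implied by Table 2.2 (assumed labels).
labels = ["yes"] * 64 + ["no"] * 62
train_idx, test_idx = stratified_split(labels)

# Proportion of stroke patients in each partition.
train_rate = sum(labels[i] == "yes" for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] == "yes" for i in test_idx) / len(test_idx)
```

Because each class is split on its own, train_rate and test_rate both stay close to the overall stroke rate. The exact counts per cell depend on how the 70% target is rounded within each class, so a sketch like this may differ from Table 2.2 by a patient or so.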