This command is available with the Predictive Analytics Module.
A team of researchers collects and publishes detailed information about factors that affect heart disease. Variables include age, sex, cholesterol levels, maximum heart rate, and more. This example is based on a public data set; the original data are from archive.ics.uci.edu.
After initial exploration with CART® Classification to identify the important predictors, the researchers use both TreeNet® Classification and Random Forests® Classification to create more intensive models from the same data set. The researchers compare the model summary table and the ROC plot from the results to evaluate which model provides a better prediction outcome. For results from the other analyses, go to Example of CART® Classification and Example of Fit Model with TreeNet® Classification.
For this analysis, the number of observations is 303. Each of the 300 bootstrap samples draws 303 observations, with replacement, from the data to create a tree. The data include a reasonably balanced split of events (45.87%) and nonevents (54.13%).
Model validation | Validation with out-of-bag data |
---|---|
Number of bootstrap samples | 300 |
Sample size | Same as training data size of 303 |
Number of predictors selected for node splitting | Square root of the total number of predictors = 3 |
Minimum internal node size | 2 |
Rows used | 303 |
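
The settings in the table above can be illustrated with a small sketch of bootstrap sampling. This is not the software's implementation: it only assumes that each bootstrap sample draws 303 row indices with replacement and that the rows never drawn for a tree form that tree's out-of-bag set. The names `bootstrap_indices`, `out_of_bag`, and the seed are hypothetical, and taking the integer part of the square root of 13 is an assumption that happens to match the reported value of 3.

```python
import math
import random

N_OBS = 303          # rows used
N_TREES = 300        # number of bootstrap samples
N_PREDICTORS = 13    # total predictors

def bootstrap_indices(n, rng):
    """Draw n row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n) for _ in range(n)]

def out_of_bag(n, in_bag):
    """Rows that were never drawn into the bootstrap sample."""
    drawn = set(in_bag)
    return [i for i in range(n) if i not in drawn]

rng = random.Random(1)  # arbitrary seed, for repeatability only
oob_sizes = []
for _ in range(N_TREES):
    sample = bootstrap_indices(N_OBS, rng)
    oob_sizes.append(len(out_of_bag(N_OBS, sample)))

# Predictors considered at each node split (assumed: integer part of sqrt(13)).
mtry = math.isqrt(N_PREDICTORS)

# Average share of rows left out of a tree, and its theoretical expectation.
mean_oob_fraction = sum(oob_sizes) / (N_TREES * N_OBS)
expected_oob_fraction = (1 - 1 / N_OBS) ** N_OBS

print(f"predictors per split: {mtry}")
print(f"mean out-of-bag fraction: {mean_oob_fraction:.3f}")
print(f"expected out-of-bag fraction: {expected_oob_fraction:.3f}")
```

Under these assumptions, roughly a third of the rows are out-of-bag for any one tree, which is why out-of-bag data can serve as validation data without a separate test set.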
Variable | Class | Count | % |
---|---|---|---|
Heart Disease | Yes (Event) | 139 | 45.87 |
Heart Disease | No | 164 | 54.13 |
Heart Disease | All | 303 | 100.00 |
Total predictors | 13 |
---|---|
Important predictors | 13 |
Statistics | Out-of-Bag |
---|---|
Average -loglikelihood | 0.4004 |
Area under ROC curve | 0.9028 |
95% CI | (0.8693, 0.9363) |
Lift | 2.1079 |
Misclassification rate | 0.1848 |

For comparison, the following model summary is from TreeNet® Classification on the same data set.

Total predictors | 13 |
---|---|
Important predictors | 13 |
Number of trees grown | 500 |
Optimal number of trees | 351 |
Statistics | Training | Test |
---|---|---|
Average -loglikelihood | 0.2341 | 0.3865 |
Area under ROC curve | 0.9825 | 0.9089 |
95% CI | (0.9706, 0.9945) | (0.8757, 0.9421) |
Lift | 2.1799 | 2.1087 |
Misclassification rate | 0.0759 | 0.1750 |
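The Model summary table shows that the average negative loglikelihood is 0.4004 and the area under the ROC curve is 0.9028 for the out-of-bag data. These statistics indicate a model similar to the one that TreeNet® Classification creates when growing 500 trees, which has test values of 0.3865 and 0.9089. The misclassification rates are also similar: 0.1848 for the out-of-bag data compared with 0.1750 for the TreeNet® test data.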
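
The comparison between the two model summary tables can be made explicit with a short sketch. The values are copied from the tables (out-of-bag for Random Forests®, test for TreeNet®). The dictionary keys and the helper `intervals_overlap` are hypothetical names. Checking whether the two 95% confidence intervals overlap is only an informal aid here, not a formal test of the difference between the areas under the ROC curve.

```python
# Values copied from the two model summary tables.
random_forest_oob = {
    "avg_neg_loglik": 0.4004,
    "auc": 0.9028,
    "auc_ci": (0.8693, 0.9363),
    "lift": 2.1079,
    "misclass": 0.1848,
}
treenet_test = {
    "avg_neg_loglik": 0.3865,
    "auc": 0.9089,
    "auc_ci": (0.8757, 0.9421),
    "lift": 2.1087,
    "misclass": 0.1750,
}

def intervals_overlap(a, b):
    """True when the closed intervals a and b share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# TreeNet test value minus Random Forests out-of-bag value, per statistic.
differences = {
    key: round(treenet_test[key] - random_forest_oob[key], 4)
    for key in ("avg_neg_loglik", "auc", "lift", "misclass")
}
ci_overlap = intervals_overlap(random_forest_oob["auc_ci"], treenet_test["auc_ci"])

for key, diff in differences.items():
    print(f"{key}: {diff:+.4f}")
print(f"AUC confidence intervals overlap: {ci_overlap}")
```

All four differences are small relative to the statistics themselves, and the two intervals for the area under the ROC curve overlap, which is consistent with describing the two models as similar.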
Actual Class | Count | Predicted Yes (Out-of-Bag) | Predicted No (Out-of-Bag) | % Correct |
---|---|---|---|---|
Yes (Event) | 139 | 109 | 30 | 78.42 |
No | 164 | 26 | 138 | 84.15 |
All | 303 | 135 | 168 | 81.52 |
Statistics | Out-of-Bag (%) |
---|---|
True positive rate (sensitivity or power) | 78.42 |
False positive rate (type I error) | 15.85 |
False negative rate (type II error) | 21.58 |
True negative rate (specificity) | 84.15 |
The confusion matrix shows how well the model separates the classes. In this example, the probability that an event is predicted correctly is 78.42% (109 of 139 events). The probability that a nonevent is predicted correctly is 84.15% (138 of 164 nonevents).
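
As a check on the arithmetic, the four rates can be recomputed from the out-of-bag counts in the confusion matrix. This is a minimal sketch; the variable names and the `pct` helper are illustrative and are not part of the software's output.

```python
# Out-of-bag counts copied from the confusion matrix (rows are actual classes).
tp, fn = 109, 30   # actual Yes (Event): predicted Yes, predicted No
fp, tn = 26, 138   # actual No: predicted Yes, predicted No

def pct(numerator, denominator):
    """Percentage rounded to two decimals, as in the tables."""
    return round(100 * numerator / denominator, 2)

rates = {
    "true_positive_rate": pct(tp, tp + fn),    # sensitivity or power
    "false_positive_rate": pct(fp, fp + tn),   # type I error
    "false_negative_rate": pct(fn, tp + fn),   # type II error
    "true_negative_rate": pct(tn, fp + tn),    # specificity
    "overall_correct": pct(tp + tn, tp + fn + fp + tn),
}

for name, value in rates.items():
    print(f"{name}: {value:.2f}")
```

The results reproduce the table values: 78.42, 15.85, 21.58, and 84.15, with 81.52% correct overall.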
Actual Class | Count | Misclassed (Out-of-Bag) | % Error (Out-of-Bag) |
---|---|---|---|
Yes (Event) | 139 | 30 | 21.58 |
No | 164 | 26 | 15.85 |
All | 303 | 56 | 18.48 |
The misclassification rate helps indicate whether the model will predict new observations accurately. For the prediction of events, the misclassification error is 21.58%. For the prediction of nonevents, the misclassification error is 15.85%. Overall, the misclassification error is 18.48%.
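
The overall error is the average of the class errors weighted by class size, so it sits closer to the error of the larger class (nonevents). The sketch below uses the counts from the misclassification table; the dictionary layout and variable names are illustrative assumptions.

```python
# Counts copied from the out-of-bag misclassification table.
classes = {
    "Yes (Event)": {"count": 139, "misclassed": 30},
    "No": {"count": 164, "misclassed": 26},
}

total = sum(c["count"] for c in classes.values())

# Per-class error rates as proportions.
class_error = {
    name: c["misclassed"] / c["count"] for name, c in classes.items()
}

# Overall error computed directly from the counts.
overall_from_counts = sum(c["misclassed"] for c in classes.values()) / total

# The same quantity as a class-size-weighted average of the class errors.
overall_weighted = sum(
    (c["count"] / total) * class_error[name] for name, c in classes.items()
)

for name, err in class_error.items():
    print(f"{name}: {100 * err:.2f}%")
print(f"Overall: {100 * overall_from_counts:.2f}%")
```

Both calculations give the same overall rate, 56 misclassified rows out of 303.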