This command is available with the Predictive Analytics Module.
A team of researchers collects and publishes detailed information about factors that affect heart disease. Variables include age, sex, cholesterol levels, maximum heart rate, and more. This example is based on a public data set that gives detailed information about heart disease. The original data are from archive.ics.uci.edu.
After initial exploration with CART® Classification to identify the important predictors, the researchers use both TreeNet® Classification and Random Forests® Classification to create more intensive models from the same data set. The researchers compare the model summary table and the ROC plot from the results to evaluate which model provides a better prediction outcome. For results from the other analyses, go to Example of CART® Classification and Example of Random Forests® Classification.
For this analysis, Minitab grows 300 trees and the optimal number of trees is 298. Because the optimal number of trees is close to the maximum number of trees that the model grows, the researchers repeat the analysis with more trees.
Total predictors | 13 |
---|---|
Important predictors | 13 |
Number of trees grown | 300 |
Optimal number of trees | 298 |
Statistics | Training | Test |
---|---|---|
Average -loglikelihood | 0.2556 | 0.3881 |
Area under ROC curve | 0.9796 | 0.9089 |
95% CI | (0.9664, 0.9929) | (0.8759, 0.9419) |
Lift | 2.1799 | 2.1087 |
Misclassification rate | 0.0891 | 0.1617 |
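The area under the ROC curve in the table above summarizes how well the predicted event probabilities rank events above nonevents. As a rough illustration (not the heart disease data), the statistic can be computed from predicted probabilities with scikit-learn's `roc_auc_score`; the four-row data set here is a hypothetical hand-checkable example:

```python
# Illustration of the "Area under ROC curve" statistic with
# scikit-learn. With 2 events and 2 nonevents there are 4
# event/nonevent pairs; the event outranks the nonevent in 3 of
# them, so the AUC is 3/4.
from sklearn.metrics import roc_auc_score

y_true = [1, 1, 0, 0]          # 1 = event, 0 = nonevent
p_event = [0.9, 0.4, 0.6, 0.1]  # predicted P(event) for each row
print(roc_auc_score(y_true, p_event))  # → 0.75
```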
For this analysis, Minitab grows 500 trees and the optimal number of trees is 351. The best model uses a learning rate of 0.01, a subsample fraction of 0.5, and a maximum of 6 terminal nodes per tree.
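TreeNet® is a proprietary implementation of gradient boosting, but the settings above map closely onto scikit-learn's `GradientBoostingClassifier`. The following is a rough sketch of an analogous open-source fit, using synthetic stand-in data of the same shape as the heart disease set (303 rows, 13 predictors), not the original data:

```python
# Hypothetical open-source analogue of the boosted-tree settings
# above, on synthetic stand-in data (not the UCI heart disease set).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=303, n_features=13, random_state=1)

model = GradientBoostingClassifier(
    n_estimators=500,      # number of trees grown
    learning_rate=0.01,    # learning rate
    subsample=0.5,         # subsample fraction (stochastic boosting)
    max_leaf_nodes=6,      # maximum terminal nodes per tree
    min_samples_leaf=3,    # minimum terminal node size
    random_state=1,
)
model.fit(X, y)
print(model.n_estimators_)  # trees actually fit
```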
Criterion for selecting optimal number of trees | Maximum loglikelihood |
---|---|
Model validation | 5-fold cross-validation |
Learning rate | 0.01 |
Subsample selection method | Completely random |
Subsample fraction | 0.5 |
Maximum terminal nodes per tree | 6 |
Minimum terminal node size | 3 |
Number of predictors selected for node splitting | Total number of predictors = 13 |
Rows used | 303 |
Variable | Class | Count | %
---|---|---|---
Heart Disease | Yes (Event) | 139 | 45.87
 | No | 164 | 54.13
 | All | 303 | 100.00
Criterion for selecting optimal number of trees | Maximum loglikelihood |
---|---|
Model validation | 5-fold cross-validation |
Learning rate | 0.001, 0.01, 0.1 |
Subsample fraction | 0.5, 0.7 |
Maximum terminal nodes per tree | 6 |
Minimum terminal node size | 3 |
Number of predictors selected for node splitting | Total number of predictors = 13 |
Rows used | 303 |
Variable | Class | Count | %
---|---|---|---
Heart Disease | Yes (Event) | 139 | 45.87
 | No | 164 | 54.13
 | All | 303 | 100.00
Model | Optimal Number of Trees | Average -Loglikelihood | Area Under ROC Curve | Misclassification Rate | Learning Rate | Subsample Fraction | Maximum Terminal Nodes |
---|---|---|---|---|---|---|---|
1 | 500 | 0.542902 | 0.902956 | 0.171749 | 0.001 | 0.5 | 6 |
2* | 351 | 0.386536 | 0.908920 | 0.175027 | 0.010 | 0.5 | 6 |
3 | 33 | 0.396555 | 0.900782 | 0.161694 | 0.100 | 0.5 | 6 |
4 | 500 | 0.543292 | 0.894178 | 0.178142 | 0.001 | 0.7 | 6 |
5 | 374 | 0.389607 | 0.906620 | 0.165082 | 0.010 | 0.7 | 6 |
6 | 39 | 0.393382 | 0.901399 | 0.174973 | 0.100 | 0.7 | 6 |
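The table above evaluates 6 candidate models, one per combination of learning rate (0.001, 0.01, 0.1) and subsample fraction (0.5, 0.7), using 5-fold cross-validation with maximum loglikelihood as the selection criterion. A comparable sweep can be sketched with scikit-learn's `GridSearchCV`, scoring by `neg_log_loss` (maximizing loglikelihood is minimizing average negative loglikelihood). Synthetic data stand in for the heart disease set, and the tree count is reduced to keep the sketch fast, so this is an illustration of the procedure rather than a reproduction of the table:

```python
# Hypothetical 3x2 hyperparameter sweep (learning rate x subsample
# fraction) with 5-fold cross-validation, scored by log-likelihood.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=303, n_features=13, random_state=1)

grid = GridSearchCV(
    GradientBoostingClassifier(n_estimators=50, max_leaf_nodes=6,
                               min_samples_leaf=3, random_state=1),
    param_grid={"learning_rate": [0.001, 0.01, 0.1],
                "subsample": [0.5, 0.7]},
    scoring="neg_log_loss",  # maximize loglikelihood = minimize log loss
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)  # the analogue of the starred row
```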
Total predictors | 13 |
---|---|
Important predictors | 13 |
Number of trees grown | 500 |
Optimal number of trees | 351 |
Statistics | Training | Test |
---|---|---|
Average -loglikelihood | 0.2341 | 0.3865 |
Area under ROC curve | 0.9825 | 0.9089 |
95% CI | (0.9706, 0.9945) | (0.8757, 0.9421) |
Lift | 2.1799 | 2.1087 |
Misclassification rate | 0.0759 | 0.1750 |
Total predictors | 13 |
---|---|
Important predictors | 13 |
Statistics | Out-of-Bag |
---|---|
Average -loglikelihood | 0.4004 |
Area under ROC curve | 0.9028 |
95% CI | (0.8693, 0.9363) |
Lift | 2.1079 |
Misclassification rate | 0.1848 |
The Model summary table shows that, with 351 trees, the average negative loglikelihood is approximately 0.23 for the training data and approximately 0.39 for the test data. These statistics indicate performance similar to the model that Minitab Random Forests® creates. The misclassification rates are also similar.
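The average negative loglikelihood in the table is the mean, over all rows, of the negative log of the predicted probability assigned to each row's actual class; in scikit-learn this is `log_loss`. A small hand-checkable illustration on made-up values:

```python
# Average negative loglikelihood on 4 hypothetical rows.
# Row contributions: -ln(0.9), -ln(0.8), -ln(0.8), -ln(0.9);
# their mean is about 0.1643.
from sklearn.metrics import log_loss

y_true = [1, 0, 1, 0]           # 1 = event, 0 = nonevent
p_event = [0.9, 0.2, 0.8, 0.1]  # predicted P(event) for each row
print(round(log_loss(y_true, p_event), 4))  # → 0.1643
```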
Actual Class | Count | Predicted Yes (Training) | Predicted No (Training) | % Correct (Training) | Predicted Yes (Test) | Predicted No (Test) | % Correct (Test)
---|---|---|---|---|---|---|---
Yes (Event) | 139 | 124 | 15 | 89.21 | 110 | 29 | 79.14
No | 164 | 8 | 156 | 95.12 | 24 | 140 | 85.37
All | 303 | 132 | 171 | 92.41 | 134 | 169 | 82.51
Statistics | Training (%) | Test (%) |
---|---|---|
True positive rate (sensitivity or power) | 89.21 | 79.14 |
False positive rate (type I error) | 4.88 | 14.63 |
False negative rate (type II error) | 10.79 | 20.86 |
True negative rate (specificity) | 95.12 | 85.37 |
The confusion matrix shows how well the model correctly separates the classes. In this example, for the test data, the probability that an event is predicted correctly is 79.14%. The probability that a nonevent is predicted correctly is 85.37%.
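These test-data rates follow directly from the confusion matrix counts (110 events predicted Yes, 29 predicted No; 140 nonevents predicted No, 24 predicted Yes):

```python
# Sensitivity and specificity from the test confusion matrix counts.
tp, fn = 110, 29   # events predicted correctly / incorrectly
tn, fp = 140, 24   # nonevents predicted correctly / incorrectly

sensitivity = tp / (tp + fn)   # true positive rate
specificity = tn / (tn + fp)   # true negative rate
print(f"{sensitivity:.2%}, {specificity:.2%}")  # → 79.14%, 85.37%
```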
Actual Class | Count | Misclassed (Training) | % Error (Training) | Misclassed (Test) | % Error (Test)
---|---|---|---|---|---
Yes (Event) | 139 | 15 | 10.79 | 29 | 20.86
No | 164 | 8 | 4.88 | 24 | 14.63
All | 303 | 23 | 7.59 | 53 | 17.49
The misclassification rate helps indicate whether the model will predict new observations accurately. For the test data, the misclassification error is 20.86% for the prediction of events, 14.63% for the prediction of nonevents, and 17.49% overall.
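The overall test rate is simply the misclassified rows over the total number of rows:

```python
# Overall test misclassification rate: (29 + 24) / 303.
misclassed_events, misclassed_nonevents, total = 29, 24, 303
rate = (misclassed_events + misclassed_nonevents) / total
print(f"{rate:.2%}")  # → 17.49%
```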