This command is available with the Predictive Analytics Module.
The analysis builds as many trees as you specify and uses the information in each tree to make a small modification to the model. If the analysis includes a validation method, then the analysis calculates the value of the model selection criterion for the training data and for the test data at each number of trees. The number of trees that gives the optimal criterion value for the test data is the number of trees in the optimal model.
Optimization criteria, such as the maximum loglikelihood, tend to be optimistic when you calculate them with the same data that you use to fit the model. Model validation methods leave a portion of the data out of the model fitting process, then calculate statistics that evaluate the performance of the model on the omitted data. As a result, validation provides a better estimate of how well a model performs on new data. Depending on your selection for the analysis, the model selection criterion is the maximum loglikelihood, the maximum area under the ROC curve, or the minimum misclassification rate. Minitab offers two validation methods: k-fold cross-validation and validation with a separate test set.
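As an illustration of the three criteria, the following minimal sketch computes each one for a binary response from predicted event probabilities. The function names, the 0.5 classification threshold, and the exact formulas are assumptions made for this example. They are not taken from Minitab and may differ from what Minitab calculates.

```python
import math

def log_likelihood(y_true, p_event):
    """Binomial loglikelihood of 0/1 labels given predicted event probabilities."""
    eps = 1e-15  # assumption: clip probabilities to avoid log(0)
    total = 0.0
    for y, p in zip(y_true, p_event):
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

def misclassification_rate(y_true, p_event, threshold=0.5):
    """Fraction of cases whose predicted class differs from the observed class."""
    wrong = sum(1 for y, p in zip(y_true, p_event) if (p >= threshold) != (y == 1))
    return wrong / len(y_true)

def roc_auc(y_true, p_event):
    """Area under the ROC curve as the chance that a random event case
    scores higher than a random non-event case (ties count as half)."""
    pos = [p for y, p in zip(y_true, p_event) if y == 1]
    neg = [p for y, p in zip(y_true, p_event) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical held-out cases, invented for illustration only.
y_holdout = [1, 0, 1, 0]
p_holdout = [0.9, 0.2, 0.4, 0.6]
print(log_likelihood(y_holdout, p_holdout))
print(misclassification_rate(y_holdout, p_holdout))
print(roc_auc(y_holdout, p_holdout))
```

Larger values are better for the loglikelihood and the area under the ROC curve, while smaller values are better for the misclassification rate.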
K-fold cross-validation is the default method in Minitab when the data have 2000 cases or fewer. Because the model fitting process repeats K times, once with each fold held out, cross-validation is usually slower than validation with a test set.
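The sketch below shows one generic way to partition cases into K folds and why the fitting work repeats K times. It illustrates k-fold cross-validation in general. The function name `k_fold_indices`, the random shuffling, and the seed are assumptions for this example and do not describe how Minitab assigns cases to folds.

```python
import random

def k_fold_indices(n_cases, k, seed=0):
    """Return k (train, test) index pairs; each case is in exactly one test fold."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)  # assumption: random fold assignment
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits

# One model fit per fold, so the fitting process runs k times.
for train, test in k_fold_indices(n_cases=10, k=5):
    print(len(train), "training cases,", len(test), "held-out cases")
```

Each pass fits the model on K - 1 folds and evaluates it on the remaining fold, which is the source of the extra computation time compared with a single test set.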
In validation with a test set, a portion of the data is set aside for validation. The remaining data are the training set. First, Minitab grows the sequence of trees with the training set. Then, Minitab uses the test set to calculate the value of the model selection criterion for each number of trees. The number of trees with the best criterion value defines the optimal model.
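The selection step can be sketched as follows, assuming the criterion values for the test set are already available as a list with one entry per number of trees. The function name, the tie-breaking rule (the smallest number of trees wins), and the numbers are hypothetical and serve only as an illustration.

```python
def optimal_number_of_trees(test_criterion, maximize=True):
    """test_criterion[i] is the test-set criterion value for a model with i + 1 trees.

    Use maximize=True for loglikelihood or area under the ROC curve,
    and maximize=False for the misclassification rate.
    """
    best = max(test_criterion) if maximize else min(test_criterion)
    # list.index returns the first match, so ties go to the smaller model (assumption).
    return test_criterion.index(best) + 1

# Invented test-set loglikelihood values for models with 1 to 6 trees.
test_loglik = [-60.0, -48.2, -41.5, -40.1, -40.9, -42.3]
print(optimal_number_of_trees(test_loglik, maximize=True))
```

In this made-up sequence, the test-set criterion improves through the fourth tree and then worsens, so the optimal model stops at four trees even though the training-set criterion would typically keep improving.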
Without any validation, Minitab uses the entire data set to fit the model. The final model contains the largest number of trees.