Selection of the optimal number of trees for Fit Model and Discover Key Predictors with TreeNet® Classification

Note

This command is available with the Predictive Analytics Module.

The analysis builds as many trees as you specify. Each tree makes a small modification to the model. If the analysis includes a validation method, then the analysis calculates the value of the model selection criterion for the training data and for the test data at each number of trees. The number of trees with the optimal criterion value on the test data becomes the number of trees in the optimal model.

Model validation methods

Optimization criteria, such as the maximum loglikelihood, tend to be optimistic when you calculate them with the same data that you use to fit a model. Model validation methods leave a portion of the data out of the model fitting process, then calculate statistics that evaluate the performance of the model on the omitted data. As a result, model validation techniques provide a better estimate of how well models perform on new data. Depending on your selection for the analysis, the model selection criterion is the maximum loglikelihood, the maximum area under the ROC curve, or the minimum misclassification rate. Minitab offers two validation methods: K-fold cross-validation and validation with a separate test set.
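The three criteria can be illustrated with a short Python sketch. This is not Minitab's implementation: the function names, the binary 0/1 response coding, the 0.5 classification threshold, and the use of the average rather than the total loglikelihood are all assumptions made for the example.

```python
import math

def average_loglikelihood(y_true, p_hat):
    """Mean Bernoulli loglikelihood of the observed classes; larger is better."""
    eps = 1e-12  # keeps log() away from 0 (assumed safeguard)
    total = 0.0
    for y, p in zip(y_true, p_hat):
        p = min(max(p, eps), 1.0 - eps)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total / len(y_true)

def roc_auc(y_true, p_hat):
    """Area under the ROC curve from pairwise comparisons; larger is better."""
    pos = [p for y, p in zip(y_true, p_hat) if y == 1]
    neg = [p for y, p in zip(y_true, p_hat) if y == 0]
    wins = 0.0
    for p_pos in pos:
        for p_neg in neg:
            if p_pos > p_neg:
                wins += 1.0
            elif p_pos == p_neg:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

def misclassification_rate(y_true, p_hat, threshold=0.5):
    """Share of rows whose predicted class differs from the observed class; smaller is better."""
    wrong = sum(1 for y, p in zip(y_true, p_hat) if (1 if p >= threshold else 0) != y)
    return wrong / len(y_true)

# Illustrative values only: observed classes and predicted event probabilities.
y = [1, 0, 1, 0]
p = [0.9, 0.2, 0.4, 0.6]
print(average_loglikelihood(y, p), roc_auc(y, p), misclassification_rate(y, p))
```

The first two criteria are maximized and the third is minimized, so any code that selects a number of trees must know the direction of the criterion.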

The optimal number of trees with K-fold cross-validation

K-fold cross-validation is the default method in Minitab when the data have 2000 cases or fewer. Because the process repeats K times, cross-validation is usually slower than validation with a test set.

K-fold cross-validation procedure

To complete K-fold cross-validation, Minitab Statistical Software uses the following steps:
  1. Partition the data into K random subsets that are as close to equal in size as possible. The subsets are called folds.
  2. For fold k, k = 1, ..., K, grow the sequence of trees using the remaining K–1 folds of data. Calculate the value of the model selection criterion for each number of trees with the data in the kth fold.
  3. Repeat step 2 for all K folds.
  4. Average the values of the model selection criterion across K folds for each number of trees. The number of trees with the best average value makes the optimal model.
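
Steps 1 and 4 of the procedure above can be sketched in Python. This is a hedged illustration, not Minitab's code: `make_folds` and `select_number_of_trees` are hypothetical names, the shuffle-and-deal partitioning is one of several ways to form random folds, and the per-fold criterion values in the usage lines are made up. Steps 2 and 3 (growing the trees and scoring each held-out fold) are left to whatever model-fitting routine is in use.

```python
import random

def make_folds(n_rows, k, seed=0):
    """Step 1: shuffle the row indices and deal them into k folds of near-equal size."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def select_number_of_trees(fold_criteria, larger_is_better=True):
    """Step 4: average the criterion across folds and pick the best number of trees.

    fold_criteria[f][t] is the criterion value calculated on held-out fold f
    for the model that contains t + 1 trees.
    """
    n_folds = len(fold_criteria)
    n_trees = len(fold_criteria[0])
    averages = [sum(fold[t] for fold in fold_criteria) / n_folds for t in range(n_trees)]
    pick = max if larger_is_better else min
    best_index = pick(range(n_trees), key=lambda t: averages[t])
    return best_index + 1, averages

folds = make_folds(n_rows=10, k=3)
print([len(fold) for fold in folds])  # fold sizes differ by at most one row

# Made-up misclassification rates for 2 folds and models with 1, 2, and 3 trees.
rates = [[0.30, 0.20, 0.26],
         [0.40, 0.22, 0.20]]
best, averages = select_number_of_trees(rates, larger_is_better=False)
print(best, averages)
```

The `larger_is_better` flag reflects that the loglikelihood and the area under the ROC curve are maximized, while the misclassification rate is minimized.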

The optimal number of trees with a separate test set

In validation with a test set, a portion of the data is set aside for validation. The remaining data are the training set. First, Minitab grows the sequence of trees with the training set. Then, Minitab calculates the values of the model selection criterion for each number of trees using the test set. The number of trees with the best value makes the optimal model.
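
A minimal Python sketch of the selection step follows, assuming that the maximum loglikelihood is the criterion and that the response is coded 0/1. The function names and the staged probabilities are hypothetical and serve only to show the mechanics. In practice, the probabilities would come from the models grown on the training set.

```python
import math

def loglikelihood(y_true, p_hat):
    """Bernoulli loglikelihood of the observed classes; larger is better."""
    eps = 1e-12  # keeps log() away from 0 (assumed safeguard)
    total = 0.0
    for y, p in zip(y_true, p_hat):
        p = min(max(p, eps), 1.0 - eps)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

def best_number_of_trees(y_test, staged_probabilities):
    """Score the test set at each number of trees and return the best count.

    staged_probabilities[t] holds the predicted event probabilities for the
    test rows from the model that contains t + 1 trees.
    """
    values = [loglikelihood(y_test, p) for p in staged_probabilities]
    best_index = max(range(len(values)), key=lambda t: values[t])
    return best_index + 1, values

# Made-up test-set responses and predictions for models with 1, 2, and 3 trees.
y_test = [1, 0, 1]
staged = [[0.6, 0.4, 0.6],
          [0.8, 0.2, 0.7],
          [0.9, 0.1, 0.4]]
best, values = best_number_of_trees(y_test, staged)
print(best, values)
```

In these made-up numbers, the third tree lowers the test-set loglikelihood, so the model with two trees is selected even though more trees were grown.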

The optimal number of trees with no validation

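Without any validation, Minitab uses the entire data set to fit the model. Because no test data are available to compare models of different sizes, the final model contains the largest number of trees, which is the number of trees that you specify.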