You can specify that the optimal tree is the tree with the least
squared error or the tree with the least absolute deviation. The determination
of the tree with the best value of the chosen criterion depends on the
validation method.

For more details on the model validation methods and complexity parameters,
see Breiman, Friedman, Olshen and Stone (1984)^{1}.

Model summary statistics, such as R^{2}, tend to be optimistic when
you calculate them with the same data that you use to fit a model. Model
validation methods leave a portion of the data out of the model fitting
process, then calculate statistics that evaluate the performance of the model
on the omitted data. Model validation techniques provide a better estimate of
how well models perform on new data. Minitab offers two validation methods for
predictive analytics techniques: k-fold cross-validation and validation with a
separate test data set.

K-fold cross-validation is the default method in Minitab when the data have 5,000 cases or fewer. With this method, Minitab partitions the data into K subsets, which are called folds. K-fold cross-validation tends to work well with data sets that are relatively small compared to data sets that work well with a test data set. Because the process repeats K times, cross-validation is usually slower than validation with a test data set.
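The partitioning step can be sketched in a few lines of Python. This is only an illustration: the source does not describe how Minitab assigns cases to folds, so the random round-robin assignment, the `seed` argument, and the function names `make_folds` and `train_test_splits` are assumptions.

```python
import random

def make_folds(n_cases, k, seed=0):
    """Split case indices 0..n_cases-1 into k disjoint folds of near-equal size."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin so fold sizes differ by at most one.
    return [indices[i::k] for i in range(k)]

def train_test_splits(folds):
    """Yield (training_indices, held_out_indices) once per fold."""
    for i, held_out in enumerate(folds):
        # The training part is every case that is not in the held-out fold.
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, held_out

# Hypothetical example: 23 cases, 5 folds.
folds = make_folds(n_cases=23, k=5)
splits = list(train_test_splits(folds))
```

Each case lands in exactly one fold, so every case is held out exactly once and used for fitting in the other K - 1 rounds.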

To complete k-fold cross-validation, Minitab produces 1 + k sequences of subtrees. One sequence of subtrees, the master sequence, uses the entire training data set. The other k sequences are for the k folds. For each fold, the sequence of subtrees uses (k – 1)/k of the cases in the training data set.

Each sequence consists of a finite sequence of nested subtrees. Each fold
has a finite sequence of complexity parameters
*α*_{d} ≤ *α* ≤ *α*_{d + 1} that correspond to the largest tree and the
subtrees in the sequence. The sequence that is for the full data set has
complexity parameters *β*_{d} ≤ *β* ≤ *β*_{d + 1}, where
*d* = 0, 1, ..., D, and
*β*_{0} is the parameter for the largest tree in the sequence.

For any subtree in the master sequence, assume the corresponding
complexity parameters are *β*_{d} and *β*_{d + 1}.
Let *α* = √(*β*_{d} *β*_{d + 1}), the geometric mean of the two parameters.
Then, Minitab uses this alpha to find the k corresponding subtrees from the k
folds. For each fold, calculate the chosen criterion for the subtree using the
formula in
Methods and formulas for the model summary in CART® Regression.
The average of the criterion across k folds is the estimated value for the
subtree in the master sequence. Repeat the calculation of the criterion for
each subtree in the master sequence. The subtree with the minimum average value
is the optimal tree.
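The selection procedure above can be sketched as follows. The sketch rests on several assumptions that the source does not state: that *α* for a master subtree is the geometric mean of *β*_{d} and *β*_{d + 1} (the convention in Breiman et al.), that the last subtree in the master sequence uses an infinite *α*, and that the corresponding fold subtree is the one whose parameter interval contains *α*. The function names and all numbers are hypothetical.

```python
import bisect
import math

def geometric_midpoints(betas):
    """Return one alpha per master subtree for parameters beta_0 <= ... <= beta_D.

    Assumption: alpha_d = sqrt(beta_d * beta_{d+1}). The last subtree has no
    upper neighbor, so its alpha is taken as infinity.
    """
    alphas = [math.sqrt(lo * hi) for lo, hi in zip(betas, betas[1:])]
    alphas.append(math.inf)
    return alphas

def subtree_index(fold_alphas, alpha):
    """Index d of the fold subtree with fold_alphas[d] <= alpha < fold_alphas[d+1]."""
    return max(bisect.bisect_right(fold_alphas, alpha) - 1, 0)

def select_optimal(betas, fold_alphas, fold_criteria):
    """Return (index of the optimal master subtree, averaged criterion per subtree)."""
    averages = []
    for alpha in geometric_midpoints(betas):
        # Look up the matching subtree in every fold and average its criterion.
        values = [crit[subtree_index(a, alpha)]
                  for a, crit in zip(fold_alphas, fold_criteria)]
        averages.append(sum(values) / len(values))
    best = min(range(len(averages)), key=averages.__getitem__)
    return best, averages

# Hypothetical master sequence of 4 subtrees and k = 2 folds.
betas = [0.0, 1.0, 4.0, 16.0]
fold_alphas = [[0.0, 1.5, 5.0], [0.0, 0.5, 3.0, 20.0]]
fold_criteria = [[10.0, 6.0, 9.0], [12.0, 8.0, 7.0, 15.0]]  # criterion on held-out cases
best, averages = select_optimal(betas, fold_alphas, fold_criteria)
```

With these made-up numbers the averaged criterion is lowest for the second subtree in the master sequence (index 1), so that subtree would be reported as optimal.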

In validation with a test data set, a portion of the data is set aside for validation. This portion of the data is the test data set. The remaining data are the training data set. First, Minitab fits all the trees with the training data set. Then, Minitab calculates either the mean square error or the absolute deviation for the test data set for each tree. The tree with the optimal value of the criterion for the test data set is the optimal tree.
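A minimal sketch of the comparison step, assuming each candidate tree has already produced predictions for the test cases. The criteria here are taken as the mean of the squared or absolute residuals over the test cases, which is an assumption; the exact formulas are in the model summary documentation. The tree labels, predictions, and function names are hypothetical.

```python
def mean_squared_error(actual, predicted):
    """Average of squared residuals over the test cases."""
    return sum((y - p) ** 2 for y, p in zip(actual, predicted)) / len(actual)

def mean_absolute_deviation(actual, predicted):
    """Average of absolute residuals over the test cases."""
    return sum(abs(y - p) for y, p in zip(actual, predicted)) / len(actual)

def best_tree(test_response, predictions_by_tree, criterion):
    """Score every candidate tree on the test data and return the minimizer."""
    scores = {name: criterion(test_response, preds)
              for name, preds in predictions_by_tree.items()}
    return min(scores, key=scores.get), scores

# Hypothetical test responses and predictions from three nested subtrees.
test_response = [3.0, 5.0, 7.0, 9.0]
predictions_by_tree = {
    "2 nodes": [4.0, 4.0, 8.0, 8.0],
    "3 nodes": [3.0, 5.0, 8.0, 8.0],
    "4 nodes": [3.0, 5.0, 7.0, 10.5],
}
by_mse, mse_scores = best_tree(test_response, predictions_by_tree, mean_squared_error)
by_mad, mad_scores = best_tree(test_response, predictions_by_tree, mean_absolute_deviation)
```

In this made-up example the two criteria disagree: squared error favors the 3-node tree because it penalizes the single large miss of the 4-node tree more heavily, while absolute deviation favors the 4-node tree.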

Without any validation, Minitab uses the entire data set to grow the sequence of subtrees. The subtree with the most terminal nodes has the least mean square error or the least absolute deviation and is the optimal tree.