Selection of the optimal number of basis functions for MARS® Regression

Note

This command is available with the Predictive Analytics Module.

The analysis builds as many basis functions as you specify. Each function makes a small modification to the model. If the analysis includes a validation method, then the analysis calculates the value of the model selection criterion for the training data and the test data for each number of basis functions. The optimal value from the test data determines the number of functions in the optimal model.

Model validation methods

Optimization criteria, such as the maximum R², tend to be optimistic when you calculate them with the same data that you use to fit a model. Model validation methods leave a portion of the data out of the model fitting process, then calculate statistics that evaluate the performance of the model on the omitted data. Model validation techniques provide a better estimate of how well models perform on new data. Depending on your selection of the loss function for the analysis, the criterion is the maximum R² or the least Mean Absolute Deviation (MAD). Minitab offers two validation methods: K-fold cross-validation and validation with a separate test set.
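
As an illustration of the two criteria, the following minimal sketch computes R² and MAD from observed and predicted responses. It uses the usual textbook definitions, which are an assumption here; the formulas that Minitab uses may differ in detail. The function names r_squared and mean_absolute_deviation are hypothetical.

```python
import numpy as np

def r_squared(y, y_pred):
    """R-squared as 1 - SSE/SST (textbook definition, assumed here)."""
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    sse = np.sum((y - y_pred) ** 2)      # error sum of squares
    sst = np.sum((y - y.mean()) ** 2)    # total sum of squares
    return float(1.0 - sse / sst)

def mean_absolute_deviation(y, y_pred):
    """Mean of the absolute differences between observed and predicted values."""
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y - y_pred)))

# Larger R-squared is better; smaller MAD is better.
observed = [1.0, 2.0, 3.0]
predicted = [2.0, 2.0, 2.0]
r2 = r_squared(observed, predicted)
mad = mean_absolute_deviation(observed, predicted)
```

When the values come from data that the model did not see during fitting, they play the role of the test criterion described above.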

The optimal model with K-fold cross-validation

K-fold cross-validation is the default method in Minitab when the data have 2000 cases or fewer. Because the process repeats K times, cross-validation is usually slower than validation with test data.

K-fold cross-validation procedure

To complete K-fold cross-validation, Minitab Statistical Software uses the following steps.
  1. Partition the data into K random subsets that are as equal in size as possible. The subsets are called folds.
  2. For fold k, k = 1, ..., K, add basis functions using the remaining K–1 folds of data. Calculate the value of the model selection criterion for the model with the data in the kth fold.
  3. Repeat step 2 for all K folds.
  4. Average the values of the model selection criterion across K folds for each number of functions. The number of functions with the best average value defines the optimal model.
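
The steps above can be sketched in Python. This is not Minitab's implementation: as a stand-in for the MARS forward pass, the sketch assumes a fixed, ordered list of candidate hinge functions max(0, x - knot) and fits each nested model by ordinary least squares. The names hinge_design and kfold_select, the knot locations, and the simulated data are all hypothetical.

```python
import numpy as np

def hinge_design(x, knots, m):
    """Constant term plus the first m hinge functions max(0, x - knot)."""
    cols = [np.ones_like(x)] + [np.maximum(0.0, x - t) for t in knots[:m]]
    return np.column_stack(cols)

def kfold_select(x, y, knots, k=5, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: random folds whose sizes differ by at most one case.
    folds = np.array_split(rng.permutation(len(x)), k)
    scores = np.zeros((k, len(knots)))
    # Steps 2 and 3: fit on K-1 folds, score on the held-out fold.
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        for m in range(1, len(knots) + 1):
            coef, *_ = np.linalg.lstsq(
                hinge_design(x[train_idx], knots, m), y[train_idx], rcond=None)
            pred = hinge_design(x[test_idx], knots, m) @ coef
            sse = np.sum((y[test_idx] - pred) ** 2)
            sst = np.sum((y[test_idx] - y[test_idx].mean()) ** 2)
            scores[i, m - 1] = 1.0 - sse / sst   # held-out R-squared
    # Step 4: average across folds, then take the best number of functions.
    mean_scores = scores.mean(axis=0)
    return int(np.argmax(mean_scores)) + 1, mean_scores

# Simulated data (assumption): a triangular signal plus noise.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 10.0, size=200)
y = np.where(x < 5.0, x, 10.0 - x) + rng.normal(0.0, 0.3, size=200)
knots = np.linspace(0.0, 9.0, 10)
best_m, mean_r2 = kfold_select(x, y, knots, k=5)
print(best_m, np.round(mean_r2, 3))
```

The sketch uses R² as the criterion, so the best average is the maximum. With MAD as the criterion, the best average would be the minimum instead.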

The optimal model with a separate test set

In validation with a test set, a portion of the data is set aside for validation. The remaining data form the training set. First, Minitab adds basis functions with the training set. Then, Minitab calculates the values of the model selection criterion for each number of functions using the test set. The number of functions with the best value defines the optimal model.

The optimal model with no validation

Without any validation, Minitab uses the entire data set to fit the model. The final model usually contains the largest number of basis functions.
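
One way to see why the largest model tends to win without validation: for nested least-squares fits, the training R² cannot decrease when a function is added. The sketch below illustrates this with a hypothetical stand-in for MARS (fixed candidate hinge functions and simulated data); it is not Minitab's implementation.

```python
import numpy as np

def hinge_design(x, knots, m):
    """Constant term plus the first m hinge functions max(0, x - knot)."""
    cols = [np.ones_like(x)] + [np.maximum(0.0, x - t) for t in knots[:m]]
    return np.column_stack(cols)

# Simulated data (assumption): a triangular signal plus noise.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 10.0, size=200)
y = np.where(x < 5.0, x, 10.0 - x) + rng.normal(0.0, 0.3, size=200)
knots = np.linspace(0.0, 9.0, 10)

train_r2 = np.zeros(len(knots))
sst = np.sum((y - y.mean()) ** 2)
for m in range(1, len(knots) + 1):
    design = hinge_design(x, knots, m)
    # Fit and evaluate on the same, entire data set (no validation).
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    sse = np.sum((y - design @ coef) ** 2)
    train_r2[m - 1] = 1.0 - sse / sst

# Training R-squared never decreases as functions are added (up to rounding).
print(np.round(train_r2, 4))
```

Because the criterion is computed on the fitting data, it keeps rewarding extra functions, which is the optimism that the validation methods are meant to counter.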