Cross-validation in PLS regression

What is cross-validation?

Cross-validation estimates the predictive ability of candidate models to help you determine how many components to retain in your model. Use cross-validation when you do not know the optimal number of components. When the data contain multiple response variables, Minitab validates the components for all responses at the same time.

Cross-validation methods

Minitab can perform three different methods for cross-validation:
Leave-one-out
Calculates potential models excluding one observation at a time. For large data sets, this method can be time-consuming, because it recalculates the models as many times as there are observations.
Leave-group-out of size
Calculates the models excluding multiple observations at a time, reducing the number of times it has to recalculate a model. This method is most appropriate when you have a large data set.
Leave out as specified in column
Calculates the models while omitting, as a group, all observations that have the same number in the group identifier column, which you create in the worksheet. This method lets you specify which observations are omitted together. For example, if the group identifier column includes the numbers 1, 2, and 3, all observations with 1 are omitted together and the model is recalculated. Next, all observations with 2 are omitted and the model is recalculated, and so on. In this case, the model is recalculated a total of 3 times. The group identifier column must be the same length as your response and predictor columns and cannot contain missing values.
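The three methods differ only in which observations are omitted together. The following Python sketch illustrates that idea; it is not Minitab's implementation. The function name omission_groups and its parameters are hypothetical, and the way leave-group-out forms its groups (consecutive rows, with a smaller final group if the size does not divide evenly) is an assumption.

```python
def omission_groups(n_obs, method, group_size=None, group_ids=None):
    """Return a list of index lists; each inner list is omitted together."""
    if method == "leave-one-out":
        # One recalculation per observation.
        return [[i] for i in range(n_obs)]

    if method == "leave-group-out":
        # Assumption: consecutive rows form each group.
        if group_size is None or group_size < 1:
            raise ValueError("group_size must be a positive integer")
        return [list(range(start, min(start + group_size, n_obs)))
                for start in range(0, n_obs, group_size)]

    if method == "column":
        # The identifier column must match the data length and have no gaps.
        if group_ids is None or len(group_ids) != n_obs:
            raise ValueError("group_ids must have one entry per observation")
        if any(g is None for g in group_ids):
            raise ValueError("group_ids cannot contain missing values")
        groups = {}
        for i, g in enumerate(group_ids):
            groups.setdefault(g, []).append(i)
        return [groups[g] for g in sorted(groups)]

    raise ValueError("unknown method: " + str(method))


# Identifier column with the numbers 1, 2, and 3: three recalculations.
ids = [1, 2, 3, 1, 2, 3]
print(omission_groups(6, "column", group_ids=ids))
```

With six observations, leave-one-out yields six groups, while the identifier column above yields three, so the model would be recalculated three times.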

Cross-validation procedure

For each potential model, Minitab does the following:
1. Omits one observation or group of observations, depending on the cross-validation method.
2. Recalculates the model without the observation/group of observations.
3. Predicts the response, or the cross-validated fitted value, for the omitted observation/group of observations using the recalculated model and calculates the cross-validated residual value.
4. Repeats steps 1 - 3 until every observation has been omitted once and has a cross-validated fitted value.
5. Calculates the prediction sum of squares (PRESS) and predicted R2 values.
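Steps 1 - 5 can be sketched in Python for a single response as follows. This is a minimal illustration under stated assumptions, not Minitab's code: a least-squares line with one predictor (fit_line) stands in for the PLS model so the example stays self-contained, all function names are hypothetical, and predicted R2 is assumed to follow the usual definition 1 - PRESS / total sum of squares.

```python
def fit_line(xs, ys):
    """Stand-in model (NOT PLS): least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def cross_validate(xs, ys, groups, fit=fit_line):
    """Return cross-validated fits, residuals, PRESS, and predicted R2."""
    cv_fits = [None] * len(ys)
    for omitted in groups:
        held_out = set(omitted)
        # Step 1: omit one observation or group of observations.
        keep = [i for i in range(len(ys)) if i not in held_out]
        # Step 2: recalculate the model without them.
        predict = fit([xs[i] for i in keep], [ys[i] for i in keep])
        # Step 3: predict the response for the omitted observations.
        for i in omitted:
            cv_fits[i] = predict(xs[i])
    # Step 4 is the loop itself: every observation is omitted exactly once.
    cv_resids = [y - f for y, f in zip(ys, cv_fits)]
    # Step 5: PRESS and predicted R2 (assumed: 1 - PRESS / SS total).
    press = sum(r * r for r in cv_resids)
    mean_y = sum(ys) / len(ys)
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    return cv_fits, cv_resids, press, 1 - press / ss_total


# Made-up data, leave-one-out groups.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
loo = [[i] for i in range(len(ys))]
fits, resids, press, pred_r2 = cross_validate(xs, ys, loo)
print(press, pred_r2)
```

To compare candidate models, the same loop would be run once per number of components, with a PLS fit in place of fit_line.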

After doing steps 1 - 5 for each model, Minitab selects the model with the number of components that produces the highest predicted R2 and lowest PRESS. With multiple response variables, Minitab selects the model with the highest average predicted R2 and lowest average PRESS.

If you do not use cross-validation, Minitab sets the number of components to 10 or to the number of predictors in your model, whichever is less.
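The selection rule and the default can be sketched in Python as follows. The function names are hypothetical, the predicted R2 values are made up purely for illustration, and the tie-breaking rule (prefer fewer components) is an assumption, since the text does not say how ties are resolved.

```python
def default_components(n_predictors):
    """Number of components used when cross-validation is not performed."""
    return min(10, n_predictors)


def select_components(pred_r2_by_count):
    """Pick the component count with the highest average predicted R2.

    pred_r2_by_count maps a number of components to a list of predicted R2
    values, one per response. With a single response the list has one entry.
    """
    best_count, best_avg = None, None
    for count in sorted(pred_r2_by_count):
        values = pred_r2_by_count[count]
        avg = sum(values) / len(values)
        # Strict comparison: on a tie, the smaller count wins (assumption).
        if best_avg is None or avg > best_avg:
            best_count, best_avg = count, avg
    return best_count, best_avg


# Made-up predicted R2 values for two responses and 1 to 3 components.
candidates = {1: [0.50, 0.40], 2: [0.80, 0.70], 3: [0.75, 0.65]}
print(select_components(candidates)[0])
print(default_components(4), default_components(25))
```

For a single response, the total sum of squares is the same for every candidate model, so the model with the highest predicted R2 is also the one with the lowest PRESS.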

Cross-validation statistics

When you do cross-validation, Minitab displays an additional summary table that includes the following statistics:
Cross-validated fitted values

In PLS regression, the cross-validated fitted value for an observation is its predicted response from a model that was calculated with that observation excluded. The cross-validated fitted values are calculated during cross-validation and vary based on how many observations are omitted each time the model is recalculated.

Use cross-validated fitted values to assess how well your model predicts data that were not used to calculate it. Cross-validated fitted values are analogous to ordinary fitted values, which indicate how well your model fits the data.

Cross-validated residuals

In PLS regression, the cross-validated residuals are the differences between the actual responses and the cross-validated fitted values. The cross-validated residual value varies based on how many observations are omitted each time the model is recalculated during cross-validation.

The cross-validated residuals measure the model's predictive ability. Minitab uses cross-validated residuals to calculate the PRESS statistic.
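In symbols, the relationship between the cross-validated residuals and PRESS can be written as below. The notation is an assumption introduced here for illustration: y_i is the actual response, \hat{y}_{(i)} is the cross-validated fitted value for observation i, e_{(i)} is its cross-validated residual, and n is the number of observations. The predicted R2 line assumes the usual definition based on the total sum of squares.

```latex
e_{(i)} = y_i - \hat{y}_{(i)}, \qquad
\mathrm{PRESS} = \sum_{i=1}^{n} e_{(i)}^{2}, \qquad
R^{2}_{\mathrm{pred}} = 1 - \frac{\mathrm{PRESS}}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^{2}}
```

Smaller cross-validated residuals give a smaller PRESS and therefore a larger predicted R2.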