Use the percent of error statistics to examine how much of the total error in the tree comes from the worst fits. When the analysis uses a validation technique, you can also compare the statistics of the tree for the training data and the test data.

Each row of the table shows the error statistics for the given percentage of
residuals. The percent of the Mean Squared Error (MSE) that comes from the
largest residuals is usually higher than the percent for the other two
statistics. MSE uses the squares of the errors in the calculations, so the most
extreme observations typically have the greatest influence on the statistic.
Large differences between the percent of error for MSE and the other two
measures can indicate that the tree is sensitive to the choice of
node-splitting criterion: least squared error versus least absolute deviation.
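The effect of squaring can be sketched with simulated residuals. This is a hypothetical illustration, not Minitab's internal calculation: the residual values, the 1% cutoff, and the mix of mild and extreme errors are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(7)
# Hypothetical residuals: mostly small errors plus a few extreme ones.
residuals = np.concatenate([rng.normal(0, 1, 990), rng.normal(0, 10, 10)])

# Sort by absolute residual, largest first.
r = residuals[np.argsort(-np.abs(residuals))]

top = int(0.01 * len(r))  # the worst 1% of fits

# Share of each error statistic that comes from the worst 1%.
pct_mse = (r[:top] ** 2).sum() / (r ** 2).sum() * 100
pct_mad = np.abs(r[:top]).sum() / np.abs(r).sum() * 100

print(f"Worst 1% share of squared error:  {pct_mse:.1f}%")
print(f"Worst 1% share of absolute error: {pct_mad:.1f}%")
```

Because squaring amplifies the extreme residuals, the worst 1% accounts for a much larger share of the squared error than of the absolute error, which is why the MSE row of the table usually shows the highest percent.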

When you use a validation technique, Minitab calculates separate statistics
for the training data and for the test data. You can compare the statistics to
examine the relative performance of the tree on the training data and on new
data. The test statistics are usually a better measure of how the tree will
perform for new data.

A possible pattern is that a small percentage of the residuals accounts for a
large portion of the error in the data. For example, in the following table,
the total size of the data set is about 4,500, and from the perspective of
MSE, 1% of the data accounts for about 12% of the error. In such a case, the
roughly 45 cases that contribute most of the error to the tree represent the
most natural opportunity to improve the tree. Finding a way to improve the
fits for those cases leads to a relatively large increase in the overall
performance of the tree.

This condition can also indicate that you can have greater confidence in
nodes of the tree that do not have cases with the largest errors. Because most
of the error comes from a small number of cases, the fits for the other cases
are relatively more accurate.