Percent of error statistics due to largest residuals for CART® Regression

Use the percent of error statistics to examine how much of the error in the tree comes from the worst fits. When the analysis uses a validation technique, you can also compare the statistics of the tree for the training and test data.

Each row of the table shows the error statistics for the given percentage of residuals. The percent of the Mean Squared Error (MSE) that comes from the largest residuals is usually higher than the percent for the other two statistics, the mean absolute deviation (MAD) and the mean absolute percent error (MAPE). MSE uses the squares of the errors in the calculations, so the most extreme observations typically have the greatest influence on the statistic. Large differences between the percent of error for MSE and the other two measures can indicate that the tree is more sensitive to the choice between splitting the nodes with least squared error and splitting the nodes with least absolute deviation.
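The following Python sketch illustrates why the squared errors concentrate in the largest residuals. It is not Minitab's implementation. The function name `percent_of_error`, the ranking of cases by absolute residual, and the rounding up of the case count are all assumptions made for illustration.

```python
import math

def percent_of_error(actual, fitted, pct):
    """Percent of each error total that comes from the pct% largest residuals.

    Assumptions (not confirmed by the Minitab documentation): cases are ranked
    by absolute residual, and the number of cases is rounded up.
    """
    residuals = [a - f for a, f in zip(actual, fitted)]
    # Rank case indices from largest to smallest absolute residual.
    order = sorted(range(len(residuals)),
                   key=lambda i: abs(residuals[i]), reverse=True)
    count = math.ceil(len(residuals) * pct / 100)
    top = order[:count]

    squared = [r * r for r in residuals]
    absolute = [abs(r) for r in residuals]
    # Absolute percent error needs nonzero actual values.
    abs_pct = [abs(r / a) for r, a in zip(residuals, actual)]

    def share(values):
        return 100 * sum(values[i] for i in top) / sum(values)

    return {"count": count,
            "pct_mse": share(squared),
            "pct_mad": share(absolute),
            "pct_mape": share(abs_pct)}

# Hypothetical data: nine cases with a residual of 1 and one with a residual of 10.
actual = [10.0] * 10
fitted = [9.0] * 9 + [0.0]
result = percent_of_error(actual, fitted, 10)
print(result)
```

In this made-up data, the single worst case carries a much larger share of the squared error than of the absolute error, which is the same pattern that the MSE column shows relative to the MAD column.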

When you use a validation technique, Minitab calculates separate statistics for the training data and for the test data. You can compare the statistics to examine the relative performance of the tree on the training data and on new data. The test statistics are usually a better measure of how the tree will perform for new data.
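As a rough illustration of such a comparison, the following Python sketch takes the training and test % MSE values from the example table in this topic and finds the row with the largest gap. The variable names are hypothetical, and the gap is only one simple way to compare the two columns.

```python
# (% of largest residuals, training % MSE, test % MSE) from the example table.
rows = [
    (1.0, 12.0662, 11.7595),
    (2.0, 19.6105, 19.0639),
    (2.5, 22.6611, 22.0671),
    (3.0, 25.4267, 24.7926),
    (4.0, 30.3473, 29.7103),
    (5.0, 34.5866, 33.9523),
    (7.5, 43.2672, 43.0319),
    (10.0, 50.4797, 50.3414),
    (15.0, 61.1200, 61.0161),
    (20.0, 69.2319, 69.0602),
]

# Absolute difference between the training and test percentages for each row.
gaps = [(pct, abs(train - test)) for pct, train, test in rows]
worst_pct, worst_gap = max(gaps, key=lambda pair: pair[1])
print(f"Largest training/test gap: {worst_gap:.4f} points at {worst_pct}%")
```

For these values, every gap is less than one percentage point, so the largest residuals account for similar shares of the error in the training data and in the test data.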

A possible pattern is that a small percentage of the residuals accounts for a large portion of the error in the data. For example, the data set for the following table contains about 4500 cases. For the MSE, the table shows that the 1% of cases with the largest residuals accounts for about 12% of the error. In such a case, the 45 cases that contribute the most error to the tree can represent the most natural opportunity to improve the tree. Finding a way to improve the fits for those cases leads to a relatively large increase in the overall performance of the tree.
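One way to find those cases outside of Minitab is to rank the residuals and keep the largest ones. The following Python sketch assumes that you have the actual and fitted values as lists. The function name `largest_residual_cases` and the rounding up of the case count are assumptions for illustration.

```python
import math

def largest_residual_cases(actual, fitted, pct):
    """Return (index, residual) pairs for the pct% of cases with the
    largest absolute residuals, largest first."""
    residuals = [(i, a - f) for i, (a, f) in enumerate(zip(actual, fitted))]
    residuals.sort(key=lambda pair: abs(pair[1]), reverse=True)
    count = math.ceil(len(residuals) * pct / 100)  # assumption: round up
    return residuals[:count]

# Hypothetical data with one poorly fit case.
actual = [12.0, 30.0, 45.0, 8.0, 60.0]
fitted = [11.0, 31.5, 20.0, 9.0, 58.0]
worst = largest_residual_cases(actual, fitted, 20)
print(worst)
```

The returned indices identify the rows to examine, for example to check for data entry errors or for predictor values that the tree does not separate well.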

This condition can also indicate that you can have greater confidence in nodes of the tree that do not have cases with the largest errors. Because most of the error comes from a small number of cases, the fits for the other cases are relatively more accurate.

17 Node CART® Regression: Length of Service versus Age at Admission, Age of First Drug Use, Arrests in Previous 30 Days, Days Waiting for Service, Previous Treatment Episodes, Years of Education, Other Stimulant Use, Planned Medication Therapy, Psychiatric Condition, Pregnant, Gender, Veteran, Alcohol Use, Cocaine Use, Marijuana Use, Heroin Use, Other Opioid Use, PCP Use, Methadone Use, Other Hallucinogen Use, Methamphetamine Use, Other Amphetamine Use, Benzodiazepine Use, Other Tranquilizer Use, Barbituate Use, Other Sedative Use, Inhalant Use, Non-Prescription Drug Use, Other Drug Use, Intravenous Drug Use, Living Arrangements, Frequency of Substance Abuse, Health Insurance, Marital Status, Ethnicity, Income Source, Primary Ingestion Route of Sub, Self-Help Attendance, Source of Payment, Race, Employment Status, Referral Source, Primary Substance of Abuse, DSM Diagnosis

Percent of Error Statistics Due to Largest Residuals

% of Largest        Training                     Test
Residuals    Count    % MSE    % MAD   % MAPE    % MSE    % MAD   % MAPE
 1.0            45  12.0662   4.4286  17.0993  11.7595   4.3601  16.9809
 2.0            90  19.6105   7.9590  27.7611  19.0639   7.8242  28.0537
 2.5           112  22.6611   9.5292  31.4313  22.0671   9.3775  31.8497
 3.0           134  25.4267  11.0245  35.1014  24.7926  10.8576  35.4683
 4.0           179  30.3473  13.8759  42.6086  29.7103  13.7003  42.7628
 5.0           223  34.5866  16.4938  49.9489  33.9523  16.3116  49.8103
 7.5           334  43.2672  22.4419  63.2850  43.0319  22.3750  63.0140
10.0           446  50.4797  27.8875  70.7239  50.3414  27.8406  70.3832
15.0           668  61.1200  37.1919  78.5216  61.0161  37.1327  78.1782
20.0           891  69.2319  45.3354  82.5577  69.0602  45.2227  82.2440