The R-squared vs Number of Terminal Nodes Plot displays the R² value for each tree. By default, the initial regression tree is the smallest tree with an R² value within 1 standard error of the value for the tree that maximizes the R² value. When the analysis uses cross-validation or a test data set, the R² value is from the validation sample. The values for the validation sample typically level off and eventually start to decline as the tree grows larger.
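The default selection rule can be sketched in a few lines of Python. This is an illustration only, not the software's implementation: the function name, the tuple layout, and all numeric values are hypothetical, and the sketch assumes that "within 1 standard error" is measured with the standard error of the tree that maximizes R².

```python
def select_tree_one_se(trees):
    """Return the smallest tree whose R-squared is within 1 standard error
    of the maximum R-squared.

    trees: list of (terminal_nodes, r_squared, std_error) tuples.
    Assumption: the standard error of the best tree sets the threshold.
    """
    best = max(trees, key=lambda t: t[1])      # tree with the largest R-squared
    threshold = best[1] - best[2]              # maximum R-squared minus 1 SE
    candidates = [t for t in trees if t[1] >= threshold]
    return min(candidates, key=lambda t: t[0])  # fewest terminal nodes wins


# Hypothetical validation results: R-squared levels off, then declines.
trees = [
    (2, 0.310, 0.02),
    (4, 0.450, 0.02),
    (7, 0.525, 0.02),
    (12, 0.535, 0.02),
    (21, 0.540, 0.02),
    (40, 0.510, 0.02),
]

selected = select_tree_one_se(trees)
print(selected[0], "terminal nodes, R-squared =", selected[1])
```

With these made-up values, the 21-node tree maximizes R², but the 7-node tree is the smallest one that falls within 1 standard error of that maximum, so it is the one the rule picks.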
Click Select Alternative Tree to open an interactive plot that includes a table of model summary statistics. Use the plot to investigate alternative trees with similar performance.
After you select a tree, investigate the distinctive terminal nodes on the tree diagram. For example, you might be interested in nodes with large means or with small standard deviations. From the detailed view, you can see the mean, standard deviation, and total counts for each node.
Right-click the tree diagram to perform the following interactions:
Nodes continue to split until the terminal nodes cannot be split into further groupings. Explore other nodes to see which variables are most interesting.
Then, Node 2 splits by Frequency of Substance Abuse and Node 8 splits by Alcohol Use. Terminal Node 17 has the cases for Planned Medication Therapy = 2, Alcohol Use = 1, and Referral Source = 3, 5, 6, 100, 300, 400, 600, 700, or 800. The researchers note that Terminal Node 17 has the highest mean, the smallest standard deviation, and the most cases.
Terminal Node 1 has the smallest mean, about 5.9, and a standard deviation of about 4.3. Because the mean is less than 2 standard deviations above 0 and the response values cannot be negative, the node statistics suggest that the data in Terminal Node 1 are probably right-skewed.
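The reasoning about Terminal Node 1 can be checked with simple arithmetic. The sketch below is a rough heuristic under stated assumptions, not a formal test of skewness: the function name and the choice of 2 standard deviations are illustrative assumptions, and only the mean (5.9) and standard deviation (4.3) come from the example.

```python
def lower_bound_check(mean, sd, k=2.0, floor=0.0):
    """Compute mean - k * sd and report whether it falls below the floor.

    For a response that cannot go below `floor`, a value of mean - k * sd
    under the floor hints that the data cannot be symmetric around the mean.
    The multiplier k = 2 is an illustrative assumption.
    """
    lower = mean - k * sd
    return lower, lower < floor


# Node statistics for Terminal Node 1 from the example above.
lower, suggests_skew = lower_bound_check(mean=5.9, sd=4.3)
print(round(lower, 1), suggests_skew)
```

Here the mean minus 2 standard deviations is about -2.7, which is below 0. Because the response cannot be negative, the spread has to come mostly from values above the mean, which is consistent with a right-skewed distribution.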
Use the relative variable importance chart to see which predictors are most important to the tree.
Important variables are primary or surrogate splitters in the tree. The variable with the highest improvement score is set as the most important variable, and the other variables are ranked accordingly. Relative variable importance standardizes the importance values for ease of interpretation. Relative importance is defined as the percent improvement with respect to the most important predictor.
Relative variable importance values range from 0% to 100%. The most important variable always has a relative importance of 100%. If a variable is not in the tree, that variable is not important.
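As a minimal sketch of the standardization described above, the following Python divides each improvement score by the largest score and expresses the result as a percentage. The function name and the improvement scores are hypothetical and chosen only to illustrate the arithmetic; the predictor names are borrowed from the example.

```python
def relative_importance(improvement):
    """Convert raw improvement scores to relative importance in percent.

    improvement: dict mapping predictor name -> total improvement score.
    A predictor that never appears as a splitter has a score of 0.
    """
    top = max(improvement.values())
    if top <= 0:
        raise ValueError("at least one predictor needs a positive score")
    return {name: 100.0 * score / top for name, score in improvement.items()}


# Hypothetical improvement scores, not taken from any real analysis.
scores = {
    "Planned Medication Therapy": 8.0,
    "Referral Source": 6.0,
    "Alcohol Use": 2.0,
    "Frequency of Substance Abuse": 0.0,
}

relative = relative_importance(scores)
for name, pct in sorted(relative.items(), key=lambda item: -item[1]):
    print(f"{name}: {pct:.1f}%")
```

The predictor with the largest score always maps to 100%, and every other predictor is expressed as a fraction of it, so the values stay between 0% and 100%.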