This command is available with the Predictive Analytics Module.

Minitab Statistical Software offers two methods to rank the importance of the variables.

The permutation method uses the out-of-bag data. For a given tree,
*j*, in the analysis, classify the out-of-bag data with the tree.
Repeat this classification for every tree in the forest. Then, compute the
margin for each row that appears at least once in the out-of-bag data. The
margin is the proportion of votes for the true class minus the maximum
proportion of votes among the other classes. For example, suppose a row is in
class A out of the available classes A, B, and C. The row appears in the
out-of-bag data 100 times with the following classifications:

- A = 87
- B = 9
- C = 4

Then the margin for that row is 0.87 - 0.09 = 0.78.

The average out-of-bag margin is the average margin for all of the rows of data.
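The margin calculation for the worked example above can be sketched in Python (the vote counts come from the example; the variable names are illustrative):

```python
# Margin for a single out-of-bag row, using the vote counts from the
# example above (true class A; available classes A, B, and C).
votes = {"A": 87, "B": 9, "C": 4}
true_class = "A"

total = sum(votes.values())
proportions = {cls: count / total for cls, count in votes.items()}

# Proportion of votes for the true class minus the maximum proportion
# among the other classes.
max_other = max(p for cls, p in proportions.items() if cls != true_class)
margin = proportions[true_class] - max_other
print(round(margin, 2))  # 0.78
```

The average out-of-bag margin is then the mean of this quantity over every row that appears at least once in the out-of-bag data.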

To determine the importance of the variable, randomly permute the values of a variable, *x*_{m}, throughout the out-of-bag data. Leave the response values and the other predictor values the same. Then, use the same steps to calculate the average margin for the permuted data, $\bar{M}_{m}$.

The importance for variable *x*_{m} comes from the difference of the two averages:

$$\text{Importance}(x_{m}) = \bar{M} - \bar{M}_{m}$$

where $\bar{M}$ is the average margin before the permutation and $\bar{M}_{m}$ is the average margin after the values of *x*_{m} are permuted. Minitab rounds values smaller than 10^{–7} to 0.

Repeat this process for every variable in the analysis. The variable with the highest importance is the most important variable. The relative variable importance scores are scaled by the importance of the most important variable:

$$\text{Relative importance}(x_{m}) = \frac{\text{Importance}(x_{m})}{\max_{j} \text{Importance}(x_{j})}$$
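As a minimal sketch of the whole permutation procedure, suppose the average margins before and after permuting each predictor have already been computed (all numbers below are hypothetical):

```python
# Permutation importance from average out-of-bag margins.
# baseline_margin is the average margin before any permutation; each
# entry of permuted_margins is the average margin after permuting that
# variable. All values are made up for illustration.
baseline_margin = 0.78
permuted_margins = {"x1": 0.52, "x2": 0.71, "x3": 0.77}

importance = {x: baseline_margin - m for x, m in permuted_margins.items()}

# Scale by the importance of the most important variable.
top = max(importance.values())
relative_importance = {x: imp / top for x, imp in importance.items()}
print(relative_importance["x1"])  # 1.0
```

A large drop in the average margin after permutation indicates a variable the forest relies on heavily.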

Any classification tree is a collection of splits. Each split provides improvement to the tree.

The following formula gives the improvement at a single node:

$$I = i(\text{parent}) - p_{L}\, i(\text{left}) - p_{R}\, i(\text{right})$$

where $i(\cdot)$ is the impurity of a node and $p_{L}$ and $p_{R}$ are the proportions of cases at the parent node that the split sends to the left and right child nodes.

The improvement for a single tree is the sum of the squared improvements for the individual nodes:

$$\text{Importance}_{\text{tree}}(x_{m}) = \sum_{k=1}^{K} I_{k}^{2}$$

where $K$ is the number of nodes that split and $I_{k} = 0$ for any node where the variable of interest is not the splitter.

The improvement for an entire forest is the sum of the squared improvements across all the trees in the forest:

$$\text{Importance}(x_{m}) = \sum_{t=1}^{T} \sum_{k=1}^{K_{t}} I_{tk}^{2}$$

where $T$ is the number of trees in the forest and $K_{t}$ is the number of nodes that split in tree $t$.

The calculation of node impurity is similar to the Gini method. For details on the Gini method, go to Node splitting methods in CART® Classification.

The variable with the highest importance is the most important variable. The relative variable importance scores are scaled by the importance of the most important variable:

$$\text{Relative importance}(x_{m}) = \frac{\text{Importance}(x_{m})}{\max_{j} \text{Importance}(x_{j})}$$
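A minimal sketch of these sums, assuming the per-node improvements are already available (all numbers are hypothetical):

```python
# Gini-based importance for the forest: sum the squared node improvements
# across all split nodes of all trees. Nodes where the variable of
# interest is not the splitter contribute 0, so they are omitted here.
node_improvements = {
    "x1": [[0.12, 0.05], [0.08]],  # one inner list per tree
    "x2": [[0.03], []],
}

importance = {
    x: sum(i ** 2 for tree in trees for i in tree)
    for x, trees in node_improvements.items()
}

# Scale by the importance of the most important variable.
top = max(importance.values())
relative_importance = {x: imp / top for x, imp in importance.items()}
print(relative_importance["x1"])  # 1.0
```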

Minitab calculates the average of the negative log-likelihood value
when the response is binary. The calculations depend on the validation method.

The calculation uses the out-of-bag samples from every tree in the forest. Because of the nature of out-of-bag samples, expect to use different combinations of trees to find the contribution to the log-likelihood for each row in the data.

For a given tree in the forest, a class vote for a row in the out-of-bag data is the predicted class for the row from the single tree. The predicted class for a row in out-of-bag data is the class with the highest vote across all trees in the forest. The predicted class probability for a row in the out-of-bag data is the ratio of the number of votes for the class and the total votes for the row. The likelihood calculations follow from these probabilities:

where

and
is the calculated event probability for row
*i* in the out-of-bag data.

Term | Description |
---|---|
n_{Out-of-bag} | number of rows that are out-of-bag at least once |
y_{i, Out-of-bag} | binary response value of case i in the out-of-bag data; y_{i, Out-of-bag} = 1 for the event class, and 0 otherwise |
$\hat{\pi}_{i}$ | calculated event probability for row i in the out-of-bag data |
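Under the definitions above, the out-of-bag calculation can be sketched as follows (the vote counts and responses are invented for illustration):

```python
import math

# Average negative log-likelihood from out-of-bag votes.
event_votes = [40, 12, 55]   # votes for the event class, per row (hypothetical)
total_votes = [50, 60, 70]   # total out-of-bag votes, per row (hypothetical)
y = [1, 0, 1]                # binary response: 1 = event class

log_likelihood = 0.0
for yi, ev, tot in zip(y, event_votes, total_votes):
    p = ev / tot             # predicted event probability for the row
    log_likelihood += yi * math.log(p) + (1 - yi) * math.log(1 - p)

mean_neg_log_likelihood = -log_likelihood / len(y)
```

Each row's probability comes from a different subset of trees, which is why the out-of-bag calculation cannot reuse one fixed committee of trees for every row.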

For a given tree in the forest, a class vote for a row in the test set is the predicted class for the row from the single tree. The predicted class for a row in the test set is the class with the highest vote across all trees in the forest. The predicted class probability for a row in the test set is the ratio of the number of votes for the class to the total votes for the row. The likelihood calculations follow from these probabilities:

$$\text{mean}(-\text{log-likelihood}) = -\frac{1}{n_{\text{Test}}} \sum_{i=1}^{n_{\text{Test}}} \left[ y_{i,\text{Test}} \ln\left(\hat{\pi}_{i}\right) + \left(1 - y_{i,\text{Test}}\right) \ln\left(1 - \hat{\pi}_{i}\right) \right]$$

where

Term | Description |
---|---|
n_{Test} | sample size of the test set |
y_{i, Test} | binary response value of case i in the test set; y_{i, Test} = 1 for the event class, and 0 otherwise |
$\hat{\pi}_{i}$ | predicted event probability for case i in the test set |

The Model Summary table includes the area under the ROC curve when
the response is binary. The ROC curve plots the true positive rate (TPR), also
known as power, on the y-axis, and the false positive rate (FPR), also known as
type I error, on the x-axis. Values of the area under the ROC curve typically
range from 0.5 to 1.

The area under the curve is a summation of areas of trapezoids:

$$A = \sum_{i=1}^{k} \frac{y_{i} + y_{i-1}}{2} \left( x_{i} - x_{i-1} \right)$$

where *k* is the number of distinct event probabilities and (*x*_{0}, *y*_{0}) is the point (0, 0).

To compute the area for a curve from out-of-bag data or a test set, use the points from the corresponding curve.

Term | Description |
---|---|
TPR | true positive rate |
FPR | false positive rate |
TP | true positive, events that were correctly assessed |
FN | false negative, events that were incorrectly assessed |
P | number of actual positive events |
FP | false positive, nonevents that were incorrectly assessed |
N | number of actual negative events |
FNR | false negative rate |
TNR | true negative rate |

For example, suppose your results have 4 distinct fitted values with the
following coordinates on the ROC curve:

x (false positive rate) | y (true positive rate) |
---|---|
0.0923 | 0.3051 |
0.4154 | 0.7288 |
0.7538 | 0.9322 |
1 | 1 |

Then the area under the ROC curve is given by the following calculation:

$$A = \frac{0.3051 + 0}{2}(0.0923 - 0) + \frac{0.7288 + 0.3051}{2}(0.4154 - 0.0923) + \frac{0.9322 + 0.7288}{2}(0.7538 - 0.4154) + \frac{1 + 0.9322}{2}(1 - 0.7538) \approx 0.7000$$
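The trapezoid sum can be checked numerically; a short sketch using the four points from the table, prefixed with (0, 0):

```python
# Area under the ROC curve by the trapezoid rule, using the example
# points (x = false positive rate, y = true positive rate).
points = [(0.0, 0.0), (0.0923, 0.3051), (0.4154, 0.7288),
          (0.7538, 0.9322), (1.0, 1.0)]

area = sum(
    (y1 + y0) / 2 * (x1 - x0)
    for (x0, y0), (x1, y1) in zip(points, points[1:])
)
print(round(area, 4))  # 0.7
```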

Minitab calculates a confidence interval for the area under the
Receiver Operating Characteristic curve when the response is binary.

The following interval gives the upper and lower bounds for the confidence interval:

$$\left( A - z_{0.975}\, \widehat{\text{SE}}(A),\; A + z_{0.975}\, \widehat{\text{SE}}(A) \right)$$

The computation of the standard error of the area under the ROC curve, $\widehat{\text{SE}}(A)$, comes from Salford Predictive Modeler^{®}. For general information
about estimation of the variance of the area under the ROC curve, see the
following references:

Engelmann, B. (2011). Measures of a rating's discriminative power: Applications and limitations. In B. Engelmann & R. Rauhmeier (Eds.), The Basel II Risk Parameters: Estimation, Validation, Stress Testing - With Applications to Loan Risk Management (2nd ed.). Heidelberg; New York: Springer. doi:10.1007/978-3-642-16114-8

Cortes, C., & Mohri, M. (2005). Confidence intervals for the area under the ROC curve. Advances in Neural Information Processing Systems, 305-312.

Feng, D., Cortese, G., & Baumgartner, R. (2017). A comparison of confidence/credible interval methods for the area under the ROC curve for continuous diagnostic tests with small sample size. Statistical Methods in Medical Research, 26(6), 2603-2621. doi:10.1177/0962280215602040

Term | Description |
---|---|
A | area under the ROC curve |
z_{0.975} | 0.975 percentile of the standard normal distribution |
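A sketch of the interval, assuming an estimated area and its standard error are already in hand (both values are made up for illustration):

```python
from statistics import NormalDist

area = 0.7000      # estimated area under the ROC curve (hypothetical)
se_area = 0.0350   # standard error of the area (hypothetical)

# 0.975 percentile of the standard normal distribution (about 1.96).
z = NormalDist().inv_cdf(0.975)

lower = area - z * se_area
upper = area + z * se_area
print(round(lower, 4), round(upper, 4))  # 0.6314 0.7686
```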

Minitab displays lift in the model summary table when the response
is binary. The lift in the model summary table is the cumulative lift for 10%
of the data.

To see general calculations for cumulative lift, go to Methods and formulas for the cumulative lift chart for Random Forests® Classification.

The following equation gives the misclassification rate:

$$\text{Misclassification rate} = \frac{\text{misclassed count}}{\text{total count}}$$

The misclassed count is the number of rows in the out-of-bag data where their predicted classes are different from their true classes. Total count is the total number of rows in the out-of-bag data.

For validation with a test data set, the misclassed count is the sum of misclassifications in the test set. Total count is the number of rows in the test data set.
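For example, a direct sketch of the rate (the class labels are invented for illustration):

```python
# Misclassification rate: misclassified rows over total rows.
true_classes = ["A", "B", "A", "C", "B", "A"]
predicted    = ["A", "B", "C", "C", "A", "A"]

misclassed = sum(t != p for t, p in zip(true_classes, predicted))
rate = misclassed / len(true_classes)
print(misclassed, round(rate, 4))  # 2 0.3333
```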