Example of Fit Model for TreeNet® Regression

Note

This command is available with the Predictive Analytics Module.

A team of researchers wants to use data about a borrower and the location of a property to predict the amount of a mortgage. Variables include the income, race, and gender of the borrower as well as the census tract location of the property, and other information about the borrower and the type of property.

After initial exploration with CART® Regression to identify the important predictors, the team now considers TreeNet® Regression as a necessary follow-up step. The researchers hope to gain more insight into the relationships between the response and the important predictors and predict for new observations with greater accuracy.

These data were adapted from a public data set containing information on federal home loan bank mortgages. The original data are from fhfa.gov.

  1. Open the sample data set PurchasedMortgages.MTW.
  2. Choose Predictive Analytics Module > TreeNet® Regression > Fit Model.
  3. In Response, enter Loan Amount.
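  4. In Continuous predictors, enter Annual Income-Area Income (the range of columns from Annual Income through Area Income).
  5. In Categorical predictors, enter First Time Home Buyer-Core Based Statistical Area (the range of columns from First Time Home Buyer through Core Based Statistical Area).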
  6. Click Validation.
  7. In Validation method, select K-fold cross-validation.
  8. In Number of folds (K), enter 3.
  9. Click OK in each dialog box.
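
The validation settings in steps 6 through 8 can be pictured with a short Python sketch. This is a generic illustration of how 3-fold cross-validation partitions the rows, not Minitab's implementation; the function name `k_fold_indices` and the random seed are assumptions made for the example. The row count of 4372 comes from the Method table in the output.

```python
import random

def k_fold_indices(n_rows, k, seed=0):
    """Shuffle row indices and deal them into k folds of near-equal size."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, starting at position i.
    return [indices[i::k] for i in range(k)]

folds = k_fold_indices(n_rows=4372, k=3)

for fold_number, test_rows in enumerate(folds, start=1):
    # Each fold serves once as test data; the remaining folds are training data.
    train_rows = [row for fold in folds if fold is not test_rows for row in fold]
    print(fold_number, len(train_rows), len(test_rows))
```

Every row appears in exactly one test fold, so the test statistics are computed on rows that the corresponding model did not see during training.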

Interpret the results

For this analysis, Minitab grows 300 trees and the optimal number of trees is 300. Because the optimal number of trees is close to the maximum number of trees that the model grows, the researchers repeat the analysis with more trees.

Model Summary

Total predictors           34
Important predictors       19
Number of trees grown      300
Optimal number of trees    300

Statistics                                Training            Test
R-squared                                   94.02%          84.97%
Root mean squared error (RMSE)          32334.5587      51227.9431
Mean squared error (MSE)               1.04552E+09     2.62430E+09
Mean absolute deviation (MAD)           22740.1020      35974.9695
Mean absolute percent error (MAPE)          0.1238          0.1969
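
The statistics in the Model Summary table can be related to each other with a small Python sketch. The formulas below are the common textbook definitions, and it is an assumption that Minitab computes each statistic in exactly this way; the function name `fit_statistics` and the four loan amounts are hypothetical. The last lines check that the RMSE values in the table are the square roots of the MSE values, which the table does confirm.

```python
import math

def fit_statistics(actual, fitted):
    """Return common accuracy statistics for fitted values (textbook definitions)."""
    n = len(actual)
    residuals = [a - f for a, f in zip(actual, fitted)]
    mean_actual = sum(actual) / n
    sse = sum(r * r for r in residuals)                # sum of squared errors
    sst = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    mse = sse / n
    return {
        "r_squared": 1 - sse / sst,
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mad": sum(abs(r) for r in residuals) / n,
        # MAPE as a fraction, matching the scale of the values in the table
        "mape": sum(abs(r / a) for r, a in zip(residuals, actual)) / n,
    }

# Hypothetical loan amounts, for illustration only.
actual = [100000.0, 200000.0, 300000.0, 400000.0]
fitted = [110000.0, 190000.0, 320000.0, 380000.0]
stats = fit_statistics(actual, fitted)
print(stats)

# Consistency check on the table: RMSE is the square root of MSE.
print(math.sqrt(1.04552e9))  # training, close to 32334.5587
print(math.sqrt(2.62430e9))  # test, close to 51227.9431
```

The gap between the training and test columns is expected, because the test statistics come from rows that were held out of each cross-validation fit.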

Example with 500 trees

  1. After the model summary table, click Tune Hyperparameters to Identify a Better Model.
  2. In Number of trees, enter 500.
  3. Click Display Results.

Interpret the results

For this analysis, Minitab grows 500 trees, and the optimal number of trees for the combination of hyperparameters with the best value of the accuracy criterion is 500. The best combination uses a subsample fraction of 0.7 instead of the 0.5 in the original analysis. The learning rate is 0.0437 instead of the 0.04372 in the original analysis.

Examine both the Model Summary table and the R-squared vs Number of Trees Plot. When the number of trees is 500, the R-squared value is 86.79% for the test data and 96.41% for the training data. These results show improvement over a traditional regression analysis and a CART® Regression.

Method

Loss function                                      Squared error
Criterion for selecting optimal number of trees    Maximum R-squared
Model validation                                   3-fold cross-validation
Learning rate                                      0.04372
Subsample fraction                                 0.5
Maximum terminal nodes per tree                    6
Minimum terminal node size                         3
Number of predictors selected for node splitting   Total number of predictors = 34
Rows used                                          4372
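
To show what the learning rate and subsample fraction in the Method table control, here is a minimal Python sketch of generic stochastic gradient boosting with a squared error loss. It is not the TreeNet® algorithm: each tree is a two-node stump on a single predictor rather than a tree with up to 6 terminal nodes, and the function names and the toy data are assumptions made for the example. The learning rate, subsample fraction, and number of trees are taken from the output above.

```python
import random

def fit_stump(xs, residuals):
    """Fit a two-node tree: the single split on x that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # all sampled x values are identical: predict the mean
        mean = sum(residuals) / len(residuals)
        return lambda x: mean
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_trees, learning_rate, subsample_fraction, seed=0):
    """Grow trees one at a time, each on the residuals of a random subsample."""
    rng = random.Random(seed)
    base = sum(ys) / len(ys)
    fitted = [base] * len(xs)
    sample_size = max(2, int(subsample_fraction * len(xs)))
    trees = []
    for _ in range(n_trees):
        rows = rng.sample(range(len(xs)), sample_size)
        # For squared error loss, the negative gradient is the residual.
        tree = fit_stump([xs[i] for i in rows], [ys[i] - fitted[i] for i in rows])
        trees.append(tree)
        # The learning rate shrinks the contribution of each new tree.
        fitted = [f + learning_rate * tree(x) for f, x in zip(fitted, xs)]
    return lambda x: base + learning_rate * sum(tree(x) for tree in trees)

def mean_squared_error(predict, xs, ys):
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Toy data, for illustration only.
xs = [float(x) for x in range(20)]
ys = [x * x for x in xs]
baseline = sum(ys) / len(ys)
predict = boost(xs, ys, n_trees=300, learning_rate=0.04372, subsample_fraction=0.5)
print(mean_squared_error(lambda x: baseline, xs, ys))
print(mean_squared_error(predict, xs, ys))
```

A smaller learning rate makes each tree contribute less, so more trees are needed to reach the same fit. That is consistent with the slowest learning rate in the tuning output still performing poorly at 500 trees.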

Response Information

Mean    StDev   Minimum  Q1      Median  Q3      Maximum
235217  132193  23800    136000  208293  300716  1190000

TreeNet® Regression with Hyperparameter Tuning: Loan Amount vs Annual Income, Income Ratio, Front End Ratio, Back End Ratio, Number of Borrowers, Age, Co-Borrower Age, Tract Minority Percent, Tract Income, Local Income, Area Income, First Time Home Buyer, Occupancy Code, Self-Employed, Co-Borrower Race 4, Co-Borrower Race 5, Loan Purpose, Gender, Number of Units, Ethnicity, Co-Borrower Race 3, Co-Borrower Gender, Race 2, Co-Borrower Ethnicity, Credit Score, Co-Borrower Credit Score, Race, Co-Borrower Race 2, Co-Borrower Race, Property Type, Federal District, State Code, County Code, Core Based Statistical Area

Method

Loss function                                      Squared error
Criterion for selecting optimal number of trees    Maximum R-squared
Model validation                                   3-fold cross-validation
Learning rate                                      0.001, 0.0437, 0.1
Subsample fraction                                 0.5, 0.7
Maximum terminal nodes per tree                    6
Minimum terminal node size                         3
Number of predictors selected for node splitting   Total number of predictors = 34
Rows used                                          4372

Response Information

Mean    StDev   Minimum  Q1      Median  Q3      Maximum
235217  132193  23800    136000  208293  300716  1190000

Optimization of Hyperparameters

        Optimal       Test  Test Mean                        Maximum
         Number  R-squared   Absolute  Learning  Subsample  Terminal
Model  of Trees        (%)  Deviation      Rate   Fraction     Nodes
1           500      36.43    82617.1    0.0010        0.5         6
2           495      85.87    34560.5    0.0437        0.5         6
3           495      85.63    34889.3    0.1000        0.5         6
4           500      36.86    82145.0    0.0010        0.7         6
5*          500      86.79    33052.6    0.0437        0.7         6
6           451      86.67    33262.3    0.1000        0.7         6

* Optimal model has maximum R-squared. Output for the optimal model follows.
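
The selection rule in the footnote can be reproduced with a few lines of Python. The rows are transcribed from the Optimization of Hyperparameters table; the dictionary keys are names chosen for this sketch, not Minitab identifiers.

```python
from itertools import product

# Rows transcribed from the Optimization of Hyperparameters table.
results = [
    {"model": 1, "optimal_trees": 500, "test_r_squared": 36.43, "test_mad": 82617.1,
     "learning_rate": 0.001, "subsample_fraction": 0.5},
    {"model": 2, "optimal_trees": 495, "test_r_squared": 85.87, "test_mad": 34560.5,
     "learning_rate": 0.0437, "subsample_fraction": 0.5},
    {"model": 3, "optimal_trees": 495, "test_r_squared": 85.63, "test_mad": 34889.3,
     "learning_rate": 0.1, "subsample_fraction": 0.5},
    {"model": 4, "optimal_trees": 500, "test_r_squared": 36.86, "test_mad": 82145.0,
     "learning_rate": 0.001, "subsample_fraction": 0.7},
    {"model": 5, "optimal_trees": 500, "test_r_squared": 86.79, "test_mad": 33052.6,
     "learning_rate": 0.0437, "subsample_fraction": 0.7},
    {"model": 6, "optimal_trees": 451, "test_r_squared": 86.67, "test_mad": 33262.3,
     "learning_rate": 0.1, "subsample_fraction": 0.7},
]

# The six models are every combination of the two subsample fractions
# and the three learning rates listed in the Method table.
grid = list(product([0.5, 0.7], [0.001, 0.0437, 0.1]))
evaluated = [(row["subsample_fraction"], row["learning_rate"]) for row in results]
print(evaluated == grid)

# The criterion is maximum R-squared on the test data.
best = max(results, key=lambda row: row["test_r_squared"])
print(best["model"], best["learning_rate"], best["subsample_fraction"])
```

In this table the model with the highest test R-squared also has the lowest test mean absolute deviation, so both criteria point to model 5.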

Model Summary

Total predictors           34
Important predictors       19
Number of trees grown      300
Optimal number of trees    300

Statistics                                Training            Test
R-squared                                   94.02%          84.97%
Root mean squared error (RMSE)          32334.5587      51227.9431
Mean squared error (MSE)               1.04552E+09     2.62430E+09
Mean absolute deviation (MAD)           22740.1020      35974.9695
Mean absolute percent error (MAPE)          0.1238          0.1969

The Relative Variable Importance graph plots the predictors in order of their effect on model improvement when splits are made on a predictor over the sequence of trees. The most important predictor variable is Core Based Statistical Area. If the importance of the top predictor variable, Core Based Statistical Area, is 100%, then the next most important variable, Annual Income, has a contribution of 92.8%. This means the annual income of the borrower is 92.8% as important as the geographical location of the property.
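
The scaling to 100% can be sketched in Python as follows. The raw improvement scores are hypothetical and were chosen only so that the ratio for Annual Income matches the 92.8% reported above; the third predictor and the function name `relative_importance` are also assumptions for illustration.

```python
def relative_importance(raw_scores):
    """Express each score as a percentage of the largest score, in descending order."""
    top = max(raw_scores.values())
    ranked = sorted(raw_scores.items(), key=lambda item: item[1], reverse=True)
    return {name: 100.0 * score / top for name, score in ranked}

# Hypothetical raw improvement scores, for illustration only.
raw_scores = {
    "Annual Income": 232.0,
    "Core Based Statistical Area": 250.0,
    "Hypothetical predictor": 125.0,
}

importance = relative_importance(raw_scores)
for name, percent in importance.items():
    print(f"{name}: {percent:.1f}%")
```

Because every value is divided by the largest score, the top predictor is always 100% and the other values are read relative to it.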

The scatterplot of fitted loan amounts versus actual loan amounts shows the relationship between the fitted and actual values for both the training data and the test data. You can hover over the points on the graph to see the plotted values more easily. In this example, the points fall close to the reference line of y=x.

Use the partial dependence plots to gain insight into how the important variables or pairs of variables affect the fitted response values. The partial dependence plots show whether the relationship between the response and a variable is linear, monotonic, or more complex.
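
A one-predictor partial dependence curve can be computed with the generic recipe sketched below in Python: hold one predictor at a grid value for every row, average the fitted values, and repeat for each grid value. This is an illustration of the general technique and not Minitab's code. The toy model, its coefficients, the column names, and the two rows are all hypothetical.

```python
def partial_dependence(model, rows, feature, grid):
    """Average the model's predictions with `feature` forced to each grid value."""
    curve = []
    for value in grid:
        predictions = [model({**row, feature: value}) for row in rows]
        curve.append((value, sum(predictions) / len(predictions)))
    return curve

def toy_model(row):
    """Hypothetical fitted model: the income effect flattens above 300000."""
    income = row["annual_income"]
    income_effect = 0.8 * min(income, 300000) + 0.2 * max(income - 300000, 0)
    return 50000 + income_effect + 1000 * row["front_end_ratio"]

# Hypothetical rows, for illustration only.
rows = [
    {"annual_income": 90000, "front_end_ratio": 20},
    {"annual_income": 150000, "front_end_ratio": 30},
]

grid = [100000, 200000, 300000, 400000, 500000]
curve = partial_dependence(toy_model, rows, "annual_income", grid)
for income, fitted in curve:
    print(income, round(fitted))
```

Averaging over the rows removes the effect of the other predictors, so the curve isolates how the fitted response changes with the chosen predictor alone.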

The first plot illustrates the fitted loan amount for each core based statistical area. Because there are so many data points, you can hover over individual data points to see the specific x- and y-values. For instance, the highest point on the right side of the graph is for core area number 41860, and the fitted loan amount is approximately $378,069.

The second plot illustrates that the fitted loan amount increases as the annual income increases. After annual income reaches $300,000, the fitted loan amount increases at a slower rate.

The third plot illustrates that the fitted loan amount increases as the front end ratio increases.

The fourth plot illustrates the fitted loan amount for each census county code. As with the first plot, you can hover over certain data points to get more information. Click Select More Predictors to Plot to produce plots for other variables.