Example of Discover Best Model (Continuous Response)

Note

This command is available with the Predictive Analytics Module.

Search for the best type of model

Researchers for a healthcare system collect data from their regional medical clinics. In particular, the research team is interested in data from doctors' initial examinations of sick patients. At the end of the initial examinations, the doctors assign each patient a score for the severity of their illness. The researchers want to develop a short questionnaire to help prioritize the sickest patients before examination by a doctor. Through consultation with subject matter experts and initial exploration of the data, the team selects 8 variables to predict the severity score. The researchers want to determine the best type of model to predict the severity score before they further refine the model.

The researchers use Discover Best Model (Continuous Response) to compare the predictive performance of 5 types of models: multiple regression, TreeNet®, Random Forests®, CART®, and MARS®. The team plans to further explore the type of model with the best predictive performance.

  1. Open the sample data, Illness.mtw.
  2. Choose Predictive Analytics Module > Automated Machine Learning > Discover Best Model (Continuous Response).
  3. In Response, enter 'Illness Severity Score'.
  4. In Continuous predictors, enter 'Number of Symptoms Now'.
  5. In Categorical predictors, enter 'High Production of Phlegm'-'Limits on Normal Activities'.
  6. Click OK.

Interpret the results

The Model Selection table compares the performance of the types of models. The multiple regression model has the maximum value of R². The results that follow are for the best multiple regression model.
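The selection rule can be sketched in a few lines of Python. This is only an illustration of "pick the type with the maximum R-squared", using the values reported in the Model Selection table of this example; the dictionary layout and the function name `best_by_r_squared` are assumptions made for the sketch, not part of Minitab.

```python
# (R-squared in %, mean absolute deviation) copied from the Model Selection table.
model_selection = {
    "Multiple Regression": (91.23, 3.1011),
    "MARS": (91.05, 3.1604),
    "TreeNet": (90.90, 3.1613),
    "Random Forests": (89.93, 3.3248),
    "CART": (86.11, 3.9369),
}


def best_by_r_squared(table):
    """Return the model type with the largest R-squared value."""
    return max(table, key=lambda name: table[name][0])


best = best_by_r_squared(model_selection)

# Rank all types from highest to lowest R-squared for comparison.
ranking = sorted(model_selection, key=lambda name: model_selection[name][0], reverse=True)
print(best)
print(ranking)
```

In this example the gaps between the top three types are small (under half a percentage point of R-squared), which is one reason the researchers also look at an alternative model later.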

To determine whether the association between the response and each term in the model is statistically significant, compare the p-value for the term to your significance level to assess the null hypothesis. The null hypothesis is that there is no association between the term and the response. Usually, a significance level (denoted as α or alpha) of 0.05 works well. A significance level of 0.05 indicates a 5% risk of concluding that an association exists when there is no actual association. In the Coefficients table for these results, one term has a p-value that is greater than 0.05: the main effect for Severe Shortness of Breath (p = 0.424). This predictor also appears in two interaction terms that have p-values less than 0.05. When the researchers explore other multiple regression models, they will use model performance metrics and residual plots to explore the effects of including this term in the model.
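The comparison of p-values to the significance level can be sketched as follows. The p-values are copied from the Coefficients table in this example (values shown as 0.000 are rounded); the variable names are assumptions made for illustration.

```python
ALPHA = 0.05  # significance level

# P-values as reported (rounded to three decimals) in the Coefficients table.
p_values = {
    "Number of Symptoms Now": 0.000,
    "High Production of Phlegm": 0.000,
    "Severe Shortness of Breath": 0.424,
    "Severe Headache": 0.000,
    "Severe Sleep Disturbance": 0.000,
    "Generally Feeling Very Bad": 0.000,
    "Limits on Normal Activities": 0.000,
    "Number of Symptoms Now*Severe Shortness of Breath": 0.005,
    "Number of Symptoms Now*Severe Chest Pain": 0.000,
    "Severe Shortness of Breath*Severe Sleep Disturbance": 0.011,
    "Generally Feeling Very Bad*Limits on Normal Activities": 0.009,
}

# Terms for which the null hypothesis of no association is not rejected.
not_significant = [term for term, p in p_values.items() if p > ALPHA]
print(not_significant)
```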

The Model Summary table shows that the training R² and the test R² are both approximately 91%. The test root mean squared error (RMSE), which represents how far the data values fall from the fitted values, is approximately 4. Because the RMSE is small on the scale of the illness score, the researchers are optimistic that a small number of questions is enough information to help prioritize patients.
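As a rough cross-check, the training statistics can be recomputed from quantities reported elsewhere in this example. The sketch below assumes the standard definitions R² = 1 − SSE/SST, adjusted R² = 1 − (SSE/DF error)/(SST/DF total), and RMSE = √MSE, with the sums of squares and degrees of freedom taken from the Analysis of Variance table.

```python
import math

# Error and Total rows of the Analysis of Variance table.
sse, df_error = 26498.0, 1534
sst, df_total = 306379.0, 1545

# MSE values from the Model Summary table.
train_mse, test_mse = 17.2741, 17.3714

r2 = 1 - sse / sst
r2_adj = 1 - (sse / df_error) / (sst / df_total)
train_rmse = math.sqrt(train_mse)
test_rmse = math.sqrt(test_mse)

print(f"R-squared: {r2:.2%}, adjusted: {r2_adj:.2%}")
print(f"RMSE: training {train_rmse:.4f}, test {test_rmse:.4f}")
```

The recomputed values agree with the training R-squared, adjusted R-squared, and RMSE figures in the Model Summary table to the precision shown there.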

The table of fits and diagnostics for unusual observations shows data points that do not follow the proposed regression equation well. These are the fits and diagnostics from the full data set.

The letter R indicates a point with a large residual. Examine the unusual data points to see predictor values where the model might not fit well. The letter X indicates a point with high leverage. Points with high leverage have unusual predictor combinations relative to the rest of the data set.
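A minimal sketch of how the R flag relates to the table values, using three rows copied from the Fits and Diagnostics table. The cutoff of 2 in absolute value for a "large" standardized residual is an assumption here; it is consistent with the flagged rows in this example. The X flag depends on leverage values that the table does not show, so it is not reproduced.

```python
# (observation, observed score, fit, standardized residual) from the table.
rows = [
    (11, 66.670, 56.757, 2.40),
    (213, 35.710, 34.970, 0.18),
    (893, 28.570, 48.860, -4.92),
]


def residual(observed, fit):
    """Residual = observed response minus fitted value."""
    return round(observed - fit, 3)


def has_large_residual(std_resid, cutoff=2.0):
    """Assumed rule: flag R when |standardized residual| exceeds the cutoff."""
    return abs(std_resid) > cutoff


for obs, observed, fit, std_resid in rows:
    flag = "R" if has_large_residual(std_resid) else ""
    print(obs, residual(observed, fit), std_resid, flag)
```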

Points with large residuals and points with high leverage are potentially influential. For example, the inclusion or exclusion of an influential point can change whether a coefficient is statistically significant or not. If you see an influential observation, determine whether the observation is a data-entry or measurement error. If the observation is not an error, determine how much the observation influences the results. When the researchers further explore the model, they will fit the model with and without the observations. Then, they will compare the coefficients, p-values, R², and other model information. If the model changes significantly when you remove the influential observation, examine the model further to determine if you have incorrectly specified the model. You may need to gather more data to resolve the issue.
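The with-and-without comparison can be illustrated on a toy problem. The sketch below fits a simple least-squares line to hypothetical data (not the Illness.mtw data) in which the last point has an unusual predictor value, then refits without it. The helper `fit_line` is an assumption for illustration only.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical data: the first five points lie on y = 1 + 2x;
# the last point has high leverage (x = 12) and does not follow the line.
xs = [1, 2, 3, 4, 5, 12]
ys = [3, 5, 7, 9, 11, 10]

intercept_with, slope_with = fit_line(xs, ys)
intercept_without, slope_without = fit_line(xs[:-1], ys[:-1])

print(f"with point:    slope {slope_with:.3f}")
print(f"without point: slope {slope_without:.3f}")
```

A large change in the slope between the two fits, as in this toy example, is the kind of evidence that an observation is influential.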

The scatterplot of the fitted illness scores versus actual illness scores shows the relationship between the fitted and actual values for both the training and test data. The points fall close to the reference line of y = x, which indicates that the model fits the data well.

Method

Fit a regression model with linear terms and terms of order 2.
Fit 6 TreeNet® Regression model(s) using squared loss function.
Fit 3 Random Forests® Regression model(s) with bootstrap sample size same as training data size of 1546.
Fit an optimal CART® Regression model.
Fit an optimal MARS® Regression model.
Select the model with maximum R-squared from 5-fold cross-validation.
Total number of rows: 1546
Rows used for regression model: 1546
Rows used for tree-based models: 1546
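
The 5-fold cross-validation named in the method can be sketched as a row partition. How Minitab assigns rows to folds is not described here, so the random shuffle and the seed below are assumptions; the sketch only shows the general idea that each of the 1546 rows is held out exactly once.

```python
import random


def k_fold_indices(n_rows, k=5, seed=0):
    """Shuffle row indices and deal them into k roughly equal folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


folds = k_fold_indices(1546, k=5)

for i, test_rows in enumerate(folds):
    held_out = set(test_rows)
    train_rows = [row for row in range(1546) if row not in held_out]
    # A model would be fit on train_rows and scored on test_rows here.
    print(f"fold {i + 1}: train {len(train_rows)} rows, test {len(test_rows)} rows")
```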

Response Information

Mean     StDev    Minimum  Q1     Median  Q3     Maximum
31.0110  14.0820  0        19.05  30.95   40.48  76.19

Model Selection

Best Model within Type  R-squared (%)  Mean Absolute Deviation
Multiple Regression*    91.23          3.1011
MARS®                   91.05          3.1604
TreeNet®                90.90          3.1613
Random Forests®         89.93          3.3248
CART®                   86.11          3.9369

* Best model across all model types with maximum R-squared. Output for the best model follows.

Forward Selection of Terms with Validation for Best Multiple Regression Model

Selected terms: Number of Symptoms Now, High Production of Phlegm, Severe Shortness of
     Breath, Severe Headache, Severe Sleep Disturbance, Generally Feeling Very Bad, Limits on
     Normal Activities, Number of Symptoms Now*Severe Shortness of Breath, Number of Symptoms
     Now*Severe Chest Pain, Severe Shortness of Breath*Severe Sleep Disturbance, Generally Feeling
     Very Bad*Limits on Normal Activities
 

Regression Equation

Illness Severity Score = 1.241
    + 2.5386 Number of Symptoms Now
    + 0.0 High Production of Phlegm_0
    + 3.900 High Production of Phlegm_1
    + 0.0 Severe Shortness of Breath_0
    + 0.94 Severe Shortness of Breath_1
    + 0.0 Severe Headache_0
    + 4.094 Severe Headache_1
    + 0.0 Severe Sleep Disturbance_0
    + 3.884 Severe Sleep Disturbance_1
    + 0.0 Generally Feeling Very Bad_0
    + 3.473 Generally Feeling Very Bad_1
    + 0.0 Limits on Normal Activities_0
    + 3.140 Limits on Normal Activities_1
    + 0.0 Number of Symptoms Now*Severe Shortness of Breath_0
    + 0.373 Number of Symptoms Now*Severe Shortness of Breath_1
    + 0.0 Number of Symptoms Now*Severe Chest Pain_0
    + 0.4765 Number of Symptoms Now*Severe Chest Pain_1
    + 0.0 Severe Shortness of Breath*Severe Sleep Disturbance_0 0
    + 0.0 Severe Shortness of Breath*Severe Sleep Disturbance_0 1
    + 0.0 Severe Shortness of Breath*Severe Sleep Disturbance_1 0
    + 1.337 Severe Shortness of Breath*Severe Sleep Disturbance_1 1
    + 0.0 Generally Feeling Very Bad*Limits on Normal Activities_0 0
    + 0.0 Generally Feeling Very Bad*Limits on Normal Activities_0 1
    + 0.0 Generally Feeling Very Bad*Limits on Normal Activities_1 0
    + 1.372 Generally Feeling Very Bad*Limits on Normal Activities_1 1
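
The regression equation can be written as a small function for scoring a questionnaire. This is a sketch, not Minitab output: the function name and argument names are hypothetical, and it assumes each categorical predictor is coded 1 when the patient reports the symptom and 0 otherwise. Terms with a coefficient of 0.0 are omitted.

```python
def fitted_severity(n_symptoms, phlegm=0, short_breath=0, headache=0,
                    sleep=0, feel_bad=0, limits=0, chest_pain=0):
    """Fitted illness severity score from the regression equation above."""
    score = 1.241 + 2.5386 * n_symptoms
    # Main effects of the categorical predictors (level 1).
    score += 3.900 * phlegm
    score += 0.94 * short_breath
    score += 4.094 * headache
    score += 3.884 * sleep
    score += 3.473 * feel_bad
    score += 3.140 * limits
    # Interaction terms; each is nonzero only when the indicator(s) equal 1.
    score += 0.373 * n_symptoms * short_breath
    score += 0.4765 * n_symptoms * chest_pain
    score += 1.337 * short_breath * sleep
    score += 1.372 * feel_bad * limits
    return score


# A patient with 5 symptoms and none of the listed conditions.
baseline = fitted_severity(5)
# The same patient who also reports feeling very bad and limits on activities.
with_both = fitted_severity(5, feel_bad=1, limits=1)
print(round(baseline, 3), round(with_both, 3))
```

Note that Severe Chest Pain enters the equation only through its interaction with Number of Symptoms Now, so `chest_pain` has no main-effect term in the function.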

Coefficients

Term                                                    Coef    SE Coef  T-Value  P-Value  VIF
Constant                                                1.241   0.385    3.22     0.001
Number of Symptoms Now                                  2.5386  0.0593   42.81    0.000    1.95
High Production of Phlegm
  1                                                     3.900   0.225    17.35    0.000    1.10
Severe Shortness of Breath
  1                                                     0.94    1.18     0.80     0.424    23.23
Severe Headache
  1                                                     4.094   0.253    16.18    0.000    1.25
Severe Sleep Disturbance
  1                                                     3.884   0.284    13.69    0.000    1.73
Generally Feeling Very Bad
  1                                                     3.473   0.343    10.14    0.000    2.62
Limits on Normal Activities
  1                                                     3.140   0.424    7.40     0.000    3.98
Number of Symptoms Now*Severe Shortness of Breath
  1                                                     0.373   0.133    2.81     0.005    26.80
Number of Symptoms Now*Severe Chest Pain
  1                                                     0.4765  0.0312   15.26    0.000    1.25
Severe Shortness of Breath*Severe Sleep Disturbance
  1 1                                                   1.337   0.528    2.53     0.011    3.26
Generally Feeling Very Bad*Limits on Normal Activities
  1 1                                                   1.372   0.527    2.61     0.009    5.73

Model Summary

Statistics                      Training  Test
R-squared                       91.35%    91.23%
Root mean squared error (RMSE)  4.1562    4.1679
Mean squared error (MSE)        17.2741   17.3714
Mean absolute deviation (MAD)   3.0798    3.1011

R-squared (adj)                 91.29%
R-squared (pred)                          91.19%

Analysis of Variance

Source                                                    DF    Adj SS  Adj MS   F-Value  P-Value
Regression                                                11    279881  25443.7  1472.94  0.000
  Number of Symptoms Now                                  1     31655   31654.8  1832.51  0.000
  High Production of Phlegm                               1     5202    5201.8   301.14   0.000
  Severe Shortness of Breath                              1     11      11.1     0.64     0.424
  Severe Headache                                         1     4520    4520.0   261.66   0.000
  Severe Sleep Disturbance                                1     3239    3238.8   187.50   0.000
  Generally Feeling Very Bad                              1     1776    1775.6   102.79   0.000
  Limits on Normal Activities                             1     945     945.4    54.73    0.000
  Number of Symptoms Now*Severe Shortness of Breath       1     136     136.4    7.90     0.005
  Number of Symptoms Now*Severe Chest Pain                1     4023    4023.4   232.92   0.000
  Severe Shortness of Breath*Severe Sleep Disturbance     1     111     110.7    6.41     0.011
  Generally Feeling Very Bad*Limits on Normal Activities  1     117     117.3    6.79     0.009
Error                                                     1534  26498   17.3
  Lack-of-Fit                                             484   9247    19.1     1.16     0.025
  Pure Error                                              1050  17251   16.4
Total                                                     1545  306379

Fits and Diagnostics for Unusual Observations

Obs   Illness Severity Score  Fit      Resid    Std Resid
11    66.670                  56.757   9.913    2.40       R
13    52.380                  41.177   11.203   2.71       R
16    59.520                  48.604   10.916   2.64       R
33    50.000                  60.657   -10.657  -2.57      R
48    64.290                  55.416   8.874    2.14       R
52    61.900                  53.369   8.531    2.06       R
54    50.000                  41.598   8.402    2.03       R
56    50.000                  58.328   -8.328   -2.02      R
58    38.100                  46.485   -8.385   -2.03      R
106   59.520                  49.028   10.492   2.53       R
114   59.520                  47.160   12.360   2.99       R
128   69.050                  58.328   10.722   2.59       R
144   50.000                  40.471   9.529    2.30       R
173   47.620                  56.757   -9.137   -2.21      R
174   42.860                  34.000   8.860    2.14       R
191   42.860                  52.051   -9.191   -2.23      R
198   59.520                  48.411   11.109   2.68       R
202   73.810                  64.046   9.764    2.36       R
205   47.620                  37.559   10.061   2.43       R
213   35.710                  34.970   0.740    0.18       X
217   16.670                  19.053   -2.383   -0.58      X
239   47.620                  58.328   -10.708  -2.59      R
241   71.430                  66.311   5.119    1.25       X
243   14.290                  24.088   -9.798   -2.36      R
304   50.000                  41.130   8.870    2.14       R
307   14.290                  10.920   3.370    0.83       X
352   64.290                  51.254   13.036   3.15       R
369   38.100                  49.275   -11.175  -2.70      R
391   16.670                  32.073   -15.403  -3.72      R
392   0.000                   11.395   -11.395  -2.75      R
395   0.000                   13.934   -13.934  -3.36      R
424   40.480                  52.504   -12.024  -2.90      R
425   47.620                  34.597   13.023   3.16       R
474   47.620                  38.538   9.082    2.21       R
479   40.480                  30.896   9.584    2.31       R
489   16.670                  25.023   -8.353   -2.02      R
491   30.950                  24.348   6.602    1.61       X
493   57.140                  44.339   12.801   3.09       R
495   35.710                  25.480   10.230   2.47       R
509   38.100                  26.696   11.404   2.77       R
520   73.810                  58.328   15.482   3.75       R
537   38.100                  28.358   9.742    2.35       R
550   14.290                  24.458   -10.168  -2.45      R
583   42.860                  53.369   -10.509  -2.54      R
694   19.050                  21.817   -2.767   -0.68      X
720   59.520                  65.602   -6.082   -1.49      X
722   40.480                  32.066   8.414    2.03       R
802   30.950                  42.586   -11.636  -2.81      R
805   30.950                  39.868   -8.918   -2.16      R
814   40.480                  32.073   8.407    2.03       R
823   61.900                  48.148   13.752   3.33       R
833   33.330                  44.054   -10.724  -2.60      R
859   38.100                  49.275   -11.175  -2.70      R
868   47.620                  37.789   9.831    2.38       R
891   30.950                  19.945   11.005   2.66       R
893   28.570                  48.860   -20.290  -4.92      R
905   45.240                  55.416   -10.176  -2.46      R
924   54.760                  56.019   -1.259   -0.31      X
977   64.290                  53.107   11.183   2.72       R
983   57.140                  47.683   9.457    2.29       R
988   50.000                  44.501   5.499    1.34       X
993   73.810                  64.046   9.764    2.36       R
997   33.330                  24.458   8.872    2.14       R
1003  54.760                  45.128   9.632    2.33       R
1025  33.330                  47.705   -14.375  -3.49      R
1059  57.140                  48.663   8.477    2.05       R
1105  47.620                  37.319   10.301   2.49       R
1150  59.520                  44.339   15.181   3.67       R
1160  52.380                  40.051   12.329   2.97       R
1163  30.950                  41.598   -10.648  -2.57      R
1165  69.050                  56.757   12.293   2.97       R
1169  59.520                  49.275   10.245   2.48       R
1198  42.860                  51.516   -8.656   -2.09      R
1207  76.190                  63.534   12.656   3.07       R
1213  26.190                  40.278   -14.088  -3.41      R
1228  40.480                  50.571   -10.091  -2.45      R
1235  59.520                  50.175   9.345    2.26       R
1237  57.140                  48.239   8.901    2.15       R
1246  64.290                  55.416   8.874    2.14       R
1262  45.240                  35.957   9.283    2.24       R
1263  57.140                  43.951   13.189   3.18       R
1282  33.330                  36.011   -2.681   -0.65      X
1284  45.240                  56.564   -11.324  -2.74      R
1285  47.620                  60.657   -13.037  -3.15      R
1303  26.190                  36.567   -10.377  -2.51      R
1305  35.710                  45.499   -9.789   -2.36      R
1311  30.950                  40.089   -9.139   -2.21      R
1345  26.190                  25.105   1.085    0.26       X
1353  42.860                  53.175   -10.315  -2.49      R
1365  26.190                  17.834   8.356    2.01       R
1377  47.620                  35.222   12.398   3.00       R
1380  69.050                  55.416   13.634   3.29       R
1384  50.000                  38.496   11.504   2.78       R
1414  26.190                  35.345   -9.155   -2.21      R
1502  61.900                  50.195   11.705   2.84       R
1526  38.100                  25.450   12.650   3.05       R
1535  14.290                  24.088   -9.798   -2.36      R
1544  38.100                  29.165   8.935    2.16       R
1548  50.000                  40.455   9.545    2.31       R
1565  38.100                  42.846   -4.746   -1.16      X
1582  66.670                  55.437   11.233   2.72       R

R  Large residual
X  Unusual X

Select an alternative model

The researchers decide to examine the results for the best TreeNet® model.

  1. In the results for Discover Best Model (Continuous Response), select Select Alternative Model.
  2. In Model Type, select TreeNet®.
  3. In Select an existing model, choose the sixth model, which has the best value of R².
  4. Click Display Results.

Interpret the results

This analysis grows 300 trees and the optimal number of trees is 63. The model uses a learning rate of 0.1 and a subsample fraction of 0.7. The maximum number of terminal nodes is 6.
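To make these settings concrete, here is a toy stochastic gradient boosting loop for squared-error loss. It is a generic sketch, not Minitab's TreeNet® implementation, and it simplifies heavily: the data are simulated, there is one predictor, each tree is a two-leaf stump rather than a tree with up to 6 terminal nodes, and the number of trees is chosen with a single held-out set instead of 5-fold cross-validation. All function names are hypothetical.

```python
import random


def fit_stump(xs, residuals):
    """Fit a two-leaf regression tree on one predictor by least squares."""
    overall = sum(residuals) / len(residuals)
    best = (float("inf"), None, overall, overall)
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left value, right value)


def stump_predict(stump, x):
    threshold, left_value, right_value = stump
    if threshold is None:  # no split was possible; predict the mean
        return left_value
    return left_value if x <= threshold else right_value


def r_squared(actual, fitted):
    mean = sum(actual) / len(actual)
    sse = sum((a - f) ** 2 for a, f in zip(actual, fitted))
    sst = sum((a - mean) ** 2 for a in actual)
    return 1 - sse / sst


def boost(train, test, n_trees=300, learning_rate=0.1, subsample=0.7, seed=1):
    """Grow n_trees stumps on residuals; track held-out R-squared per tree."""
    rng = random.Random(seed)
    (train_x, train_y), (test_x, test_y) = train, test
    base = sum(train_y) / len(train_y)
    train_fit = [base] * len(train_x)
    test_fit = [base] * len(test_x)
    n_sub = max(1, round(subsample * len(train_x)))
    curve = []
    for _ in range(n_trees):
        rows = rng.sample(range(len(train_x)), n_sub)  # random subsample of rows
        stump = fit_stump([train_x[i] for i in rows],
                          [train_y[i] - train_fit[i] for i in rows])
        train_fit = [f + learning_rate * stump_predict(stump, x)
                     for f, x in zip(train_fit, train_x)]
        test_fit = [f + learning_rate * stump_predict(stump, x)
                    for f, x in zip(test_fit, test_x)]
        curve.append(r_squared(test_y, test_fit))
    best_n = max(range(n_trees), key=lambda i: curve[i]) + 1
    return best_n, curve


# Simulated data: a score that rises with a symptom count from 0 to 13.
data_rng = random.Random(0)
xs = [data_rng.randint(0, 13) for _ in range(200)]
ys = [2.5 * x + data_rng.gauss(0, 4) for x in xs]
best_n, curve = boost((xs[:150], ys[:150]), (xs[150:], ys[150:]))
print(best_n, round(curve[best_n - 1], 3))
```

The point of the sketch is the shape of the procedure: many small trees are added with a shrinkage factor (the learning rate), each fit on a random fraction of rows, and the optimal number of trees is the one with the best held-out R-squared, which is usually smaller than the number grown.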

Method

Loss function                                     Squared error
Criterion for selecting optimal number of trees   Maximum R-squared
Model validation                                  5-fold cross-validation
Learning rate                                     0.1
Subsample fraction                                0.7
Maximum terminal nodes per tree                   6
Minimum terminal node size                        3
Number of predictors selected for node splitting  Total number of predictors = 8
Rows used                                         1546
Rows unused                                       70

Response Information

Mean     StDev    Minimum  Q1     Median  Q3     Maximum
31.0110  14.0820  0        19.05  30.95   40.48  76.19

The R-squared vs Number of Trees Plot shows the entire curve over the number of trees grown. The optimal value for the test data is about 91% when the number of trees is 63.

Model Summary

Total predictors         8
Important predictors     8
Number of trees grown    300
Optimal number of trees  63

Statistics                          Training  Test
R-squared                           91.93%    90.90%
Root mean squared error (RMSE)      3.9992    4.2471
Mean squared error (MSE)            15.9932   18.0375
Mean absolute deviation (MAD)       2.9943    3.1613
Mean absolute percent error (MAPE)  0.1088    0.1130

The Model Summary table shows that the R² value when the number of trees is 63 is approximately 92% for the training data and approximately 91% for the test data.

The Relative Variable Importance graph plots the predictors in order of their effect on model improvement when splits are made on a predictor over the sequence of trees. The most important predictor variable is Number of Symptoms Now. If the contribution of the top predictor variable, Number of Symptoms Now, is 100%, then the next important variable, Limits on Normal Activities, has a contribution of 44.4%. This means Limits on Normal Activities is 44.4% as important as Number of Symptoms Now in this regression model.
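The scaling behind a relative importance value can be sketched as follows. The raw improvement scores below are hypothetical numbers chosen only so that the second predictor comes out at 44.4% of the first, as reported above; the actual raw scores are not shown in this example.

```python
def relative_importance(raw_scores):
    """Scale raw importance scores so that the top predictor is 100%."""
    top = max(raw_scores.values())
    return {name: round(100 * score / top, 1) for name, score in raw_scores.items()}


# Hypothetical raw scores (total model improvement attributed to each predictor).
raw = {
    "Number of Symptoms Now": 250.0,
    "Limits on Normal Activities": 111.0,
}

relative = relative_importance(raw)
print(relative)
```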

The scatterplot of the fitted illness scores versus actual illness scores shows the relationship between the fitted and actual values for both the training and test data. The points fall close to the reference line of y = x, which indicates that the model fits the data well.

Use the partial dependence plots to gain insight into how the important variables or pairs of variables affect the fitted response values. The partial dependence plots show whether the relationship between the response and a variable is linear, monotonic, or more complex.
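A one-predictor partial dependence curve is commonly computed by setting that predictor to each value on a grid for every row, then averaging the model's fitted values. The sketch below follows that general recipe; the stand-in model, its coefficients, and the three data rows are hypothetical and are not the TreeNet® model from this example.

```python
def partial_dependence(model, rows, feature, grid):
    """Average fitted value over all rows with `feature` forced to each grid value."""
    curve = []
    for value in grid:
        fitted = [model({**row, feature: value}) for row in rows]
        curve.append(sum(fitted) / len(fitted))
    return curve


def toy_model(row):
    """Hypothetical stand-in for a fitted model."""
    return 10 + 2.5 * row["n_symptoms"] + 5 * row["limits"]


# Hypothetical patients.
rows = [
    {"n_symptoms": 2, "limits": 0},
    {"n_symptoms": 5, "limits": 1},
    {"n_symptoms": 9, "limits": 1},
]

limits_curve = partial_dependence(toy_model, rows, "limits", [0, 1])
symptoms_curve = partial_dependence(toy_model, rows, "n_symptoms", range(0, 14))
print(limits_curve)
```

For an additive stand-in model like this one, the difference between the two points of `limits_curve` equals the model's coefficient for that predictor; for a tree-based model the curve can be nonlinear, which is what the plots are meant to reveal.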

The first plot illustrates the relationship between the illness scores and the number of symptoms the patient has now. You can hover over individual data points to see the specific x- and y-values. For instance, the highest point on the right side of the graph is when the patient has 13 symptoms and the fitted illness score is approximately 45.

The second plot illustrates that the fitted illness score increases by approximately 5 points when patients report limitations on their normal activities.

The third plot illustrates that the fitted illness score increases by approximately 5 points when patients report generally feeling very bad.

The fourth plot illustrates that the fitted illness score increases by approximately 4 points when patients report severe shortness of breath.

The last plot illustrates how the fitted illness score for a number of symptoms depends on whether the patient also has limits on their normal activities. For the same number of symptoms, patients who also report limits on their normal activities have higher fitted illness scores.