Model reduction

Model reduction is the elimination of terms from a model, such as the term for a predictor variable or for an interaction between predictor variables. Model reduction lets you simplify a model and can increase the precision of its predictions. You can reduce models in many of Minitab's command groups, including regression, ANOVA, DOE, and reliability.

One criterion for model reduction is the statistical significance of a term. Eliminating statistically insignificant terms increases the precision of the model's predictions. To use the statistical significance criterion, first choose a significance level, such as 0.05 or 0.15. Then, try different combinations of terms to find a model that has as many statistically significant terms as possible but no statistically insignificant terms. The data must provide enough degrees of freedom to estimate statistical significance after you fit the model. You can apply the criterion manually, or automatically with an algorithmic procedure such as stepwise regression. The purpose of the criterion is to find a model that meets your goals; however, the statistical significance criterion does not always produce a single best model.
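If you want to reproduce the manual backward-elimination procedure outside of Minitab, a minimal sketch in Python with statsmodels follows. This is an illustration of the procedure described above, not Minitab's stepwise algorithm, and the data objects are hypothetical.

```python
# A minimal sketch of backward elimination by p-value.
import statsmodels.api as sm

def backward_eliminate(y, X, alpha=0.05):
    """Refit repeatedly, dropping the least significant term each pass."""
    X = sm.add_constant(X)
    while True:
        fit = sm.OLS(y, X).fit()
        pvalues = fit.pvalues.drop("const")   # ignore the intercept
        if pvalues.empty or pvalues.max() <= alpha:
            return fit                        # all remaining terms significant
        # Remove exactly one insignificant term per pass, then refit.
        X = X.drop(columns=pvalues.idxmax())
```

For the heat flux example below, the call might look like backward_eliminate(df["HeatFlux"], df[["East", "South", "North", "Insolation", "TimeOfDay"]]); the DataFrame and its column names are assumptions.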

Besides the statistical significance criterion, other statistical criteria that Minitab calculates for models include S, adjusted R², predicted R², PRESS, Mallows' Cp, and the Akaike Information Criterion (AIC). You can consider one or more of these criteria when you reduce a model.
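As a sketch of how several of these criteria follow from a fitted least-squares model, the helper below derives S, adjusted R², predicted R² (via PRESS), and AIC from a statsmodels fit. The formulas are the standard OLS ones; note that statsmodels' AIC convention is not guaranteed to match Minitab's, and Mallows' Cp is omitted because it also requires the full model.

```python
# A sketch of model-comparison statistics for a fitted OLS model;
# 'fit' is assumed to come from sm.OLS(y, X).fit().
import numpy as np

def model_criteria(fit):
    h = fit.get_influence().hat_matrix_diag        # leverage of each point
    press = np.sum((fit.resid / (1.0 - h)) ** 2)   # PRESS from LOO residuals
    sst = np.sum((fit.model.endog - fit.model.endog.mean()) ** 2)
    return {
        "S": np.sqrt(fit.mse_resid),               # residual standard error
        "R-sq(adj)": fit.rsquared_adj,
        "R-sq(pred)": 1.0 - press / sst,           # predicted R-squared
        "PRESS": press,
        "AIC": fit.aic,
    }
```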

Like stepwise regression, best subsets regression is an algorithmic procedure you can use to find a model that meets your goals. Best subsets regression examines every possible subset of the predictors and identifies the models that have the highest R² values. In Minitab, best subsets regression also displays other statistics, such as adjusted R² and predicted R², and you can consider these statistics when you compare models. Because best subsets regression ranks models by R², the models it identifies as best might or might not contain only statistically significant terms. Other criteria to consider as you reduce a model include multicollinearity and hierarchy; these two concepts are discussed in more detail below.
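A brute-force sketch of best subsets regression follows: it fits every nonempty subset of predictors (2^p − 1 models, feasible only for small p) and sorts by R². As before, statsmodels stands in for Minitab's implementation, and the objects passed in are hypothetical.

```python
# A sketch of best subsets regression by exhaustive enumeration.
from itertools import combinations
import statsmodels.api as sm

def best_subsets(y, X):
    results = []
    cols = list(X.columns)
    for k in range(1, len(cols) + 1):
        for subset in combinations(cols, k):
            fit = sm.OLS(y, sm.add_constant(X[list(subset)])).fit()
            results.append((subset, fit.rsquared, fit.rsquared_adj))
    # Highest R-squared first; compare adjusted R-squared across sizes.
    return sorted(results, key=lambda r: r[1], reverse=True)
```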

Statistics that measure how well the model fits the data can help you find a useful model. However, you should also use process knowledge and good judgment to decide which terms to eliminate. Some terms might be essential, whereas other terms might be too costly or too difficult to measure.

Example of reducing a model in a simple case

Technicians measure the total heat flux as part of a solar thermal energy test. An energy engineer wants to determine how total heat flux is predicted by other variables: insolation, the position of the focal points in the east, south, and north directions, and the time of day. Using the full regression model, the engineer determines the following relationship between heat flux and the variables.

Regression Equation

Heat Flux = 325.4 + 2.55 East + 3.80 South - 22.95 North + 0.0675 Insolation + 2.42 Time of Day

Coefficients

Term            Coef  SE Coef  T-Value  P-Value   VIF
Constant       325.4     96.1     3.39    0.003
East            2.55     1.25     2.04    0.053  1.36
South           3.80     1.46     2.60    0.016  3.18
North         -22.95     2.70    -8.49    0.000  2.61
Insolation    0.0675   0.0290     2.33    0.029  2.32
Time of Day     2.42     1.81     1.34    0.194  5.37

The engineer wants to eliminate as many insignificant terms as possible to maximize the precision of predictions. The engineer decides to use 0.05 as the threshold for statistical significance. The p-value for Time of Day (0.194) is the highest p-value that is greater than 0.05, so the engineer removes this term first. The engineer repeats the regression, removing one insignificant term each time, until only statistically significant terms remain. The final reduced model is as follows:

Regression Equation

Heat Flux = 483.7 + 4.796 South - 24.22 North

Coefficients

Term        Coef  SE Coef  T-Value  P-Value   VIF
Constant   483.7     39.6    12.22    0.000
South      4.796    0.951     5.04    0.000  1.09
North     -24.22     1.94   -12.48    0.000  1.09

Multicollinearity

Multicollinearity in regression is a condition that occurs when some predictor variables in the model are correlated with other predictor variables. Severe multicollinearity is problematic because it can increase the variance of the regression coefficients, making them unstable. When you remove a term that has high multicollinearity, the statistical significance and values of the coefficients of highly correlated terms can change considerably. Thus, in the presence of multicollinearity, examining multiple statistics and changing the model one term at a time are even more important. Usually, you reduce as much multicollinearity as possible before you reduce a model. For more information on ways to reduce multicollinearity, go to Multicollinearity in regression.
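If you want to screen for multicollinearity in your own code, a sketch of a VIF table using statsmodels follows; this is the same VIF statistic that appears in the Minitab output in this article. The DataFrame of predictors is hypothetical.

```python
# A sketch of a VIF table for a DataFrame of predictors X.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X):
    exog = sm.add_constant(X)   # compute VIFs with an intercept in the model
    return pd.Series(
        [variance_inflation_factor(exog.values, i)
         for i in range(1, exog.shape[1])],   # skip the constant column
        index=X.columns, name="VIF",
    )
```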

Example of how multicollinearity interferes with the statistical significance criterion

A team at a medical facility develops a model to predict patient satisfaction scores. The model has several variables, including the time patients are with a practitioner and the time patients are in medical tests. With both of these variables in the model, the multicollinearity is high, with VIF (variance inflation factor) values of 8.91. Values greater than 5 usually indicate severe multicollinearity. The p-value for the amount of time that patients are with a practitioner is 0.105, which is not significant at the 0.05 level. The predicted R² value for this model is 22.9%.

Regression Analysis: Satisfaction versus Practitioner Time, Test Time

Model Summary

       S    R-sq  R-sq(adj)  R-sq(pred)
0.951953  28.68%     25.64%      22.91%

Coefficients

Term                 Coef  SE Coef  T-Value  P-Value   VIF
Constant           -0.078    0.156    -0.50    0.618
Practitioner Time  0.1071   0.0648     1.65    0.105  8.91
Test Time          -0.516    0.178    -2.90    0.006  8.91

With only test time in the model, the predicted R² value drops from 22.9% to 10.6%. Although the time patients are with a practitioner is not statistically significant at the 0.05 level, including that variable more than doubles the predicted R² value. The high multicollinearity could be hiding the importance of this predictor.

Regression Analysis: Satisfaction versus Test Time

Model Summary

       S    R-sq  R-sq(adj)  R-sq(pred)
0.968936  24.54%     22.96%      10.61%

Coefficients

Term         Coef  SE Coef  T-Value  P-Value   VIF
Constant   -0.162    0.150    -1.08    0.285
Test Time -0.2395   0.0606    -3.95    0.000  1.00
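Reusing the hypothetical model_criteria() helper sketched earlier, the team's predicted R² comparison could be reproduced as follows; the DataFrame df and its column names are assumptions.

```python
# A sketch comparing predicted R-squared for the two candidate models;
# assumes df holds the satisfaction data and model_criteria() is in scope.
import statsmodels.api as sm

full = sm.OLS(df["Satisfaction"],
              sm.add_constant(df[["PractitionerTime", "TestTime"]])).fit()
reduced = sm.OLS(df["Satisfaction"],
                 sm.add_constant(df[["TestTime"]])).fit()

for name, fit in [("full", full), ("reduced", reduced)]:
    print(name, round(model_criteria(fit)["R-sq(pred)"], 4))
```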

Hierarchy

A hierarchical model is a model in which, for each term in the model, all lower-order terms that it contains are also in the model. For example, suppose a model has four factors: A, B, C, and D. If the term A*B*C is in the model, then the terms A, B, C, A*B, A*C, and B*C must also be in the model. Terms with D do not have to be in the model because D is not in the term A*B*C. The hierarchical structure applies to nesting as well: if B(A) is in the model, then A must also be in the model for the model to be hierarchical.
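As an illustration of this definition, here is a small sketch that checks whether a set of crossed terms is hierarchical, with each term represented as a frozenset of factor names (a convenience chosen here; the nesting case is not covered).

```python
# A sketch of a hierarchy check; A*B*C is written as {"A", "B", "C"}.
from itertools import combinations

def is_hierarchical(terms):
    model = {frozenset(t) for t in terms}
    for term in model:
        for k in range(1, len(term)):
            for sub in combinations(term, k):
                if frozenset(sub) not in model:
                    return False   # a lower-order term is missing
    return True

# The example from the text: A*B*C requires A, B, C, A*B, A*C, and B*C.
full = [{"A"}, {"B"}, {"C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
print(is_hierarchical(full))                                  # True
print(is_hierarchical([t for t in full if t != {"A", "B"}]))  # False
```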

Hierarchy is desirable because hierarchical models can be translated from standardized units to unstandardized units. Standardized units are common when the model includes higher-order terms such as interactions, because standardization reduces the multicollinearity that these terms cause.
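A small simulated illustration of that last claim, again assuming statsmodels: centering two predictors sharply lowers the VIF of their interaction. The data is synthetic and purely illustrative.

```python
# A sketch showing that centering (a simple form of coding) reduces the
# multicollinearity that an interaction term causes.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
x1 = rng.uniform(150, 250, 60)   # e.g. an injection pressure
x2 = rng.uniform(80, 120, 60)    # e.g. an injection temperature

def interaction_vif(a, b):
    exog = sm.add_constant(pd.DataFrame({"a": a, "b": b, "ab": a * b}))
    return variance_inflation_factor(exog.values, 3)   # VIF of the a*b column

print(interaction_vif(x1, x2))                          # very large
print(interaction_vif(x1 - x1.mean(), x2 - x2.mean()))  # close to 1
```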

Because hierarchy is desirable, hierarchical model reduction is common. For example, one strategy is to combine the p-value criterion with hierarchy. First, you remove the most complex statistically insignificant terms. If a statistically insignificant term is part of an interaction term or another higher-order term that remains in the model, then the term stays in the model. Minitab's stepwise model selection can combine the hierarchy criterion with the statistical significance criterion.
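A sketch of how these two criteria can combine: the helper below chooses the next term to remove, treating a term as eligible only when no remaining higher-order term contains it. Refitting after each removal is left to the caller, and the frozenset representation follows the hierarchy-check sketch above.

```python
def next_term_to_remove(pvalues, alpha=0.05):
    """pvalues maps each term (a frozenset of factors) to its p-value.
    Returns the term to drop next, or None when reduction should stop."""
    if not pvalues:
        return None
    terms = set(pvalues)
    # Hierarchy: a term is removable only if no remaining term contains it.
    eligible = [t for t in terms if not any(t < other for other in terms)]
    worst = max(eligible, key=lambda t: pvalues[t])
    return worst if pvalues[worst] > alpha else None
```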

Example of hierarchical model reduction

A materials engineer for a building products manufacturer is developing a new insulation product. The engineer designs a 2-level full factorial experiment to assess several factors that could affect the insulating value of the insulation. The engineer includes interactions in the model to determine whether the effects of the factors depend on each other. Because interactions create multicollinearity, the engineer codes the predictors to reduce the multicollinearity.

The highest p-value for the first model that the engineer examines is 0.985 for the interaction between injection temperature and material. Below the table of coded coefficients, the engineer can examine the regression equation in uncoded units. The regression equation helps the engineer to understand the size of the effects in the same units as the data.

Regression Analysis: Insulation versus InjPress, InjTemp, CoolTemp, Material

Coded Coefficients

Term                                    Coef  SE Coef  T-Value  P-Value   VIF
Constant                              17.463    0.203    86.13    0.007
InjPress                               1.835    0.203     9.05    0.070  2.00
InjTemp                                1.276    0.203     6.29    0.100  2.00
CoolTemp                               2.173    0.203    10.72    0.059  2.00
Material Formula2                      5.192    0.287    18.11    0.035  1.00
InjPress*InjTemp                      -0.036    0.203    -0.18    0.887  2.00
InjPress*CoolTemp                      0.238    0.203     1.17    0.449  2.00
InjTemp*CoolTemp                       1.154    0.203     5.69    0.111  2.00
InjPress*Material Formula2            -0.198    0.287    -0.69    0.615  2.00
InjTemp*Material Formula2             -0.007    0.287    -0.02    0.985  2.00
CoolTemp*Material Formula2            -0.898    0.287    -3.13    0.197  2.00
InjPress*InjTemp*CoolTemp              0.100    0.143     0.70    0.611  1.00
InjPress*InjTemp*Material Formula2     0.181    0.287     0.63    0.642  2.00
InjPress*CoolTemp*Material Formula2   -0.385    0.287    -1.34    0.408  2.00
InjTemp*CoolTemp*Material Formula2    -0.229    0.287    -0.80    0.570  2.00

Regression Equation in Uncoded Units

Material
Formula1  Insulation = 26.6 + 0.154 InjPress - 0.213 InjTemp - 0.906 CoolTemp
                       - 0.00138 InjPress*InjTemp - 0.00267 InjPress*CoolTemp
                       + 0.01137 InjTemp*CoolTemp
                       + 0.000036 InjPress*InjTemp*CoolTemp

Formula2  Insulation = 28.3 + 0.125 InjPress - 0.179 InjTemp - 0.597 CoolTemp
                       - 0.00073 InjPress*InjTemp - 0.00369 InjPress*CoolTemp
                       + 0.00831 InjTemp*CoolTemp
                       + 0.000036 InjPress*InjTemp*CoolTemp

If the engineer uses only the p-value criterion to reduce the model, the next model is non-hierarchical: the removed term, InjTemp*Material (p-value = 0.985), is a two-factor interaction that is part of the three-factor interaction InjTemp*CoolTemp*Material, which stays in the model. Because the model is non-hierarchical, uncoded coefficients do not exist, so Minitab displays the regression equation for the non-hierarchical model in coded units. The coded regression equation does not provide information about the size of the effects in the same units as the data.

Regression Analysis: Insulation versus InjPress, InjTemp, CoolTemp, Material

Coded Coefficients

Term                                    Coef  SE Coef  T-Value  P-Value   VIF
Constant                              17.463    0.143   121.77    0.000
InjPress                               1.835    0.143    12.80    0.006  2.00
InjTemp                                1.272    0.101    12.55    0.006  1.00
CoolTemp                               2.173    0.143    15.15    0.004  2.00
Material Formula2                      5.192    0.203    25.60    0.002  1.00
InjPress*InjTemp                      -0.036    0.143    -0.25    0.824  2.00
InjPress*CoolTemp                      0.238    0.143     1.66    0.239  2.00
InjTemp*CoolTemp                       1.154    0.143     8.04    0.015  2.00
InjPress*Material Formula2            -0.198    0.203    -0.98    0.431  2.00
CoolTemp*Material Formula2            -0.898    0.203    -4.43    0.047  2.00
InjPress*InjTemp*CoolTemp              0.100    0.101     0.99    0.427  1.00
InjPress*InjTemp*Material Formula2     0.181    0.203     0.89    0.466  2.00
InjPress*CoolTemp*Material Formula2   -0.385    0.203    -1.90    0.198  2.00
InjTemp*CoolTemp*Material Formula2    -0.229    0.203    -1.13    0.375  2.00

Regression Equation in Coded Units

Material
Formula1  Insulation = 17.463 + 1.835 InjPress + 1.272 InjTemp + 2.173 CoolTemp
                       - 0.036 InjPress*InjTemp + 0.238 InjPress*CoolTemp
                       + 1.154 InjTemp*CoolTemp
                       + 0.100 InjPress*InjTemp*CoolTemp

Formula2  Insulation = 22.655 + 1.637 InjPress + 1.272 InjTemp + 1.275 CoolTemp
                       + 0.145 InjPress*InjTemp - 0.147 InjPress*CoolTemp
                       + 0.924 InjTemp*CoolTemp
                       + 0.100 InjPress*InjTemp*CoolTemp

Instead of using only the p-value criterion, the engineer decides to remove the most complex terms that have high p-values first. In this model, rather than removing the term with the highest p-value overall, the engineer removes the 3-way interaction that has the highest p-value: the interaction between injection pressure, injection temperature, and material, with a p-value of 0.466.