The coding schemes for categorical predictors

When you perform a regression analysis with categorical predictors, Minitab uses a coding scheme to create indicator variables from each categorical predictor. The examples in this topic use simple, balanced designs, but the interpretations are similar for more complicated models. If you add a covariate or have unequal sample sizes within each group, the coefficients are based on weighted means for each factor level instead of the arithmetic mean (the sum of the observations divided by n). The interpretation is usually the same, however:
  • Using (1, 0) coding, coefficients represent the distance between each factor level and the baseline level.
  • Using (-1, 0, +1) coding, coefficients represent the distance between each factor level and the overall mean.
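
To make the two schemes concrete, here is a minimal Python sketch of how the indicator columns could be built for a three-level factor. The function names are hypothetical, and the choice of the last level as the one coded -1 is an assumption for illustration; it matches the examples in this topic, where level C has no coefficient of its own.

```python
def reference_coding(level, levels, baseline):
    """(1, 0) coding: one indicator column per non-baseline level."""
    columns = [lv for lv in levels if lv != baseline]
    return [1 if level == lv else 0 for lv in columns]


def effects_coding(level, levels):
    """(-1, 0, +1) coding; assumes the last level is the one coded -1."""
    *columns, omitted = levels
    if level == omitted:
        return [-1] * len(columns)
    return [1 if level == lv else 0 for lv in columns]


levels = ["A", "B", "C"]
for lv in levels:
    print(lv, reference_coding(lv, levels, baseline="C"), effects_coding(lv, levels))
# A [1, 0] [1, 0]
# B [0, 1] [0, 1]
# C [0, 0] [-1, -1]
```

The two schemes differ only in the row for the omitted level: all zeros under (1, 0) coding, all -1 under (-1, 0, +1) coding.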

By default, Minitab uses the (1, 0) coding scheme for regression, but you can change to the (-1, 0, +1) coding scheme in the Coding subdialog box. For more information, go to Coding schemes for categorical predictors.

Interpret coding schemes for models that have one factor

The data for examples with one factor

First, consider a balanced, one factor design with three levels for the factor.

C1 C2 - T
Response Factor
1 A
3 A
2 A
2 A
4 B
6 B
3 B
5 B
8 C
9 C
7 C
10 C

The descriptive statistics for examples with one factor

Examine the descriptive statistics, concentrating on the means.

Statistics

Variable  Total Count   Mean
Response           12  5.000

Statistics

Variable  Factor  Total Count   Mean
Response  A                 4  2.000
          B                 4  4.500
          C                 4  8.500
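
As a cross-check, the same means can be computed directly from the worksheet data. This is a small sketch using only the Python standard library; the variable names are illustrative and not part of Minitab.

```python
from statistics import fmean

response = [1, 3, 2, 2, 4, 6, 3, 5, 8, 9, 7, 10]
factor = ["A"] * 4 + ["B"] * 4 + ["C"] * 4

# Overall mean of all 12 observations.
overall_mean = fmean(response)

# Mean of the response within each factor level.
level_means = {
    lv: fmean(y for y, f in zip(response, factor) if f == lv)
    for lv in ("A", "B", "C")
}

print(overall_mean)  # 5.0
print(level_means)   # {'A': 2.0, 'B': 4.5, 'C': 8.5}
```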

Example of interpreting the coding scheme for a cell means model (0, 1) with one factor

To get the following output:
  1. Choose Stat > Regression > Regression > Fit Regression Model.
  2. In Responses, enter Response.
  3. In Categorical predictors, enter Factor.
  4. Click Coding. Under Reference level, choose C.
  5. Click OK in each dialog.

Coefficients

Term        Coef  SE Coef  T-Value  P-Value   VIF
Constant   8.500    0.577    14.72    0.000
Factor
  A       -6.500    0.816    -7.96    0.000  1.33
  B       -4.000    0.816    -4.90    0.001  1.33

Analysis of Variance

Source      DF  Adj SS  Adj MS  F-Value  P-Value
Regression   2   86.00  43.000    32.25    0.000
  Factor     2   86.00  43.000    32.25    0.000
Error        9   12.00   1.333
Total       11   98.00

Remember that the factor level means are:
  • A = 2.0
  • B = 4.5
  • C = 8.5

The estimated regression equation is:

Regression Equation Response = 8.500 - 6.500 Factor_A - 4.000 Factor_B + 0.0 Factor_C

Level C is the baseline, and thus has a coefficient of 0. In the case of only one factor, the intercept is equal to the mean of the baseline level.

The coefficient corresponding to level A is –6.5. It is the difference between the mean for level A and the mean for the baseline level. If you add the intercept (the baseline mean) to the coefficient for A, you get the mean for level A: –6.5 + 8.5 = 2.0

Similarly, the coefficient corresponding to level B is –4.0. It is the difference between the mean for level B and the mean for the baseline level. If you add the intercept to the coefficient for level B, you get the mean for level B: –4.0 + 8.5 = 4.5
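
The relationship between the coefficients and the level means can be checked with an ordinary least squares fit. The following is a minimal sketch in Python with NumPy, not Minitab's own procedure; the hand-built design matrix and the variable names are assumptions for illustration.

```python
import numpy as np

response = np.array([1, 3, 2, 2, 4, 6, 3, 5, 8, 9, 7, 10], dtype=float)
factor = np.array(["A"] * 4 + ["B"] * 4 + ["C"] * 4)

# (1, 0) coding with C as the baseline: intercept, indicator for A, indicator for B.
X = np.column_stack([
    np.ones(len(response)),
    (factor == "A").astype(float),
    (factor == "B").astype(float),
])

coef, *_ = np.linalg.lstsq(X, response, rcond=None)
constant, coef_a, coef_b = coef

# Error sum of squares from the fitted values.
error_ss = float(np.sum((response - X @ coef) ** 2))

print(np.round(coef, 3))    # Constant, A, B: about 8.5, -6.5, -4.0
print(round(error_ss, 3))   # about 12.0, the Error Adj SS in the output
```

Adding `constant` to `coef_a` or `coef_b` recovers the means for levels A and B, as described above.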

Example of interpreting the coding scheme for a factor effects model (-1, 0, +1) with one factor

To get the following output:
  1. Choose Stat > Regression > Regression > Fit Regression Model.
  2. In Responses, enter Response.
  3. In Categorical predictors, enter Factor.
  4. Click Coding. Under Coding for categorical predictors, choose (-1, 0, +1).
  5. Click OK in each dialog.

Regression Analysis: Response versus Factor

Analysis of Variance

Source      DF  Adj SS  Adj MS  F-Value  P-Value
Regression   2   86.00  43.000    32.25    0.000
  Factor     2   86.00  43.000    32.25    0.000
Error        9   12.00   1.333
Total       11   98.00

Coefficients

Term        Coef  SE Coef  T-Value  P-Value   VIF
Constant   5.000    0.333    15.00    0.000
Factor
  A       -3.000    0.471    -6.36    0.000  1.33
  B       -0.500    0.471    -1.06    0.316  1.33

Remember the overall mean and the factor level means:
  • Overall Mean = 5.0
  • A = 2.0
  • B = 4.5
  • C = 8.5

The regression equation is:

Regression Equation Response = 5.000 - 3.000 Factor_A - 0.500 Factor_B + 3.500 Factor_C
The effect for any specific factor level is the Level Mean – Overall Mean. Thus,
  • Level A effect = 2.0 - 5.0 = -3.0
  • Level B effect = 4.5 - 5.0 = -0.5
  • Level C effect = 8.5 - 5.0 = 3.5

The intercept is the overall mean.

The coefficient for A is the effect for factor level A. It is the difference between the mean for level A and the overall mean.

The coefficient for B is the effect for factor level B. It is the difference between the mean for level B and the overall mean.

You can obtain the effect for level C by adding the coefficients (excluding the intercept) and multiplying the sum by -1: -1 * [(-3.0) + (-0.5)] = 3.5

You can get the level means by adding the overall mean (the intercept) to each effect:
  • Mean for Level A = coefficient for A + Intercept = -3.0 + 5.0 = 2.0
  • Mean for Level B = coefficient for B + Intercept = -0.5 + 5.0 = 4.5
  • Mean for Level C = Intercept - coefficient for A - coefficient for B = 5.0 – (-3.0) – (-0.5) = 5.0 + 3.0 + 0.5 = 8.5
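
The same results can be reproduced by fitting the model with (-1, 0, +1) indicator columns. This is a hedged sketch in Python with NumPy, assuming that level C is the level coded -1 (as the output above implies); the variable names are illustrative.

```python
import numpy as np

response = np.array([1, 3, 2, 2, 4, 6, 3, 5, 8, 9, 7, 10], dtype=float)
factor = np.array(["A"] * 4 + ["B"] * 4 + ["C"] * 4)

# (-1, 0, +1) coding: rows for level C get -1 in every factor column.
col_a = np.where(factor == "A", 1.0, np.where(factor == "C", -1.0, 0.0))
col_b = np.where(factor == "B", 1.0, np.where(factor == "C", -1.0, 0.0))
X = np.column_stack([np.ones(len(response)), col_a, col_b])

coef, *_ = np.linalg.lstsq(X, response, rcond=None)
constant, effect_a, effect_b = coef

# The effects sum to zero, so the omitted level's effect is the negative sum.
effect_c = -(effect_a + effect_b)

level_means = {
    "A": constant + effect_a,
    "B": constant + effect_b,
    "C": constant + effect_c,
}

print(np.round(coef, 3))   # Constant, A, B: about 5.0, -3.0, -0.5
print(round(effect_c, 3))  # about 3.5
```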

Interpret coding schemes for the two factor case

The data for examples with two factors

Now consider a balanced, two factor design with three levels for the first factor and two levels for the second factor.

C1 C2 - T C3 - T
Response Factor 1 Factor 2
1 A High
3 A Low
2 A High
2 A Low
4 B High
6 B Low
3 B High
5 B Low
8 C High
9 C Low
7 C High
10 C Low

The descriptive statistics for examples with two factors

Examine the descriptive statistics, concentrating on the means.

Rows: Factor 1   Columns: Factor 2

      High    Low    All
A    1.500  2.500  2.000
B    3.500  5.500  4.500
C    7.500  9.500  8.500
All  4.167  5.833  5.000

Cell Contents
  Response : Mean
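
The cell and marginal means in this table can be recomputed from the worksheet. The sketch below uses only the Python standard library; the variable names are illustrative.

```python
from collections import defaultdict
from statistics import fmean

response = [1, 3, 2, 2, 4, 6, 3, 5, 8, 9, 7, 10]
factor1 = ["A"] * 4 + ["B"] * 4 + ["C"] * 4
factor2 = ["High", "Low"] * 6  # rows alternate High, Low in the worksheet

# Group the observations by (Factor 1, Factor 2) combination.
cells = defaultdict(list)
for y, f1, f2 in zip(response, factor1, factor2):
    cells[(f1, f2)].append(y)

cell_means = {key: fmean(values) for key, values in cells.items()}

# Marginal means for Factor 2 (the "All" row of the table).
factor2_means = {
    lv: fmean(y for y, f2 in zip(response, factor2) if f2 == lv)
    for lv in ("High", "Low")
}

for key in sorted(cell_means):
    print(key, cell_means[key])
print({lv: round(m, 3) for lv, m in factor2_means.items()})
```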

Example of interpreting the coding scheme for a cell means model (0, 1) with two factors

To get the following output:
  1. Choose Stat > Regression > Regression > Fit Regression Model.
  2. In Responses, enter Response.
  3. In Categorical predictors, enter Factor 1 and Factor 2.
  4. Click Coding. Under Coding for categorical predictors, choose (1, 0).
  5. Under Reference level, choose C for Factor 1 and Low for Factor 2.
  6. Click OK in each dialog.

Coefficients

Term        Coef  SE Coef  T-Value  P-Value   VIF
Constant   9.333    0.391    23.88    0.000
Factor 1
  A       -6.500    0.479   -13.58    0.000  1.33
  B       -4.000    0.479    -8.36    0.000  1.33
Factor 2
  High    -1.667    0.391    -4.26    0.003  1.00

Analysis of Variance

Source         DF   Adj SS   Adj MS  F-Value  P-Value
Regression      3  94.3333  31.4444    68.61    0.000
  Factor 1      2  86.0000  43.0000    93.82    0.000
  Factor 2      1   8.3333   8.3333    18.18    0.003
Error           8   3.6667   0.4583
  Lack-of-Fit   2   0.6667   0.3333     0.67    0.548
  Pure Error    6   3.0000   0.5000
Total          11  98.0000

Remember that the factor level means are:
  • A = 2.0
  • B = 4.5
  • C = 8.5
  • High = 4.1667
  • Low = 5.8333

The estimated regression equation is:

Regression Equation Response = 9.333 - 6.500 Factor 1_A - 4.000 Factor 1_B + 0.0 Factor 1_C - 1.667 Factor 2_High + 0.0 Factor 2_Low

Again, the coefficient corresponding to level A is –6.5. This is still the distance that level A is from the baseline level (Level C). If you take the mean for level A and subtract from it the mean for the baseline level, you get the coefficient: 2 – 8.5 = -6.5.

Similarly, the coefficient corresponding to level B is still –4.0. It is the distance that level B is from the baseline level for factor 1. If you take the mean for level B and subtract from it the mean for the baseline level, you get the coefficient: 4.5 - 8.5 = -4.0.

Finally, the coefficient corresponding to the High level of factor 2 is the distance that “High” is from the baseline level for factor 2 (Low). So, if you take the mean for the High level of factor 2 and subtract from it the mean for the baseline level for factor 2, you get the coefficient: 4.1667 – 5.8333 = -1.667.
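
A least squares fit with (1, 0) indicators for both factors reproduces these coefficients. The sketch below uses Python with NumPy as a stand-in for Minitab; the design matrix and variable names are assumptions for illustration. It also shows that, in this balanced model without an interaction term, the constant is the fitted value for the combination of both baseline levels (C and Low), not the mean of a single level.

```python
import numpy as np

response = np.array([1, 3, 2, 2, 4, 6, 3, 5, 8, 9, 7, 10], dtype=float)
factor1 = np.array(["A"] * 4 + ["B"] * 4 + ["C"] * 4)
factor2 = np.array(["High", "Low"] * 6)

# (1, 0) coding with C and Low as the reference levels.
X = np.column_stack([
    np.ones(len(response)),
    (factor1 == "A").astype(float),
    (factor1 == "B").astype(float),
    (factor2 == "High").astype(float),
])

coef, *_ = np.linalg.lstsq(X, response, rcond=None)
constant, coef_a, coef_b, coef_high = coef

# For a balanced additive model: mean(C) + mean(Low) - overall mean.
baseline_fit = (
    response[factor1 == "C"].mean()
    + response[factor2 == "Low"].mean()
    - response.mean()
)

print(np.round(coef, 3))       # about 9.333, -6.5, -4.0, -1.667
print(round(baseline_fit, 3))  # about 9.333, equal to the constant
```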

Example of interpreting the coding scheme for a factor effects model (-1, 0, +1) with two factors

To get the following output:
  1. Choose Stat > Regression > Regression > Fit Regression Model.
  2. In Responses, enter Response.
  3. In Categorical predictors, enter Factor 1 and Factor 2.
  4. Click Coding. Under Coding for categorical predictors, choose (-1, 0, +1).
  5. Click OK in each dialog.

Regression Analysis: Response versus Factor 1

Analysis of Variance

Source      DF  Adj SS  Adj MS  F-Value  P-Value
Regression   2   86.00  43.000    32.25    0.000
  Factor 1   2   86.00  43.000    32.25    0.000
Error        9   12.00   1.333
Total       11   98.00

Coefficients

Term        Coef  SE Coef  T-Value  P-Value   VIF
Constant   5.000    0.333    15.00    0.000
Factor 1
  A       -3.000    0.471    -6.36    0.000  1.33
  B       -0.500    0.471    -1.06    0.316  1.33

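Notice that with this coding scheme, the coefficients for Factor 1 are the same as in the one factor model. The two factor model also has an additional coefficient for the second factor, which the output above does not list. Because the design is balanced, the coefficient for the High level of Factor 2 equals its level effect: 4.1667 – 5.0 = -0.833.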

Now consider the overall mean and the factor level means:
  • Overall Mean = 5.0
  • A = 2.0
  • B = 4.5
  • C = 8.5
  • High = 4.1667
  • Low = 5.8333

The regression equation is:

Regression Equation Response = 5.000 - 3.000 Factor 1_A - 0.500 Factor 1_B + 3.500 Factor 1_C - 0.833 Factor 2_High + 0.833 Factor 2_Low
The effect for any specific factor level is the Level Mean – Overall Mean. Thus,
  • Level A effect = 2.0 - 5.0 = -3.0
  • Level B effect = 4.5 - 5.0 = -0.5
  • Level C effect = 8.5 - 5.0 = 3.5
  • Level High effect = 4.1667 – 5.0 = -0.833
  • Level Low effect = 5.8333 – 5.0 = 0.833
Note

When a factor has only two levels and the sample sizes are equal, the two level effects are equal in magnitude and opposite in sign because the overall mean lies exactly midway between the two level means.

The intercept is the overall mean.

The coefficients are the effect for each factor level. They represent the difference between the mean for that level and the overall mean.
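
To confirm these effects, including the one for Factor 2, the two factor model can be fit with (-1, 0, +1) columns for both factors. This is a minimal sketch in Python with NumPy rather than Minitab's own procedure; it assumes that C and Low are the levels coded -1, and the variable names are illustrative.

```python
import numpy as np

response = np.array([1, 3, 2, 2, 4, 6, 3, 5, 8, 9, 7, 10], dtype=float)
factor1 = np.array(["A"] * 4 + ["B"] * 4 + ["C"] * 4)
factor2 = np.array(["High", "Low"] * 6)

# (-1, 0, +1) coding: C is coded -1 for Factor 1, Low is coded -1 for Factor 2.
col_a = np.where(factor1 == "A", 1.0, np.where(factor1 == "C", -1.0, 0.0))
col_b = np.where(factor1 == "B", 1.0, np.where(factor1 == "C", -1.0, 0.0))
col_high = np.where(factor2 == "High", 1.0, -1.0)
X = np.column_stack([np.ones(len(response)), col_a, col_b, col_high])

coef, *_ = np.linalg.lstsq(X, response, rcond=None)
constant, effect_a, effect_b, effect_high = coef

# Effects for the omitted levels follow from the sum-to-zero constraint.
effect_c = -(effect_a + effect_b)
effect_low = -effect_high

print(np.round(coef, 3))     # about 5.0, -3.0, -0.5, -0.833
print(round(effect_c, 3))    # about 3.5
print(round(effect_low, 3))  # about 0.833
```

The Factor 1 coefficients match the one factor fit, and the High coefficient is the High level mean minus the overall mean.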
