Methods and formulas for test for 2x2 tables for Cross Tabulation and Chi-Square

Select the method or formula of your choice.

In This Topic

Fisher's exact test
McNemar's exact test
Cochran-Mantel-Haenszel test

Fisher's exact test

Fisher's exact test is a test of independence. The test is based on an exact distribution rather than on the approximate chi-square distribution used for the Pearson's and likelihood ratio tests. Fishers exact test is useful when the expected cell counts are low and the chi-square approximation is not very good.

Formula

The p-value is based on a hypergeometric distribution with the following parameters:

population size: total number of observations
number of successes in the population: number of observations in the first row
sample size: number of observations in the first column

The p-value is for a two-sided alternative. The computation is the sum of the hypergeometric probabilities over the values from 0 to the population size that are less than or equal to the probability of the observed value.

Example

Suppose you want to calculate the p-value for Fisher's exact test for cookie preference between adults and children.

	Child	Adult	Row Total
Sugar	9	1	10
Chocolate Chip	2	8	10
Column Total	11	9	20

Fisher showed that the probability of obtaining this set of values follows the hypergeometric distribution

where in the example looks like this:

	Child	Adult	Row Total
Sugar	a	b	a+b
Chocolate Chip	c	d	c+d
Column Total	a+c	b+d	a+b+c+d

In the case of a 2x2 matrix, you can calculate the p-value of the test by summing all the p-values which are less than the conditional probability of the actual matrix. Thus, p_cutoff.

For this example, the sum of the p-values that are less than or equal to the p_cutoff for the other possible matrices is 0.0054775.

McNemar's exact test

McNemar's test compares proportions that are observed before and after a treatment. For example, you can use McNemar's test to determine whether a training program changes the proportion of participants who correctly answer a question.

Observations for McNemar's test can be summarized in a two-by-two table, as shown below.

	After Treatment
Before Treatment	Condition True	Condition Not True	Total
Condition True	n₁₁	n₁₂	n_{1_.}
Condition Not True	n₂₁	n₂₂	n_{2_.}
Total	n_·1	n_·2	n_··

The condition for the training example is a correct answer. Thus, n₂₁ represents the number of participants who answer the question correctly after training but not before training. And n₁₂ represents the number of participants who answer the question correctly before training but not after training. The total number of participants is represented by n_...

Estimated difference

Let δ be the difference between the marginal probabilities, p_1.- p_.1, in the population. The estimated difference, , is given by the following formula:

Confidence interval

An approximate 100(1 – α)% confidence interval is given by the following formula:

where α is the significance level for the test, z _α/2 is the z-score associated with a tail probability of α/2, and SE is given by the following formula:

P-value

The null hypothesis is δ = 0. The exact p-value for the test of the null hypothesis is calculated as:

where X is a random variable that is drawn from a binomial distribution with an event probability of 0.5 and a number of trials equal to n₂₁ + n₁₂.

Cochran-Mantel-Haenszel test

The test assumes no three-way interaction exists. The purpose of the test is to assess the degree of relationship between two dichotomous variables while controlling for a nuisance variable. The CMH statistic is compared with a chi-square percentile with one degree of freedom.

The Cochran-Mantel-Haenszel (CMH) test applies only if three or more classification variables exist, and the first two variables have two levels each. All variables beyond the first two are treated as a single variable Z for the purposes of the CMH test, with each combination of levels treated as a level of Z.

Formula

Notation

Term	Description
k	level of Z
n_11k	number of observations in first row, first column
n_1+k	number of observations in first row
n_+1k	number of observations in first column
n_++k	total number of observations
n_2+k	number of observations in second row
n_+2k	number of observations in second column