Data considerations for Chi-Square Test for Association

To ensure that your results are valid, consider the following guidelines when you collect data, perform the analysis, and interpret your results.

Collect raw data or summary data

You can use two columns of raw data or summarized data in the form of a contingency table. If your data are in frequency form, with one row for each combination of categories and a separate column of counts, use Cross Tabulation and Chi-Square instead.

Note

Missing values are not allowed in a contingency table.
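The relationship between the two input formats can be illustrated with a short sketch that tallies two columns of raw data into a contingency table. This is not how Minitab does it internally; the function name `contingency_table`, the sample invoice data, and the choice to drop pairs that contain a missing value are all assumptions made for illustration.

```python
from collections import Counter

def contingency_table(rows, cols):
    """Tally paired categorical observations into a table of counts.

    Hypothetical helper: pairs where either value is None are dropped,
    which is an assumption of this sketch, not documented behavior.
    """
    if len(rows) != len(cols):
        raise ValueError("both columns must have the same length")
    pairs = [(r, c) for r, c in zip(rows, cols)
             if r is not None and c is not None]
    counts = Counter(pairs)
    # Sort the levels so the table layout is deterministic.
    row_levels = sorted({r for r, _ in pairs})
    col_levels = sorted({c for _, c in pairs})
    # Counter returns 0 for combinations that were never observed.
    table = [[counts[(r, c)] for c in col_levels] for r in row_levels]
    return row_levels, col_levels, table

# Made-up raw data: one entry per invoice.
branch = ["A", "A", "B", "B", "B", "A", "B", "A"]
status = ["late", "on time", "on time", "late",
          "on time", "on time", "on time", "late"]

row_levels, col_levels, table = contingency_table(branch, status)
print(row_levels, col_levels, table)
```

Each inner list of `table` is one row of the contingency table (one branch), and each position within it is one column (one status).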

The sample should be selected randomly

For each level of X, collect a random sample of items that is representative of the process. The levels of the X variable may represent different processes or locations. For example, if you have several branch offices that process invoices, you should collect a sample of invoices from each branch.

Random samples are used to make generalizations, or inferences, about a population. If your data are not collected randomly, your results may not be valid.

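Each observation should be independent of all other observations

Independence of the observations is a critical assumption for the chi-square test of association. If the observations are not independent, for example because the same item is counted more than once, your results may not be valid.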

Data must be categorical

Categorical variables contain a finite, countable number of categories or distinct groups. The categories might not have a logical order. Examples of categorical variables include gender, material type, and payment method.

All the data must be categorized into mutually exclusive categories, with no overlap

The chi-square test of association cannot be performed when categories of the variables overlap. Thus, each observation must be categorized into one and only one category.

The expected counts must not be too small

Each sample should be large enough so that there is a reasonable chance of observing outcomes in every category. If the expected counts are too low, the p-value for the test may not be accurate. Minitab indicates whether the expected counts are too low and how large each sample should be to ensure the validity of the test.
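To make the idea of expected counts concrete, the following sketch computes them for a contingency table under the hypothesis of no association, where the expected count for a cell is (row total × column total) / grand total. The function names and the sample table are hypothetical, and the threshold of 5 is a commonly cited rule of thumb used here as an assumption; it is not necessarily the criterion that Minitab applies.

```python
def expected_counts(table):
    """Expected cell counts if the row and column variables are unrelated."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def low_expected_cells(table, threshold=5.0):
    """Return (row, column, expected) for cells below the threshold.

    threshold=5.0 is an assumed rule of thumb, not a documented setting.
    """
    expected = expected_counts(table)
    return [(i, j, e)
            for i, row in enumerate(expected)
            for j, e in enumerate(row)
            if e < threshold]

# Made-up observed counts: 3 levels of X (rows) by 2 outcomes (columns).
observed = [[14, 6],
            [2, 3],
            [14, 11]]

expected = expected_counts(observed)
low = low_expected_cells(observed)
print(expected)
print(low)
```

In this made-up table the second row has only 5 observations in total, so both of its expected counts fall below the assumed threshold.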

If the expected count for a category is too low, you may be able to combine that category with adjacent categories to achieve the minimum expected count. You should combine categories only when necessary because you lose information when you combine categories.
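Combining categories can be sketched as adding the counts of two adjacent rows of the contingency table and then rechecking the smallest expected count. The helper names and the data below are hypothetical, and the sketch assumes that the rows are listed in an order where neighboring categories can sensibly be merged.

```python
def merge_adjacent_rows(table, i):
    """Return a new table in which rows i and i + 1 are combined."""
    if not 0 <= i < len(table) - 1:
        raise IndexError("row i must have a following row to merge with")
    combined = [a + b for a, b in zip(table[i], table[i + 1])]
    # Build a new list so the original table is left unchanged.
    return table[:i] + [combined] + table[i + 2:]

def min_expected(table):
    """Smallest expected count under the hypothesis of no association."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return min(r * c / n for r in row_totals for c in col_totals)

# Made-up observed counts; the middle category is sparse.
observed = [[14, 6],
            [2, 3],
            [14, 11]]

merged = merge_adjacent_rows(observed, 1)  # combine the last two rows
print(min_expected(observed), min_expected(merged))
```

After the merge the table has two rows instead of three, which illustrates the information that is lost: the test can no longer distinguish between the two combined categories.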