Data considerations for Correlation

To ensure that your results are valid, consider the following guidelines when you collect data, perform the analysis, and interpret your results.

The data must include at least 2 columns of numeric or date/time data
All columns must have the same number of rows.
The data should be continuous or ordinal
If you have categorical data, use Cross Tabulation and Chi-Square instead to examine the association between variables.
The sample size should be medium to large, n ≥ 25
Although there are no formal guidelines for the amount of data needed for a correlation, larger samples more clearly indicate patterns in the data and provide more precise estimates.
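To illustrate how precision improves with sample size, the following minimal sketch computes an approximate 95% confidence interval for a Pearson correlation at several values of n. It assumes the Fisher z-transformation, a common textbook approximation that relies on bivariate normality; it is not necessarily the method your software uses. The function name `fisher_ci` and the example value r = 0.5 are hypothetical.

```python
import math
from statistics import NormalDist


def fisher_ci(r, n, confidence=0.95):
    """Approximate CI for a Pearson correlation via the Fisher z-transformation.

    Assumption: the data are roughly bivariate normal and n > 3.
    """
    z = math.atanh(r)                 # transform r to the z scale
    se = 1 / math.sqrt(n - 3)         # approximate standard error of z
    crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Transform the interval endpoints back to the correlation scale
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


# Same observed correlation, increasing sample sizes
widths = []
for n in (10, 25, 100, 400):
    lo, hi = fisher_ci(0.5, n)
    widths.append(hi - lo)
    print(f"n={n:4d}  95% CI: ({lo:.3f}, {hi:.3f})  width={hi - lo:.3f}")
```

The interval narrows steadily as n grows, which is the sense in which larger samples provide more precise estimates.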
The relationship between variables should be linear or monotonic
If your variables do not have a linear or monotonic relationship, the results from the correlation analysis will not accurately reflect the strength of the relationship. Examine the matrix plot to look for other relationships.
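The distinction between linear, monotonic, and other relationships can be sketched with a small example. The code below is an illustrative, standard-library implementation of the Pearson and Spearman coefficients (the helper names `pearson`, `ranks`, and `spearman` are hypothetical, not part of any particular package) applied to made-up data: a cubic curve, which is monotonic but not linear, and a parabola, which is neither.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


x = list(range(-10, 11))
cubic = [v ** 3 for v in x]       # monotonic, but not linear
parabola = [v ** 2 for v in x]    # neither linear nor monotonic

print("cubic:   ", pearson(x, cubic), spearman(x, cubic))
print("parabola:", pearson(x, parabola), spearman(x, parabola))
```

For the cubic data, Spearman reports a perfect monotonic relationship while Pearson is noticeably lower (about 0.92). For the parabola, both coefficients are essentially zero even though y is completely determined by x, which is why a plot is needed to detect such relationships.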
Unusual values can have a strong effect on the results
Use the matrix plot to identify unusual values, because even a single one can substantially change the correlation coefficient. You should investigate outliers because they can provide useful information about your data or process.
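As a rough illustration of how much one unusual value can matter, the sketch below computes the Pearson correlation for a small made-up data set, then again after appending a single extreme point. The data and the helper name `pearson` are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Ten made-up observations with essentially no linear relationship
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [5, 3, 6, 4, 5, 3, 6, 4, 5, 4]
r_clean = pearson(x, y)

# The same data plus one extreme point far from the rest
r_outlier = pearson(x + [30], y + [30])

print(f"without the unusual value: r = {r_clean:.3f}")
print(f"with the unusual value:    r = {r_outlier:.3f}")
```

Here the single added point moves the coefficient from roughly zero to above 0.9, so the apparent "strong relationship" reflects one observation rather than the bulk of the data.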
The data should follow a bivariate normal distribution
The p-value procedures for both Pearson and Spearman correlations are robust to departures from normality. The p-values are usually accurate for n ≥ 25, regardless of the parent population of the sample.
The confidence intervals for the Pearson correlation are sensitive to the normality of the underlying bivariate distribution. If the data deviate from normality, the confidence intervals may be inaccurate no matter how large the sample is.
The confidence intervals for Spearman correlations are based on ranks and are less sensitive to the underlying bivariate distribution assumption.