If the population from which the data for a goodness of fit (chi-square) test were sampled violates one or more of the test's assumptions, the results of the analysis may be incorrect or misleading. For example, if the assumption of independence is violated, then the goodness of fit (chi-square) test is simply not appropriate.
If the total sample size is small, then the expected values may be too small for the approximation involved in the chi-square test to be valid.
If it is not possible to cleanly assign each observation to exactly one cell (category) of the table, or if an ad hoc scheme is used to divide a continuous variable into discrete categories, then the results of the goodness of fit chi-square test may vary greatly depending on the exact apportionment of observations into cells of the table.
If the categories are ordered instead of nominal, especially if the classification variable is actually continuous rather than discrete, then a chi-square goodness of fit test may not be the most powerful test available, and this could mean the difference between detecting a true difference or not. Generally speaking, if you are testing against a well-known distribution like the normal distribution, there is likely to be a more powerful test tailored to that specific distribution, one that may not require you to completely specify the distribution function beforehand.
Often, the effect of an assumption violation on the test result depends on the extent of the violation.
Potential assumption violations include:
- Lack of independence: observations that are not independent of each other
- Outliers: anomalous observations
- Structural zeroes: table cells that must be empty
- Special problems with small expected cell frequencies for the chi-square test
- Special problems with continuous variables
- Using the observed data to calculate the expected frequencies
- Lack of independence:
- Whether the observations are independent of each other is generally determined by the structure of the experiment from which they arise. A lack of independence within a sample is often caused by the existence of an implicit factor in the data. For example, values collected over time may be serially correlated (here time is the implicit factor). If the data are in a particular order, consider the possibility of dependence. (If the row order of the data reflect the order in which the data were collected, an index plot of the data [data value plotted against row number] can reveal patterns in the plot that could suggest possible time effects.) An implicit factor may also separate the data into different distributions of the same "family" (say, several different normal distributions). Each subsample would follow a distribution from the family, but the combined data would not fit a distribution from the family. For example, measurements for females may follow a normal distribution, and measurements for males may also follow a normal distribution, but the measurements for the entire population of both males and females may not follow a normal distribution. Depending on the relative proportions of sampled data from each underlying normal distribution, and on the means and variances of each distribution, the composite mixture distribution may appear to be skewed, or to have nonnormal kurtosis, or both. Separating the data into different subsamples based on the value of the implicit factor may reveal that, conditional on the value of the implicit factor (e.g., gender), the data are sampled from a normal distribution, even if it is a different distribution for each value of the implicit factor. Of course, an implicit factor may also separate the data into different distributions that do not all come from the same family. 
And if one or more of the subsamples has a small sample size, the test on that subsample may fail to detect a difference from the hypothesized distribution due to a lack of power.
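As a numeric companion to the index plot mentioned above, the lag-1 autocorrelation of the values in collection order gives a rough indication of serial dependence. The following is a minimal sketch, not part of the original text: the function name and the two example series are hypothetical, and a value far from zero is only a hint of dependence, not a formal test.

```python
def lag1_autocorrelation(values):
    """Sample lag-1 autocorrelation of values taken in collection order."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(values) / n
    # Sum of squared deviations from the mean (the denominator).
    denom = sum((v - mean) ** 2 for v in values)
    if denom == 0:
        raise ValueError("values are constant")
    # Sum of products of neighbouring deviations (the numerator).
    numer = sum(
        (values[i] - mean) * (values[i + 1] - mean) for i in range(n - 1)
    )
    return numer / denom


# Made-up illustrative series, assumed to be listed in collection order.
trending = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

print(lag1_autocorrelation(trending))     # positive: neighbours move together
print(lag1_autocorrelation(alternating))  # negative: neighbours alternate
```

Both example series are strongly dependent on their order, which is the kind of pattern an index plot would also reveal.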
- Outliers:
- The chi-square statistic may be large due to the presence of outliers. Outliers are anomalous values in the data. They may be due to recording errors, which may be correctable, or they may be due to the sample not being entirely from the same population. If you find outliers in your data that are not due to correctable errors, you may wish to consult a statistician as to how to proceed.
- Structural zeroes:
- As long as the probability of falling into category i is non-zero, the expected value for that cell of the table will be greater than 0. If the total sample size is small, or if there are many cells in the table, then it may happen that no observations are recorded for a particular cell. These zero values in a table are sampling zeroes. However, the actual process that creates the observations may produce cells in the table in which observations can never occur. The zero values that must occur in these cells are structural zeroes. The goodness of fit chi-square test is not designed for tables with structural zeroes. If you find structural zeroes in your data, you may wish to consult a statistician as to how to proceed.
- Special problems with small expected cell frequencies for the chi-square test:
- The chi-square test involves using the chi-square distribution to approximate the underlying exact distribution. The approximation becomes better as the expected cell frequencies grow larger, and may be inappropriate for tables with very small expected cell frequencies. For tables with expected cell frequencies less than 5, the chi-square approximation may not be reliable. A standard (and conservative) rule of thumb (due to Cochran) is to avoid using the chi-square test for tables with expected cell frequencies less than 1, or when more than 20% of the table cells have expected cell frequencies less than 5. Another rule of thumb (due to Roscoe and Byars) is that the average expected cell frequency should be at least 1 when the expected cell frequencies are close to equal, and 2 when they are not. (If the chosen significance level is 0.01 instead of 0.05, then double these numbers.) Koehler and Larntz suggest that if the total number of observations is at least 10, the number of categories is at least 3, and the square of the total number of observations is at least 10 times the number of categories, then the chi-square approximation should be reasonable. Care should be taken when cell categories are combined (collapsed together) to fix problems of small expected cell frequencies. Collapsing can destroy evidence of lack of fit, so a failure to reject the null hypothesis for the collapsed table does not rule out the possibility of a lack of fit in the original table. As with most statistical tests, the power of the chi-square test increases with a larger number of observations. If there are too few observations, it may be impossible to reject the null hypothesis even if it is false.
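The three rules of thumb above can be checked mechanically once the expected cell frequencies are known. The sketch below is one possible reading of those rules, not a definitive implementation: the function name, the `near_equal` and `at_one_percent` flags, and the example frequencies are all hypothetical, and deciding whether the expected frequencies are "close to equal" is left to the caller.

```python
def check_expected_frequencies(expected, near_equal=False, at_one_percent=False):
    """Evaluate three rules of thumb on a list of expected cell frequencies.

    near_equal:     caller's judgement that the expected frequencies are
                    close to equal (assumption: not decided automatically).
    at_one_percent: True if the test is run at the 0.01 level instead of 0.05.
    """
    k = len(expected)   # number of categories
    n = sum(expected)   # total number of observations

    # Cochran: no expected frequency below 1, and no more than 20% below 5.
    share_below_5 = sum(1 for e in expected if e < 5) / k
    cochran = min(expected) >= 1 and share_below_5 <= 0.20

    # Roscoe and Byars: average expected frequency at least 1 (near-equal
    # cells) or 2 (otherwise), doubled when testing at the 0.01 level.
    minimum_average = 1.0 if near_equal else 2.0
    if at_one_percent:
        minimum_average *= 2
    roscoe_byars = n / k >= minimum_average

    # Koehler and Larntz: n >= 10, k >= 3, and n squared >= 10 * k.
    koehler_larntz = n >= 10 and k >= 3 and n ** 2 >= 10 * k

    return {
        "cochran": cochran,
        "roscoe_byars": roscoe_byars,
        "koehler_larntz": koehler_larntz,
    }


# Made-up expected frequencies for a five-category table (n = 30).
result = check_expected_frequencies([12.0, 9.0, 6.0, 2.0, 1.0])
print(result)
```

In this example two of the five cells (40%) have expected frequencies below 5, so the table fails Cochran's rule even though it passes the other two, which shows that the rules can disagree.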
- Special problems with continuous variables:
- The goodness of fit chi-square test is specifically designed for observations classified into nominal categories. If the original data variable is actually continuous, then the variable must be divided into intervals to construct the table. The interval boundaries should be decided beforehand on the basis of theory or custom. If the intervals are determined by the particular data being analyzed, then the test statistic and corresponding P value may not be generalizable. Ideally, the categories should be chosen so that the expected cell frequencies are as equal to each other as possible. With equal expected cell frequencies, the chi-square test is unbiased, and the chi-square distribution is a closer approximation to the actual distribution of the calculated chi-square statistic. A rough rule of thumb, due to Mann and Wald, suggests that squaring the total number of values, taking the fifth root, and then doubling that, gives a reasonable number of categories to use, when the expected cell frequencies are equal. The chi-square test ignores any possible ordering of the variable categories. If the variable is continuous, then an alternative test to the chi-square may be preferable.
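The Mann and Wald rule of thumb and the idea of equal expected cell frequencies can be illustrated as follows. This is a sketch under stated assumptions: the function names are hypothetical, rounding the category count to the nearest integer is a choice made here rather than part of the rule, and the hypothesized distribution is taken to be a normal distribution that was fully specified beforehand (a standard normal in the example).

```python
from statistics import NormalDist


def mann_wald_category_count(n):
    """Rule-of-thumb number of categories: 2 * (n ** 2) ** (1 / 5).

    Rounding to the nearest integer is an assumption of this sketch.
    """
    return round(2 * (n ** 2) ** (1 / 5))


def equal_probability_boundaries(dist, k):
    """Interior cut points that give k cells of equal probability under dist.

    With these boundaries every cell has expected frequency n / k.
    """
    return [dist.inv_cdf(i / k) for i in range(1, k)]


n = 100                                   # hypothetical sample size
k = mann_wald_category_count(n)           # suggested number of categories
hypothesized = NormalDist(mu=0.0, sigma=1.0)  # specified before seeing data
boundaries = equal_probability_boundaries(hypothesized, k)

print(k, n / k)      # number of cells and expected frequency per cell
print(boundaries)    # k - 1 interior interval boundaries
```

Because the boundaries come from the hypothesized distribution and not from the observed values, they are fixed before the data are examined, in line with the advice above.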
- Using the observed data to calculate the expected frequencies:
- The goodness of fit chi-square test assumes that the expected frequencies have been calculated without reference to the observed data. For example, if we are testing whether the observed data come from a normal distribution, then we specify beforehand what the mean and variance of that normal distribution are, and use those values to calculate the expected frequencies. If you use the observed data to calculate the expected frequencies, say using the observed data to find the mean and variance and then using those estimates to calculate the expected frequencies, then the goodness of fit chi-square test is not valid because the hypothesized distribution has already been adapted to the data to be tested. This makes the test less likely to reject the null hypothesis, even when it is false. In some cases where parameters for the hypothesized distribution function are estimated from the observed data, the chi-square test may be adjusted by subtracting 1 degree of freedom for every parameter estimated. However, the parameters must be estimated from the data in a certain way. Conover discusses this adjustment.
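The degrees-of-freedom adjustment described above can be sketched as follows. The function names and the observed and expected counts are hypothetical, and the adjusted value is only meaningful when the parameters were estimated in the way the adjustment requires, which this sketch does not check.

```python
def chi_square_statistic(observed, expected):
    """Pearson chi-square statistic: sum of (O - E)^2 / E over all cells."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def degrees_of_freedom(n_categories, n_estimated_parameters=0):
    """Categories minus 1, minus 1 more for every parameter estimated
    from the observed data."""
    df = n_categories - 1 - n_estimated_parameters
    if df < 1:
        raise ValueError("too few categories for the parameters estimated")
    return df


# Made-up counts for four categories with equal expected frequencies.
observed = [18, 22, 29, 31]
expected = [25.0, 25.0, 25.0, 25.0]

statistic = chi_square_statistic(observed, expected)
df_specified = degrees_of_freedom(len(observed))     # nothing estimated
df_estimated = degrees_of_freedom(len(observed), 2)  # mean and variance estimated

print(statistic, df_specified, df_estimated)
```

The same statistic is referred to a chi-square distribution with fewer degrees of freedom when parameters are estimated, so a given value of the statistic corresponds to a smaller P value than it would under the unadjusted degrees of freedom.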