If the populations from which data for a goodness of fit (chi-square) test were sampled violate one or more of the chi-square test assumptions, the results of the analysis may be incorrect or misleading. If there are factors unaccounted for in the analysis, then the chi-square test may not give useful results. In such cases, stratification may provide a better analysis. Alternatively, you can use a test specifically tailored to the family of your hypothesized distribution, such as tests for normality.
Data from a continuous distribution might better match a specific hypothesized distribution after transformation.
All of these alternatives require that you have access to the original individual data values.
- Stratification involves dividing a sample into subsamples based on one or more characteristics of the population. For example, a sample may be stratified by gender. If the distribution function is different for the different strata, then the characteristic used for stratification may be an implicit factor, and a separate analysis for each individual subsample may be more informative than an analysis of the entire sample. A potential drawback of stratification is that one or more of the subsamples may be small, which makes the test results less reliable. Also, the results for each subsample are generalizable to only a part of the sampled population.
- Testing against specific distributions:
- The goodness of fit chi-square test is extremely versatile: if you can determine what the expected frequencies should be to correspond with the observed frequencies, then you can calculate the test. However, because the test is so general, it is usually not the most powerful test available for a specific distribution, particularly if the distribution is continuous. With a continuous distribution, there is the added problem of deciding how to divide the data into discrete categories before applying the test. One alternative to using the chi-square test is to choose a test specifically tailored to the distribution of interest. The Kolmogorov-Smirnov test is commonly used to test whether the population distribution follows a specified continuous distribution, such as the uniform or normal. When the hypothesized distribution is a normal distribution, there are a number of tests for normality available. Some of these tests, such as the Shapiro-Wilk test, have the added advantage that you need not specify the mean and variance of the hypothesized normal distribution beforehand. In general, if there is a test available that is tailored to your hypothesized distribution, you should prefer that to using the chi-square goodness of fit test.
- A transformation of the data may create a data set that more closely approximates one from the hypothesized distribution. Or theory may suggest that transformed data should follow a hypothesized distribution that is easier to work with (say, for calculating the expected frequencies) than the hypothesized distribution for the original data. Transformations (a single function applied to each data value) are often applied to correct problems of skewness or heavy tails. For example, taking logarithms of sample values can reduce skewness to the right. Unless scientific theory suggests a specific transformation a priori, transformations are usually chosen from the "power family" of transformations, where each value is replaced by x**p (with p = 0 denoting the log transformation by convention), where p is an integer or half-integer, usually one of:
- -2 (reciprocal square)
- -1 (reciprocal)
- -0.5 (reciprocal square root)
- 0 (log transformation)
- 0.5 (square root)
- 1 (leaving the data untransformed)
- 2 (square)
For p = -0.5 (reciprocal square root), 0, or 0.5 (square root), the data values must all be positive. To use these transformations when there are negative and positive values, a constant can be added to all the data values such that the smallest is greater than 0 (say, such that the smallest value is 1). (If all the data values are negative, the data can instead be multiplied by -1, but note that in this situation, data suggesting skewness to the right would now become data suggesting skewness to the left.) To preserve the order of the original data in the transformed data, if the value of p is negative, the transformed data are multiplied by -1.0; e.g., for p = -1, the data are transformed as x --> -1.0/x. Taking logs or square roots tends to "pull in" values greater than 1 relative to values less than 1, which is useful in correcting skewness to the right. Transformation involves changing the metric in which the data are analyzed, which may make interpretation of the results difficult if the transformation is complicated. If you are unfamiliar with transformations, you may wish to consult a statistician before proceeding.
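To illustrate the stratification alternative described above, here is a minimal Python sketch that computes the chi-square goodness of fit statistic for a pooled sample and for each stratum separately. The counts, the stratum names, and the uniform hypothesized distribution are all hypothetical, chosen only to show how pooling can mask a difference between strata. The p-value helper relies on the fact that a chi-square distribution with 2 degrees of freedom is exponential with mean 2, so it applies only to the three-category case used here.

```python
import math


def chi_square_stat(observed, expected):
    """Pearson goodness-of-fit statistic: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def p_value_df2(stat):
    """Upper-tail p-value, valid ONLY for 2 degrees of freedom
    (three categories, no parameters estimated from the data)."""
    return math.exp(-stat / 2.0)


def uniform_expected(observed):
    """Expected counts if every category is equally likely (an assumption)."""
    total = sum(observed)
    return [total / len(observed)] * len(observed)


# Hypothetical counts over three categories, one list per stratum.
strata = {"stratum_a": [25, 15, 10], "stratum_b": [10, 15, 25]}

# Pooling adds the counts category by category.
pooled = [sum(counts) for counts in zip(*strata.values())]

results = {"pooled": chi_square_stat(pooled, uniform_expected(pooled))}
for name, counts in strata.items():
    results[name] = chi_square_stat(counts, uniform_expected(counts))

for name, stat in results.items():
    print(f"{name}: statistic = {stat:.3f}, p = {p_value_df2(stat):.4f}")
```

In this made-up example the pooled counts look close to uniform, while each stratum on its own departs from uniformity in the opposite direction. Each stratum also has only half the pooled sample size, which is the reliability drawback noted above.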
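For the tailored-test alternative described above, the following sketch computes the one-sample Kolmogorov-Smirnov statistic, the largest gap between the empirical distribution function and a fully specified hypothesized CDF. It works on the individual data values, so no binning into categories is needed. The sample values and the helper names (`ks_statistic`, `uniform_cdf`) are illustrative assumptions, and the sketch stops at the statistic without computing a p-value.

```python
from statistics import NormalDist


def ks_statistic(data, cdf):
    """One-sample Kolmogorov-Smirnov statistic D.

    `cdf` must be a fully specified continuous CDF (no parameters
    estimated from `data`).
    """
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i - 1)/n to i/n at x,
        # so check the gap on both sides of the jump.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d


def uniform_cdf(x):
    """CDF of the uniform distribution on [0, 1]."""
    return min(max(x, 0.0), 1.0)


# Hypothetical sample, far too small for a real analysis.
sample = [0.1, 0.4, 0.7]

d_uniform = ks_statistic(sample, uniform_cdf)
d_normal = ks_statistic(sample, NormalDist(mu=0.0, sigma=1.0).cdf)

print(f"D against uniform(0, 1): {d_uniform:.4f}")
print(f"D against normal(0, 1):  {d_normal:.4f}")
```

In practice you would more likely call a statistics library than code this by hand; SciPy, for example, provides `scipy.stats.kstest` and `scipy.stats.shapiro`, which also return p-values.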
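The power-family transformations described above can be sketched in Python as follows. The function names are hypothetical, and the sketch makes one simplifying assumption that is stricter than the text: it requires strictly positive data for every power except 1 and 2. The `skewness` helper uses the simple moment-based coefficient, only to show the direction of the change after a log transformation.

```python
import math


def shift_to_positive(values, target_min=1.0):
    """Add a constant so that the smallest value equals `target_min`."""
    offset = target_min - min(values)
    return [v + offset for v in values]


def power_transform(values, p):
    """Apply x -> x**p, with p = 0 meaning log(x).

    For negative p the result is multiplied by -1 so that the
    transformed data keep the order of the original data.
    """
    if p not in (1, 2) and any(v <= 0 for v in values):
        raise ValueError("values must be positive; shift the data first")
    if p == 0:
        return [math.log(v) for v in values]
    sign = -1.0 if p < 0 else 1.0
    return [sign * v ** p for v in values]


def skewness(values):
    """Moment-based skewness coefficient m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed sample.
raw = [1, 2, 3, 4, 50]
logged = power_transform(raw, 0)
print(f"skewness before: {skewness(raw):.3f}, after log: {skewness(logged):.3f}")

# Reciprocal with sign flip keeps the original ordering.
print(power_transform([1, 2, 4], -1))

# Data with negative values are shifted before taking square roots.
print(power_transform(shift_to_positive([-3, 0, 2]), 0.5))
```

Raising an error instead of shifting silently is a deliberate choice here: adding a constant changes the metric of the data, so it seems safer to make that step explicit.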