Contingency table analysis



For cross-classified data, the Pearson chi-square test for independence and Fisher's exact test can be used to test the null hypothesis that the row and column classification variables of the data's two-way contingency table are independent.
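As a rough illustration of the Pearson test described above, the statistic for an R x C table can be computed from the observed counts and the counts expected under independence. The sketch below is a minimal pure-Python version; the function name `pearson_chi_square` and the example counts are hypothetical, and it assumes every row and column total is nonzero.

```python
def pearson_chi_square(table):
    """Return (statistic, degrees_of_freedom, expected) for an R x C table of counts.

    Assumes every row total and column total is greater than zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    # Expected count under independence: (row total * column total) / N
    expected = [[r * c / n for c in col_totals] for r in row_totals]

    # Sum of (observed - expected)^2 / expected over all cells
    statistic = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df, expected


# Hypothetical 2 x 3 table of counts, for illustration only
observed = [[10, 20, 30],
            [20, 20, 20]]
stat, df, expected = pearson_chi_square(observed)
print(stat, df)
print(expected)
```

The statistic is then referred to a chi-square distribution with (R - 1)(C - 1) degrees of freedom. Returning the expected counts alongside the statistic makes it easy to check whether any of them are very small.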


Assumptions:

  • The exact assumptions and null hypothesis for the Pearson chi-square test for independence depend on the sampling scheme used, although the calculated statistic is the same in each case. There are three possible sampling schemes for the values in a contingency table with R rows and C columns:
  • Sampling Scheme 1: The total number of data values in the contingency table (N) is fixed, but none of the row or column totals are fixed.
    This sampling scheme is known as cross-sectional, naturalistic, or multinomial sampling. In this case, the assumptions are:
      • The data observations are made on a random sample of N objects, cross-classified according to two attributes, the row variable and the column variable.
      • The sampled values are independent.
      • Each object is classified into one and only one category of the row variable, and into one and only one category of the column variable.
    And the null hypothesis is:
      • The event of an observation being in a particular row is independent of that same observation being in a particular column.
  • Sampling Scheme 2: The total number of data values in the contingency table (N) is fixed, and either the row marginal totals or the column marginal totals are fixed.
    If one of the attributes is viewed as an outcome variable and the other as an explanatory variable (e.g., if one variable is the occupation of the parent and the other is the occupation of the child), then the study is retrospective (a case-control study) if the marginal totals are fixed for the outcome variable, and prospective if the marginal totals are fixed for the explanatory variable. If the R row marginal totals are fixed such that row i has n[i] observations in it, the assumptions are:
      • The data observations are made on R random samples, with n[i] values in the ith sample.
      • Sample i is taken from objects that have the ith value of the row attribute.
      • Within each sample, the values are independent.
      • The R samples are independent.
      • Each object is classified into one and only one category of the column variable.
    And the null hypothesis is:
      • For any given column, the probability that an observation falls in that column is the same for every row.
  • Sampling Scheme 3: The total number of data values in the contingency table (N) is fixed, and both the row marginal totals and the column marginal totals are fixed.
    This is also the sampling scheme assumed by Fisher's exact test. If the row marginal totals and the column marginal totals are fixed, the assumptions are:
      • Each object is classified into one and only one category of the row variable, and into one and only one category of the column variable.
      • The N observations come from a random sample such that each observation has the same probability of being classified into the ith row and the jth column as any other observation.
    And the null hypothesis is:
      • The event of an observation being in a particular row is independent of that same observation being in a particular column.
  • The Pearson chi-square test uses the chi-square distribution to approximate the underlying exact distribution. Although the chi-square approximation can be used in all three sampling schemes, the approximation becomes poorer when marginal totals are fixed. The best approximation will most likely be in the first (multinomial) sampling scheme. The approximation becomes better as the expected cell frequencies grow larger, and may be inappropriate for contingency tables with very small expected cell frequencies. In the case of a 2x2 contingency table, an adjusted value of the chi-square statistic (the Yates corrected chi-square) is often used to correct for a continuous distribution (chi-square) being used to approximate the very discrete distribution of the values in the 2x2 table. The purpose of the correction is to produce P values that are closer to those that would be calculated by the exact (Fisher) test.
  • The Pearson, likelihood-ratio (deviance), and randomization chi-square statistics all have the same asymptotic chi-square distribution (as the total sample size gets large). The Pearson chi-square test is always more conservative than the randomization chi-square test, and tends to be more conservative than the likelihood-ratio chi-square test. For a 2x2 table, the Pearson chi-square test tends to be more conservative than the exact (Fisher's) test, and the likelihood-ratio chi-square tends to be less conservative than the exact test (and thus more likely to erroneously reject the null hypothesis).
  • Fisher's exact test assumes that the total number of data values in the 2x2 contingency table (N) is fixed, and both the row marginal totals and the column marginal totals are fixed.
    If the 2 row marginal totals are fixed and the 2 column marginal totals are fixed, the assumptions for Fisher's exact test are:
      • Each object is classified into one and only one category of the row variable, and into one and only one category of the column variable.
      • The N observations come from a random sample such that each observation has the same probability of being classified into the ith row and the jth column as any other observation.
    And the null hypothesis is:
      • The event of an observation being in a particular row is independent of that same observation being in a particular column.
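With both sets of margins fixed, the count in one cell of a 2x2 table determines the other three, and under the null hypothesis that count follows a hypergeometric distribution. The sketch below illustrates one way to compute a two-sided Fisher P value and to compare it with the Pearson chi-square P value, with and without the Yates correction. The function names and the example counts are hypothetical. The two-sided rule used here (summing the probabilities of all tables no more likely than the observed one) is one common convention, assumed for illustration rather than taken from the text above.

```python
from math import comb, erfc, sqrt


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of count x in cell (1, 1), all margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are no more likely than the observed table;
    # the small tolerance guards against floating-point ties.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))


def chi_square_2x2(a, b, c, d, yates=True):
    """Pearson chi-square statistic and P value (1 df), optionally Yates corrected."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        diff = max(diff - n / 2, 0.0)
    stat = n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    return stat, erfc(sqrt(stat / 2))


# Hypothetical small table, for illustration only
p_fisher = fisher_exact_two_sided(3, 1, 1, 3)
stat_yates, p_yates = chi_square_2x2(3, 1, 1, 3, yates=True)
stat_plain, p_plain = chi_square_2x2(3, 1, 1, 3, yates=False)
print(p_fisher, p_yates, p_plain)
```

For this small table the exact P value is 34/70 (about 0.486). The Yates corrected P value is about 0.480 and the uncorrected Pearson P value is about 0.157, so the correction brings the chi-square result much closer to the exact test.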

  • Among measures of association for two-way contingency tables, Kendall's Tau B, Tau C, Spearman's rho, and Gamma assume that both the row and column variables have ordered categories (such as disease severity categories).
  • Cross-classification schemes for two-way contingency tables work best when the categories for both variables are discrete (e.g., gender). When a continuous variable such as age is divided into intervals to form the categories of a variable, the interval boundaries should be decided beforehand on the basis of theory or custom. The intervals should not be determined by the particular data being analyzed.
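For the ordinal measures of association mentioned above, the following sketch shows how Gamma can be computed from concordant and discordant pairs. It assumes the rows and columns of the table are already arranged in category order. The function name `goodman_kruskal_gamma` and the example counts are hypothetical.

```python
def goodman_kruskal_gamma(table):
    """Gamma = (C - D) / (C + D), where C and D count concordant and discordant pairs.

    Assumes rows and columns are listed in their natural category order and
    that at least one concordant or discordant pair exists.
    """
    n_rows, n_cols = len(table), len(table[0])
    concordant = discordant = 0
    for i in range(n_rows):
        for j in range(n_cols):
            # Compare cell (i, j) with every cell in a later row
            for k in range(i + 1, n_rows):
                for m in range(n_cols):
                    if m > j:
                        # Higher on both variables: concordant pairs
                        concordant += table[i][j] * table[k][m]
                    elif m < j:
                        # Higher on one variable, lower on the other: discordant pairs
                        discordant += table[i][j] * table[k][m]
                    # Pairs tied on the column variable (m == j) are ignored
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical 3 x 3 table with ordered categories, for illustration only
ordered_table = [[20, 10, 5],
                 [10, 15, 10],
                 [5, 10, 20]]
gamma = goodman_kruskal_gamma(ordered_table)
print(gamma)
```

Because Gamma depends on the ordering of the categories, reordering the rows or columns changes its value. This is why it applies only when both variables have ordered categories.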

Guidance:

  • Ways to detect before performing a contingency table analysis whether your data violate any assumptions.
  • Ways to examine contingency table analysis results to detect assumption violations.
  • Possible alternatives if your data or contingency table analysis results indicate assumption violations.

To properly analyze and interpret results of the contingency table analysis, you should be familiar with the following terms and concepts:

Failure to understand and properly apply contingency table analysis may result in drawing erroneous conclusions from your data. If you are not familiar with these terms and concepts, you may wish to consult with a statistician. You may also want to consult the following references:

  • Agresti, A. 1990. Categorical Data Analysis. New York: John Wiley & Sons.
  • Agresti, A. 1996. An Introduction to Categorical Data Analysis. New York: John Wiley & Sons.
  • Bishop, Y. M. M., Fienberg, S. E., and Holland, P. W. 1975. Discrete Multivariate Analysis. Cambridge, MA: MIT Press.
  • Brownlee, K. A. 1965. Statistical Theory and Methodology in Science and Engineering. New York: John Wiley & Sons.
  • Conover, W. J. 1980. Practical Nonparametric Statistics. 2nd ed. New York: John Wiley & Sons.
  • Daniel, Wayne W. 1978. Applied Nonparametric Statistics. Boston: Houghton Mifflin.
  • Daniel, Wayne W. 1995. Biostatistics. 6th ed. New York: John Wiley & Sons.
  • Everitt, B. S. 1992. The Analysis of Contingency Tables. 2nd ed. London: Chapman & Hall.
  • Koehler, K. and Larntz, K. 1980. An empirical investigation of goodness-of-fit statistics for sparse multinomials. Journal of the American Statistical Association 75: 336-344.
  • Lehmann, E. L. 1975. Nonparametrics: Statistical Methods Based on Ranks. San Francisco: Holden-Day.
  • Rosner, Bernard. 1995. Fundamentals of Biostatistics. 4th ed. Belmont, California: Duxbury Press.
  • Sokal, Robert R. and Rohlf, F. James. 1995. Biometry. 3rd ed. New York: W. H. Freeman and Co.
  • Tocher, K. D. 1950. Extension of the Neyman-Pearson theory of tests to discontinuous variates. Biometrika 37: 130-144.
  • Zar, Jerrold H. 1996. Biostatistical Analysis. 3rd ed. Upper Saddle River, NJ: Prentice-Hall.
