Logistic Regression

Logistic regression is used to fit a model to binary response data, such as whether a subject dies (event) or lives (non-event). The two possible outcomes are often described as success versus failure. For each possible set of values of the independent (X) variables, there is a probability p that a success occurs. The linear logistic model fitted by maximum likelihood is:
Y = b0 + b1*X1 + b2*X2 + ... + bk*Xk
where Y is the logit transformation of p:
Y = log(p/(1-p));
i.e., Y is the log odds corresponding to p.
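
As a minimal sketch, such a model can be fitted by maximum likelihood with the Python statsmodels package; the data and variable names below are purely illustrative:

    import numpy as np
    import statsmodels.api as sm

    # Hypothetical data: one predictor X and a binary response
    # (1 = event, 0 = non-event), simulated for illustration only.
    rng = np.random.default_rng(0)
    X = rng.normal(size=100)
    p = 1 / (1 + np.exp(-(-0.5 + 1.2 * X)))   # true probabilities used only to simulate
    y = rng.binomial(1, p)

    X_design = sm.add_constant(X)             # adds the intercept term b0
    fit = sm.Logit(y, X_design).fit()         # maximum likelihood fit
    print(fit.params)                         # estimates of b0 and b1, on the log-odds scale

The estimated coefficients are on the log odds (logit) scale described above.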

If ni observations are made at the ith set of values of the X variables, then the count si of successes can be used to calculate Y by using the observed proportion of successes (si/ni) in place of p.
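
For grouped data like this, the empirical log odds can be computed directly; a small sketch (the counts below are hypothetical):

    import numpy as np

    # Hypothetical grouped data: n[i] trials and s[i] successes at the
    # i-th setting of the X variables.
    n = np.array([50, 50, 50, 50])
    s = np.array([4, 13, 30, 46])

    p_hat = s / n                                  # observed proportions of successes
    empirical_logit = np.log(p_hat / (1 - p_hat))  # observed log odds (the Y of the text)
    print(empirical_logit)

(Groups with 0 or ni successes give infinite empirical logits and need an adjustment, or must be handled by the maximum likelihood fit itself.)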


Assumptions:

  • The linear function Yi = b0 + b1*X1i + b2*X2i + ... + bk*Xki + ei is the correct model, where Yi is the ith observed value of Y, Xji is the ith observed value of the jth X variable, and ei is the error term. Equivalently, the expected value of Y for a given set of values for the X variables is b0 + b1*X1 + b2*X2 + ... + bk*Xk. The intercept is b0, the expected value of Y when the value for each X variable is 0.
  • The logit transform of p to Y is the correct transformation to achieve the linear function for Y. (With the previous assumption, this amounts to assuming that the linear logistic model is the correct model.)
  • The Xj variable (predictor variable) values are fixed (i.e., none of the Xj is a random variable).
  • The counts of successes si are independent.
  • The response of each subject (success or failure) follows a Bernoulli distribution, independent of the other responses. This means that the si are distributed as binomial random variables with mean ni*pi (see the short simulation sketch after this list). For a given set of X variable values, the count s, and thus Y, has constant mean.
  • The ei all have mean 0. (However, residuals analysis is done comparing observed and fitted values of s, not Y, so the ei are of little interest.)
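
As a concrete illustration of the binomial assumption above, here is a short simulation sketch (all values hypothetical):

    import numpy as np

    # At each of four settings of the X variables, the count of successes
    # s[i] is assumed to follow a Binomial(n[i], p[i]) distribution,
    # which has mean n[i]*p[i].
    rng = np.random.default_rng(1)
    n = np.array([50, 50, 50, 50])
    p = np.array([0.1, 0.3, 0.6, 0.9])
    s = rng.binomial(n, p)
    print(s)        # simulated counts
    print(n * p)    # their expected values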

Many of the hypothesis tests rely on large sample sizes, for which the maximum likelihood estimators will be approximately normally distributed, and will have at worst only small biases.

The X variables are also known as the independent variables.
The fitted Y variable is also known as the linear predictor.

If a discrete qualitative variable is included as a predictor, it is encoded by dummy X variables. A discrete variable with n different values will be encoded by n-1 dummy X variables.
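
For instance, a minimal sketch of this dummy coding using the pandas package (the three-level variable "treatment" is hypothetical):

    import pandas as pd

    # A discrete variable with 3 values is encoded by 3 - 1 = 2 dummy columns;
    # the dropped level ("A" here) serves as the reference category.
    df = pd.DataFrame({"treatment": ["A", "B", "C", "B", "A"]})
    dummies = pd.get_dummies(df["treatment"], prefix="treatment", drop_first=True)
    print(dummies)   # columns: treatment_B, treatment_C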

The coefficients are bj, the amount by which the expected value of Y (the log odds) increases when Xj increases by a unit amount, when all the other X variables are held constant. This means that the estimated odds p/(1-p) are multiplied by exp(bj) when Xj increases by a unit amount. This interpretation of the coefficients does not hold if some of the X variables are functions of the others, such as an interaction term Xj*Xk.
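
As a small worked example (the coefficient value is hypothetical): if bj = 0.7, then exp(0.7) is about 2.01, so a unit increase in Xj roughly doubles the estimated odds of success.

    import math

    b_j = 0.7                      # hypothetical fitted coefficient for Xj
    odds_ratio = math.exp(b_j)     # multiplicative change in the odds p/(1-p)
    print(round(odds_ratio, 2))    # 2.01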

Note that it is not assumed that the X variables are independent of each other.

The fitted value of p can be calculated from the linear predictor Y by using the formula
p = exp(Y)/(1 + exp(Y)).
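
A minimal sketch of this inverse (logistic) transformation in Python:

    import numpy as np

    def inverse_logit(y):
        # Converts the linear predictor Y (log odds) back to a probability p.
        return np.exp(y) / (1 + np.exp(y))

    print(inverse_logit(0.0))    # 0.5: a log odds of 0 corresponds to p = 0.5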

Notation: Some references use the term Y to refer to the counts s, and use other notation for the linear predictor.


Guidance:

  • Ways to detect, before performing the logistic regression, whether your data violate any assumptions.
  • Ways to examine logistic regression results to detect assumption violations.
  • Possible alternatives if your data or logistic regression results indicate assumption violations.

To properly analyze and interpret the results of logistic regression, you should be familiar with the terms and concepts involved.

If you are not familiar with these terms and concepts, you are advised to consult with a statistician. Failure to understand and properly apply logistic regression may result in drawing erroneous conclusions from your data. Additionally, you may want to consult the following references:

  • Agresti, A. 1990. Categorical Data Analysis. New York: John Wiley & Sons.
  • Agresti, A. 1996. An Introduction to Categorical Data Analysis. New York: John Wiley & Sons.
  • Aldrich, J.H. and Nelson, F.D. 1984. Linear Probability, Logit, and Probit Models. Newbury Park, California: Sage Publications.
  • Collett, D. 1991. Modelling Binary Data. London: Chapman and Hall.
  • Cox, D.R. and Snell, E.J. 1989. The Analysis of Binary Data. 2nd ed. New York: John Wiley & Sons.
  • Demaris, A. 1992. Logit Modeling. Newbury Park, California: Sage Publications.
  • Everitt, B.S. 1992. The Analysis of Contingency Tables. 2nd ed. London: Chapman and Hall.
  • Hosmer, D.W. and Lemeshow, S. 1989. Applied Logistic Regression. New York: John Wiley & Sons.
  • McCullagh, P. and Nelder, J.A. 1989. Generalized Linear Models. 2nd ed. London: Chapman and Hall.
  • Menard, S. 1995. Applied Logistic Regression Analysis. Newbury Park, California: Sage Publications.
  • Neter, J., Kutner, M.H., Nachtsheim, C.J., and Wasserman, W. 1996. Applied Linear Regression Models. 3rd ed. Chicago: Irwin.
