Does your data violate life table assumptions?

If the populations from which the data for a life table were sampled violate one or more of the life table assumptions, the results of the analysis may be incorrect or misleading. For example, if the assumption of independence of censoring times is violated, then the survival estimates may be biased and unreliable. If factors that affect survival and/or censoring times are not accounted for in the analysis, then the life table may not give useful estimates of survival.

Some small violations may have little practical effect on the analysis, while other violations may render the life table results uselessly incorrect or uninterpretable. In particular, lengthy time intervals and small sample sizes may increase the effect of assumption violations. Heavy censoring may also affect the reliability of the life table estimates.

Potential assumption violations include:

  • Implicit factors:
  • Lack of independence within a sample is often caused by the existence of an implicit factor in the data. For example, if we are measuring survival times for cancer patients, diet may be correlated with survival times. If we do not collect data on the implicit factor(s) (diet in this case), and the implicit factor has an effect on survival times, then we in effect no longer have a sample from a single population, but a sample that is a mixture drawn from several populations, one for each level of the implicit factor, each with a different survival distribution. Implicit factors can also affect censoring times, by affecting the probability that a subject will be withdrawn from the study or lost to follow-up. For example, younger subjects may tend to move away (and be lost to follow-up) more frequently than older subjects, so that age (an implicit factor) is correlated with censoring. If the sample under study contains many younger people, the results of the study may be substantially biased because of the different patterns of censoring. This violates the assumption that the censored values and the noncensored values all come from the same survival distribution.
  • Lack of independence of censoring:
  • If the pattern of censoring is not independent of the survival times, then survival estimates may be too high (if subjects who are more ill tend to be withdrawn from the study), or too low (if subjects who will survive longer tend to drop out of the study and are lost to follow-up). If the loss or withdrawal of one subject tends to increase the probability of loss or withdrawal of other subjects, this is also a violation, because the censoring of one subject is no longer independent of the censoring of the others. The estimates for the survival functions and their variances rely on independence between censoring times and survival times. If independence does not hold, the estimates may be biased, and the variance estimates may be inaccurate.
  • Lack of uniformity within a time interval:
  • The life table estimates for the survival functions and for their standard errors rely on the assumptions that the probability of survival is constant within each interval (although it may change from interval to interval), and that the censored values in an interval are uniformly distributed throughout the interval. The estimates calculate the equivalent number of subjects exposed (at risk) in an interval by assuming that censored subjects were, on the average, at risk for half the interval. If subjects tend to be censored more toward the beginning of an interval, then this estimate of the number of subjects at risk is too high. The estimated proportion dying in the interval (deaths divided by the number at risk) is then too low, so the survival estimate for that interval will be too high. If the survival rate changes during the course of an interval, then the survival estimates for that interval will not be reliable or informative.
  • Effects of grouping:
  • Any estimation procedure that relies on grouped data is vulnerable to distortion from the grouping algorithm. The intervals for a life table should be chosen before the data are collected, so that the interval boundaries will be independent of the observed data. The wider (longer) a time interval, the less likely it is that the assumption of a constant survival rate throughout the interval will be reasonable. A common rule of thumb is that there should be at least 8 to 10 intervals. If there are many censored values, it is particularly important that the number of time intervals not be too small. On the other hand, an interval with very few subjects in it will not have reliable variance estimates for the survival functions, and the calculated variance will tend to underestimate the true variance. If there are few subjects left alive in the final intervals of a study, then the variance estimates for those intervals should not be given as much credence as those for earlier intervals with more patients.
  • Many censored values:
  • A study may end up with many censored values, from having large numbers of subjects withdrawn or lost to follow-up, or from having the study end while many subjects are still alive. Large numbers of censored values decrease the equivalent number of subjects exposed (at risk), making the life table estimates less reliable than they would be for the same number of subjects with less censoring. Moreover, if there is heavy censoring, the survival estimates may be biased, and the estimated variances become poorer approximations, perhaps considerably smaller than the actual variances. With high levels of censoring, it is also important to avoid having only a small number of intervals. A high censoring rate may also indicate problems with the study: ending too soon (many subjects still alive at the end of the study), or a pattern in the censoring (many subjects withdrawn at the same time, younger patients being lost to follow-up sooner than older ones, etc.).
  • Patterns in plots of data:
  • If the assumptions for the censoring and survival distributions are correct, then a plot of either the censored or the noncensored values (or both together) against time should show no particular pattern, either over the whole study period or within the individual time intervals. This sort of graph can be constructed only when the individual values are known.
  • Special problems with small sample sizes:
  • If the sample size is small, it becomes particularly difficult to create time intervals that have enough subjects in them to provide reliable estimates of the survival functions and their variances while still being short enough to justify the assumption of a constant survival rate within each interval. A small sample size also makes it more difficult to detect possible dependencies between censoring and survival, or the presence of implicit factors. If the number of subjects exposed (at risk) in an interval or the number of subjects that survived to the beginning of that interval is small, the variance estimates for the survival functions will tend to underestimate the actual variance. This situation is most likely to occur for later intervals, when most subjects have either died or been censored, so that the variance estimates for later intervals are generally less reliable than those for earlier intervals.
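
The half-interval convention for censored subjects and the variance estimates discussed above can be made concrete with a small sketch. The Python below is an illustration only, not a replacement for a statistical package: the function name `life_table`, the parameter `censored_exposure`, and the counts are all hypothetical, and the standard errors use Greenwood's formula, which is a common choice assumed here rather than one named in the text.

```python
from math import sqrt


def life_table(n_start, deaths, censored, censored_exposure=0.5):
    """Actuarial life table estimates, one dict per interval.

    n_start           -- subjects under observation at the start of the first interval
    deaths            -- number of deaths observed in each interval
    censored          -- number withdrawn or lost to follow-up in each interval
    censored_exposure -- hypothetical knob: average fraction of an interval that a
                         censored subject was at risk (0.5 is the usual convention)
    """
    rows = []
    entering = n_start
    survival = 1.0
    greenwood_sum = 0.0
    for d, c in zip(deaths, censored):
        # Effective number exposed: each censored subject counts as a partial subject.
        exposed = entering - (1.0 - censored_exposure) * c
        if exposed <= 0 or d > exposed:
            raise ValueError("interval has no effective exposure for these counts")
        q = d / exposed      # estimated probability of dying in the interval
        p = 1.0 - q          # estimated probability of surviving the interval
        survival *= p        # cumulative survival to the end of the interval
        if p > 0:
            # Greenwood's formula: Var(S) is about S^2 * sum of q / (exposed * p).
            greenwood_sum += q / (exposed * p)
        std_error = survival * sqrt(greenwood_sum)
        rows.append({"entering": entering, "exposed": exposed,
                     "q": q, "survival": survival, "std_error": std_error})
        entering -= d + c    # subjects who start the next interval
    return rows


# Made-up counts for three intervals, starting with 100 subjects.
deaths = [10, 8, 6]
censored = [4, 6, 10]

usual = life_table(100, deaths, censored)                         # half-interval convention
early = life_table(100, deaths, censored, censored_exposure=0.1)  # censoring near interval start

for u, e in zip(usual, early):
    print(f"exposed {u['exposed']:.1f} vs {e['exposed']:.1f}; "
          f"survival {u['survival']:.4f} vs {e['survival']:.4f}; "
          f"std. error {u['std_error']:.4f} vs {e['std_error']:.4f}")
```

With identical counts, a smaller exposure fraction gives a smaller effective number exposed in every interval, and therefore a lower survival estimate. Comparing the two runs gives a rough sense of how sensitive a table is to where in the interval the censoring actually falls. The same function can be rerun with wider or narrower intervals to see how the effective number exposed, and with it the standard error, changes as subjects are used up in later intervals.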
