# Does your data violate paired t test assumptions?

If the population from which the paired differences analyzed by a paired t test were sampled violates one or more of the paired t test assumptions, the results of the analysis may be incorrect or misleading. For example, if the assumption of independence for the paired differences is violated, then the paired t test is simply not appropriate.

Note that the two values that make up each paired difference need not be independent, and in fact are expected to be correlated, as with before and after measurements. If you treat paired data as coming from two independent samples, for example by running an unpaired two-sample t test instead of a paired t test, then you may sacrifice power.
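The loss of power from ignoring the pairing can be illustrated with a minimal sketch. The data below are hypothetical before and after values made up for illustration, and the function names `paired_t` and `unpaired_t` are assumptions of this example, not part of any package. Only the Python standard library is used, so the sketch computes t statistics rather than P values.

```python
import math
import statistics


def paired_t(before, after):
    """t statistic for the mean of the paired differences (df = n - 1)."""
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))


def unpaired_t(x, y):
    """Pooled-variance two-sample t statistic (df = n_x + n_y - 2)."""
    nx, ny = len(x), len(y)
    pooled_var = (
        (nx - 1) * statistics.variance(x) + (ny - 1) * statistics.variance(y)
    ) / (nx + ny - 2)
    return (statistics.mean(y) - statistics.mean(x)) / math.sqrt(
        pooled_var * (1 / nx + 1 / ny)
    )


# Hypothetical data: subjects differ a lot from each other,
# but each subject changes by a similar small amount.
before = [120, 135, 150, 165, 180, 142, 158, 171]
after = [117, 131, 148, 160, 176, 139, 153, 168]

print("paired t:  ", round(paired_t(before, after), 3))
print("unpaired t:", round(unpaired_t(before, after), 3))
```

With these made-up values the paired statistic is far from 0 while the unpaired statistic is close to 0, because the unpaired analysis counts the large between-subject variation as noise.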

If the assumption of normality is violated, or outliers are present, then the paired t test may not be the most powerful test available, and this could mean the difference between detecting a true difference or not. A nonparametric test or employing a transformation may result in a more powerful test. For example, if the distribution of the paired differences is not symmetric, a transformation may produce symmetry.
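As a minimal sketch of how a transformation can produce symmetry, the example below compares the sample skewness of raw paired differences with that of differences of log-transformed values. The log transformation is only one possible choice and is an assumption of this example, as are the hypothetical data (constructed so that the change is roughly proportional to the starting value) and the helper name `sample_skewness`.

```python
import math
import statistics


def sample_skewness(values):
    """Moment-based sample skewness m3 / m2**1.5 (not resistant to outliers)."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical data in which the change grows with the starting value.
before = [10, 20, 40, 80, 160, 320]
after = [14, 32, 60, 112, 256, 480]

raw_diffs = [a - b for b, a in zip(before, after)]
log_diffs = [math.log(a) - math.log(b) for b, a in zip(before, after)]

raw_skew = sample_skewness(raw_diffs)
log_skew = sample_skewness(log_diffs)
print("skewness of raw differences:", round(raw_skew, 3))
print("skewness of log differences:", round(log_skew, 3))
```

Here the raw differences are clearly right-skewed while the log differences are close to symmetric. A paired t test on the log scale addresses the mean log ratio rather than the mean difference, so the hypothesis being tested changes along with the scale.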

Often, the effect of an assumption violation on the paired t test result depends on the extent of the violation (such as how skewed the distribution of the paired differences is). Some small violations may have little practical effect on the analysis, while other violations may render the paired t test result uselessly incorrect or uninterpretable. In particular, small sample sizes can increase vulnerability to assumption violations.

#### Potential assumption violations include:

• Implicit factors:
• A lack of independence within a sample is often caused by the existence of an implicit factor in the data. For example, values collected over time may be serially correlated (here time is the implicit factor). If the data are in a particular order, consider the possibility of dependence. (If the row order of the data reflects the order in which the data were collected, an index plot of the data [data value plotted against row number] can reveal patterns that suggest possible time effects.)
• Outliers:
• Values may not be identically distributed because of the presence of outliers. Outliers are anomalous values in the data. They may be due to recording errors, which may be correctable, or they may be due to the sample not being entirely from the same population. Apparent outliers may also be due to the values being from the same, but nonnormal, population. The boxplot and normal probability plot (normal Q-Q plot) may suggest the presence of outliers in the data. The paired t statistic is based on the sample mean and the sample variance of the paired differences, both of which are sensitive to outliers. (In other words, neither the sample mean nor the sample variance is resistant to outliers, and thus, neither is the t statistic.) In particular, a large outlier can inflate the sample variance, decreasing the calculated t statistic and lowering the chance of rejecting the null hypothesis, thus perhaps eliminating a significant difference. A nonparametric test may be a more powerful test in such a situation. If you find outliers in your data that are not due to correctable errors, you may wish to consult a statistician as to how to proceed.
• Interaction between pairs and treatments:
• A paired difference is the change from the first value in the pair to the second value in the pair. The paired t test assumes that the size of the paired difference does not depend on the identity of the pair. In particular, the size of the paired difference is assumed to be independent of the size of the first value in the pair. For example, if a paired t test is performed comparing blood pressure before and after a drug treatment, then the average change produced by the drug should be the same for those who start with high blood pressure as for those who start with normal blood pressure. If this is not the case, then the simple additive model assumed by the paired t test is incorrect, and there is interaction between pairs and the treatment groups (before and after). The plot of residuals against fitted values may help detect such interaction. The plot of observed values against sample (treatment) number may be even more useful in detecting interaction. If there is no interaction, the line segments (one for each pair) should be parallel or nearly so.
• Skewness:
• If the population from which the paired differences were sampled is skewed, then the paired t test may incorrectly reject the null hypothesis that the mean of the paired differences is 0 even when it is true. The paired signed rank test also assumes symmetry, and may not be an appropriate alternative in this case. The paired sign test does not rely on symmetry, and may be an appropriate alternative test. Unless the skewness is severe, or the sample size very small, the t test may perform adequately. Paired differences are often symmetric even when the two populations producing the values that make up the paired differences are both asymmetric, provided that those two populations have similar skewness. For example, two very positively skewed distributions that differ only by location will produce a set of paired differences that are symmetric about 0, and perfectly suitable for the paired t test. This is often the case with before and after measurements. Whether or not the population of the paired differences is skewed can be assessed either informally (including graphically), or by examining the sample skewness statistic or conducting a test for skewness. If outliers or skewness are present, employing a transformation may resolve both problems at once, and also promote normality. In this case, it may be preferable to perform a paired t test on the transformed data. The usual measurement for skewness is not resistant to outliers, so one should consider the possibility that apparent skewness is in fact due to one or more outliers. A lack of power due to small sample sizes may also make it hard to detect skewness.
• Nonnormality:
• The values in a sample may indeed be from the same population, but not from a normal one. Signs of nonnormality are skewness (lack of symmetry) or light-tailedness or heavy-tailedness. The boxplot, histogram, and normal probability plot (normal Q-Q plot), along with the normality test, can provide information on the normality of the population distribution. However, if there are only a small number of data points, nonnormality can be hard to detect. If there are a great many data points, the normality test may detect statistically significant but trivial departures from normality that will have no real effect on the t statistic (since the t statistic will converge in distribution to the standard normal distribution by the central limit theorem). For data sampled from a normal distribution, normal probability plots should approximate straight lines, and boxplots should be symmetric (median and mean together, in the middle of the box) with no outliers. If the sample size for the paired differences is not too small, then the t statistic will not be much affected even if the population distributions are skewed, although skewness will increase the chance that an incorrectly small P value will be reported (i.e., that the null hypothesis will be rejected when it is in fact true). Unless the sample size for the paired differences is small (less than 10), light-tailedness or heavy-tailedness will have little effect on the t statistic. Light-tailedness will tend to increase the chance that an incorrectly small P value will be reported (i.e., that the null hypothesis will be rejected when it is in fact true). Heavy-tailedness will tend to increase the chance that an incorrectly large P value will be reported (i.e., that the null hypothesis will not be rejected when it is in fact false), making the test conservative. Paired differences will often be symmetric even when they arise from two skewed distributions, although such paired differences may be heavy-tailed.
Robust statistical tests operate well across a wide variety of distributions. A test can be robust for validity, meaning that it provides P values close to the true ones in the presence of (slight) departures from its assumptions. It may also be robust for efficiency, meaning that it maintains its statistical power (the probability that a true violation of the null hypothesis will be detected by the test) in the presence of those departures. The t test is fairly robust for validity against nonnormality, but it may not be the most powerful test available for a given nonnormal distribution, although it is the most powerful test available when its test assumptions are met. In the case of nonnormality, a nonparametric test or employing a transformation may result in a more powerful test.
• Patterns in plot of data:
• Outliers may appear as anomalous points in a graph of the paired differences against their mean. A boxplot or normal probability plot of the paired differences can also reveal lack of symmetry and suspected outliers.
• Special problems with small sample sizes:
• If the number of paired differences is small, it may be difficult to detect assumption violations: with small samples, violations such as nonnormality are hard to detect even when they are present. Also, with small sample sizes there is less resistance to outliers, and less protection against violation of assumptions. Even if none of the test assumptions are violated, a t test with small sample sizes may not have sufficient power to detect a departure from 0 of the mean of the paired differences, even when such a departure exists. The power curve presented in the results of the t test indicates how likely the test would be to detect an actual difference between 0 and the mean of the paired differences.
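The paired sign test mentioned under Skewness above uses only the signs of the paired differences, which is why it does not rely on symmetry. A minimal sketch of an exact two-sided version follows. The function name `sign_test_p_value` and the convention of dropping zero differences are assumptions of this example. Under the null hypothesis, the number of positive differences follows a binomial distribution with success probability 1/2.

```python
import math


def sign_test_p_value(diffs):
    """Exact two-sided sign test of H0: the median paired difference is 0.

    Zero differences carry no sign information and are dropped.
    """
    nonzero = [d for d in diffs if d != 0]
    n = len(nonzero)
    positives = sum(1 for d in nonzero if d > 0)
    tail = min(positives, n - positives)
    # P(X <= tail) for X ~ Binomial(n, 1/2), doubled for a two-sided test.
    p = 2 * sum(math.comb(n, i) for i in range(tail + 1)) / 2 ** n
    return min(1.0, p)


# Hypothetical paired differences: nine positive, one negative.
diffs = [1, 2, 3, 4, 5, 6, 7, 8, 9, -1]
print(sign_test_p_value(diffs))
```

Because the sign test discards the magnitudes of the differences, it is typically less powerful than the t test when the t test assumptions do hold.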

The shallower the power curve, the bigger the actual difference would have to be before the t test would detect it. The power depends on variance, the selected significance (alpha-) level of the test, and the sample size. Power decreases as the variance increases, decreases as the significance level is decreased (i.e., as the test is made more stringent), and increases as the sample size increases. A very small sample from a population of paired differences with a mean very different from 0 may not result in a significant t test statistic unless the variance of the paired differences is small. If a statistical significance test with small sample sizes produces a surprisingly non-significant P value, then a lack of power may be the reason. The best time to avoid such problems is in the design stage of an experiment, when appropriate minimum sample sizes can be determined, perhaps in consultation with a statistician, before data collection begins.
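The dependence of power on the variance, the significance level, and the sample size can be sketched with a large-sample normal approximation. This is an approximation chosen here for simplicity, not the exact calculation based on the noncentral t distribution, and for small samples it overstates the power of the t test. The function name `approx_power` and its parameters are assumptions of this example.

```python
import math
from statistics import NormalDist


def approx_power(delta, sigma, n, alpha=0.05):
    """Approximate power of a two-sided test that the mean paired difference is 0.

    delta: true mean of the paired differences
    sigma: standard deviation of the paired differences
    n:     number of pairs
    alpha: significance level
    Uses a normal approximation in place of the noncentral t distribution.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = delta * math.sqrt(n) / sigma
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Hypothetical settings showing the three effects described above.
print("n = 10:          ", round(approx_power(5, 10, 10), 3))
print("n = 40:          ", round(approx_power(5, 10, 40), 3))
print("larger variance: ", round(approx_power(5, 20, 10), 3))
print("stricter alpha:  ", round(approx_power(5, 10, 10, alpha=0.01), 3))
```

Evaluating the function over a range of `delta` values traces an approximate power curve. When `delta` is 0 the function returns `alpha`, the probability of rejecting a true null hypothesis.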