# Does your data violate rank sum test assumptions?

If the populations from which the data for a Mann-Whitney rank sum test were sampled violate one or more of the test's assumptions, the results of the analysis may be incorrect or misleading. For example, if the assumption of independence is violated, then the Mann-Whitney rank sum test is simply not appropriate, although another test (perhaps the Wilcoxon paired signed rank test) may be.

If outliers are present, or if the data in fact come from a normal distribution, then the rank sum test may not be the most powerful test available, and this could mean the difference between detecting a true difference and missing it. Another nonparametric test, the unpaired two-sample t test, or a transformation of the data may result in a more powerful test. If the population dispersions are unequal, a transformation may produce comparable dispersions.
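As a small illustration of how a transformation can produce comparable dispersions, the sketch below uses two made-up samples in which the second is simply the first multiplied by 10, so the spread grows with the level. The data, the helper name `spread_ratio`, and the choice of a log transformation are all assumptions for illustration; a log transformation only suits positive data with this kind of multiplicative structure.

```python
import math
import statistics

# Hypothetical samples: sample_b is sample_a scaled by 10,
# so its spread is 10 times larger on the raw scale.
sample_a = [1.0, 2.0, 4.0, 8.0, 16.0]
sample_b = [10.0 * x for x in sample_a]


def spread_ratio(a, b):
    """Ratio of sample standard deviations (b relative to a)."""
    return statistics.stdev(b) / statistics.stdev(a)


raw_ratio = spread_ratio(sample_a, sample_b)
log_ratio = spread_ratio(
    [math.log(x) for x in sample_a],
    [math.log(x) for x in sample_b],
)

print(f"raw scale SD ratio: {raw_ratio:.2f}")
print(f"log scale SD ratio: {log_ratio:.2f}")
```

On the raw scale the second sample is ten times as variable as the first; on the log scale the two spreads match, because multiplying by a constant becomes adding a constant. Real data will rarely be this tidy, so the transformed spreads should still be checked.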

Often, the effect of an assumption violation on the rank sum test result depends on the extent of the violation (such as how unequal the population dispersions are, or how skewed one or the other population distribution is). Some small violations may have little practical effect on the analysis, while other violations may render the rank sum test result uselessly incorrect or uninterpretable. In particular, small sample sizes can increase vulnerability to assumption violations.

#### Potential assumption violations include:

• Implicit factors:
• A lack of independence within a sample is often caused by the existence of an implicit factor in the data. For example, values collected over time may be serially correlated (here time is the implicit factor). If the data are in a particular order, consider the possibility of dependence. (If the row order of the data reflects the order in which the data were collected, an index plot of the data [data value plotted against row number] can reveal patterns that suggest possible time effects.)
• Lack of independence:
• Whether the two samples are independent of each other is generally determined by the structure of the experiment from which they arise. Obviously correlated samples, such as a set of pre- and post-test observations on the same subjects, are not independent, and such data would be more appropriately tested by a two-sample paired test. If you are unsure whether your samples are independent, you may wish to consult a statistician or someone who is knowledgeable about the data collection scheme you are using.
• Outliers:
• Values may not be identically distributed because of the presence of outliers. Outliers are anomalous values in the data. They may be due to recording errors, which may be correctable, or they may be due to the sample not being entirely from the same population. Apparent outliers may also arise when the values all come from the same population, but that population is skewed or heavy-tailed. Outliers tend to increase the estimate of sample variation, and if an outlier is the result of a recording or measurement error, it might lead to an incorrect conclusion that the dispersions of the two samples are not equal. Because the statistic for the rank sum test is resistant, it will not be substantially affected by the presence of outliers unless the number of outliers becomes large relative to the sample size. The boxplot and normal probability plot (normal Q-Q plot) may suggest the presence of outliers in the data. If you find outliers in your data that are not due to correctable errors, you may wish to consult a statistician as to how to proceed.
• Unequal population dispersions:
• The inequality of the population dispersions can be assessed by examining the relative size of the sample variations, either informally (including graphically) or with a dispersion test such as the Ansari-Bradley test. If both outliers and unequal dispersions are present, a transformation may resolve both problems at once, and also promote normality. In this case, it may be preferable to perform an unpaired two-sample t test on the transformed data, as the t test has slightly more power than the rank sum test if the assumption of normality holds. (The rank sum test has about 95% efficiency relative to the unpaired t test if the normality assumption is in fact correct.) The usual measure of sample variance is not resistant to outliers, while the Ansari-Bradley test is less subject to their influence. For this reason, the Ansari-Bradley test may not reject equality of dispersions even when the sample variances seem to be substantially different. A lack of power due to small sample sizes may also lead to this situation.
• Dissimilar distributional shapes:
• If the assumptions for the samples' population distributions are correct, the skewnesses of the two samples should be comparable, and if either sample suggests heavy tails or light tails (or neither) the other sample should suggest the same. Differences in distributional shapes can be assessed by examination of the data, as with boxplots, histograms, and normal probability plots. If a normality test gives different results for the two samples, that also suggests the possibility of differing distributional shapes.
• Patterns in plot of data:
• If the assumptions for the samples' population distributions are correct, the plot of either sample's values against its mean or median (or its sample ID) should suggest a horizontal band across the graph. Because there are only two unique sample means/medians or sample ID values, this type of graph will consist of two vertical "stacks" of data points; the stacks should be about the same length. Outliers may appear as anomalous points in the graph. A fan pattern like the profile of a megaphone, with a noticeable flare either to the right or to the left as shown in the picture (one of the "stacks" of data points is much longer than the other), suggests that the variation in the values increases in the direction the fan pattern widens (usually as the sample mean increases), and this in turn suggests that a transformation may be needed. Side-by-side boxplots of the two samples can also reveal lack of homogeneity of dispersion if one boxplot is much longer than the other, and reveal suspected outliers.
• Special problems with small sample sizes:
• If one or both of the sample sizes are small, assumption violations such as inequality of dispersions may be difficult to detect even when they are present. Also, with small sample sizes there is less resistance to outliers, and less protection against violation of assumptions. Even if none of the test assumptions are violated, a rank sum test with small sample sizes may not have sufficient power to detect a significant difference between the two samples, even if the medians are in fact different. Power decreases as the significance level is decreased (i.e., as the test is made more stringent), and increases as the sample size increases. With very small samples, even samples from populations with very different medians may not produce a significant rank sum test statistic. If a statistical significance test with small sample sizes produces a surprisingly non-significant P value, then a lack of power may be the reason. The best time to avoid such problems is in the design stage of an experiment, when appropriate minimum sample sizes can be determined, perhaps in consultation with a statistician, before data collection begins.
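The index plot described under implicit factors has a simple numeric companion: the lag-1 autocorrelation of the values taken in collection order. The sketch below is one possible way to compute it; the function name `lag1_autocorrelation` and both data sets are made up for illustration. Values near zero are what independent observations would tend to give, while values near +1 or -1 suggest serial dependence.

```python
import statistics


def lag1_autocorrelation(values):
    """Lag-1 sample autocorrelation of values taken in collection order."""
    mean = statistics.fmean(values)
    deviations = [v - mean for v in values]
    # Sum of products of each deviation with the next one in sequence.
    numerator = sum(d0 * d1 for d0, d1 in zip(deviations, deviations[1:]))
    denominator = sum(d * d for d in deviations)
    return numerator / denominator


# Hypothetical measurements in row (collection) order.
drifting = [5.0, 5.2, 5.1, 5.6, 5.9, 6.1, 6.0, 6.5, 6.8, 7.0]   # upward drift
scrambled = [6.5, 6.8, 5.0, 5.2, 6.1, 7.0, 5.6, 5.1, 5.9, 6.0]  # same values, no drift

r1_drifting = lag1_autocorrelation(drifting)
r1_scrambled = lag1_autocorrelation(scrambled)

print(f"drifting:  r1 = {r1_drifting:.2f}")
print(f"scrambled: r1 = {r1_scrambled:.2f}")
```

A common rough guide treats an absolute value above about 2 divided by the square root of the sample size as worth a closer look, but with samples this small the plot itself is usually more informative than any single number.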
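The resistance of the rank sum statistic to outliers, noted in the outliers item above, can be seen directly by computing the statistic by hand. The sketch below is a minimal illustration with made-up data and hypothetical helper names (`midranks`, `rank_sum_u`, `mean_difference`); it computes only the U statistic, not a P value.

```python
def midranks(values):
    """Ranks 1..N of values, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def rank_sum_u(sample_x, sample_y):
    """Mann-Whitney U for sample_x: its rank sum minus the smallest possible rank sum."""
    ranks = midranks(list(sample_x) + list(sample_y))
    n_x = len(sample_x)
    return sum(ranks[:n_x]) - n_x * (n_x + 1) / 2


def mean_difference(sample_x, sample_y):
    """Difference in sample means, shown for contrast with U."""
    return sum(sample_x) / len(sample_x) - sum(sample_y) / len(sample_y)


# Hypothetical data; in x_outlier the largest value of x has been mis-recorded.
x_clean = [12.1, 13.4, 14.0, 15.2, 16.8]
x_outlier = [12.1, 13.4, 14.0, 15.2, 168.0]
y = [10.5, 11.9, 12.7, 13.0, 14.5]

print("U:", rank_sum_u(x_clean, y), rank_sum_u(x_outlier, y))
print("mean difference:", mean_difference(x_clean, y), mean_difference(x_outlier, y))
```

Because the mis-recorded value was already the largest observation, its rank does not change and U is identical in both cases, while the difference in means is inflated many times over. An outlier that changes the ordering of the combined data would change U, though by a bounded amount.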
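The point made under unequal population dispersions, that the sample variance is not resistant to outliers, can be illustrated by comparing it with a resistant measure of spread. The sketch below uses the interquartile range as a stand-in; it is not the Ansari-Bradley test, and the data and the helper name `iqr` are assumptions for illustration. (For a formal test, SciPy provides `scipy.stats.ansari`.)

```python
import statistics


def iqr(values):
    """Interquartile range, using the statistics module's default quartile method."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


# Hypothetical samples with similar spread.
sample_a = [20.1, 21.3, 22.0, 22.4, 23.1, 23.8, 24.6, 25.2]
sample_b_clean = [19.8, 21.0, 21.9, 22.6, 23.0, 24.1, 24.4, 25.5]
# Same sample with the last value mis-keyed as 55.5 (a made-up recording error).
sample_b_outlier = sample_b_clean[:-1] + [55.5]

for label, sample_b in (("clean", sample_b_clean), ("outlier", sample_b_outlier)):
    variance_ratio = statistics.variance(sample_b) / statistics.variance(sample_a)
    iqr_ratio = iqr(sample_b) / iqr(sample_a)
    print(f"{label:8s} variance ratio = {variance_ratio:6.2f}   IQR ratio = {iqr_ratio:.2f}")
```

With the single mis-keyed value, the variance ratio suggests grossly unequal dispersions, while the IQR ratio is unchanged. This is the same mechanism by which a resistant dispersion test can fail to reject equality even when the sample variances look very different.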
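The relationship between sample size, significance level, and power described above can be explored with a small simulation. The sketch below is one possible approach under stated assumptions: both populations are normal with unit standard deviation and differ by a shift of one standard deviation, the P value comes from the normal approximation to the rank sum statistic without a continuity correction (rough for very small samples), and the function names `rank_sum_p_value` and `estimated_power` are hypothetical.

```python
import math
import random


def rank_sum_p_value(sample_x, sample_y):
    """Two-sided P value from the normal approximation to the rank sum statistic.

    Assumes continuous data (no ties) and applies no continuity correction.
    """
    n_x, n_y = len(sample_x), len(sample_y)
    labelled = sorted([(v, 0) for v in sample_x] + [(v, 1) for v in sample_y])
    rank_sum_x = sum(
        rank for rank, (_, label) in enumerate(labelled, start=1) if label == 0
    )
    u = rank_sum_x - n_x * (n_x + 1) / 2
    mean_u = n_x * n_y / 2
    sd_u = math.sqrt(n_x * n_y * (n_x + n_y + 1) / 12)
    z = (u - mean_u) / sd_u
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area


def estimated_power(n_per_group, shift, alpha=0.05, n_sim=1000, seed=1):
    """Fraction of simulated experiments in which the test rejects at level alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sim):
        x = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        y = [rng.gauss(shift, 1.0) for _ in range(n_per_group)]
        if rank_sum_p_value(x, y) < alpha:
            rejections += 1
    return rejections / n_sim


for n in (5, 10, 20, 40):
    print(f"n per group = {n:2d}   estimated power = {estimated_power(n, shift=1.0):.2f}")
```

Running the loop with a smaller `alpha`, such as 0.01, shows the other half of the statement: for the same simulated data, a more stringent test rejects less often. In the design stage, the same kind of simulation can be rerun with the shift and spread that are plausible for the experiment at hand to choose a minimum sample size.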