# Possible alternatives if your data violate multiple linear regression assumptions

Multiple linear regression on the original data may give misleading results, or may not be the best fit available, if any of the following holds: the linear model is incorrect; the Y values do not have constant variance; the Y data come from a population whose distribution violates the assumption of normality; or outliers, high-leverage points, or high-influence points are present. In such cases, fitting a different linear model or a nonlinear model, performing a weighted least squares linear regression, transforming the X or Y data, or using an alternative regression method may provide a better analysis.

#### Alternative procedures include:

• Different linear model:
• Y may actually be best modelled by a linear function that includes other variables in addition to the current set of X variables, or a subset of the current set of X variables, or a subset of the current set of X variables plus one or more new X variables. If a graph of the residuals against the prospective X variable suggests a linear trend, then adding the new X variable to the model may provide a better model. A "new" X variable might be derived from one or more X variables already in the equation, such as using the square of X1 along with X1 to handle curvature in X1, or adding X1*X2 as a new variable to handle interaction between X1 and X2. In a situation of multicollinearity, a more useful model may actually involve removing one or more X variables, perhaps also adding one or more new ones. If there is a blocking variable such that there is potentially a different linear regression within each block, then some form of analysis of covariance may be a better model. In situations where there are multiple Y values measured at each combination of X values, this implicit blocking can be dealt with by using the average of the different Y responses at each combination of X values and fitting the regression to this reduced data set. A possible drawback to this method is that reducing the number of data points reduces the degrees of freedom associated with the residual error, thus potentially reducing the power of the test.
• Nonlinear model:
• If Y is actually best modelled by a nonlinear function of the X variables, especially if a nonlinear model is suggested on theoretical grounds, then a nonlinear regression using one or more X variables can be used to provide the best fit to the X-Y data. The shape of the X-Y plot for an individual X variable may suggest an appropriate function to use, such as a polynomial in X or an exponential model. Transformations can also be used to deal with nonlinearity, but they involve changing the metric (and possibly the normality) of X or Y. However, a nonlinear model usually is more complex (more parameters) than a transformed linear model. If there are many parameters to fit and not very many data points, the precision of the fitted parameters for a more complex model may not be very good.
• Transformations:
• Transformations (a single function applied to each X data value for one or more of the X variables, or to each Y data value) are applied to correct problems of nonnormality or unequal variances, and sometimes multicollinearity. For example, taking logarithms of sample values can reduce skewness to the right. Transforming the Y values to remedy nonnormality often results in correcting heteroscedasticity (unequal variances). Occasionally, both X and Y variables are transformed. Unless scientific theory suggests a specific transformation a priori, transformations are usually chosen from the "power family" of transformations, where each value is replaced by x**p, where p is an integer or half-integer, usually one of:
• -2 (reciprocal square)
• -1 (reciprocal)
• -0.5 (reciprocal square root)
• 0 (log transformation)
• 0.5 (square root)
• 1 (leaving the data untransformed)
• 2 (square)

For p = -0.5 (reciprocal square root), 0, or 0.5 (square root), the data values must all be positive. To use these transformations when there are negative and positive values, a constant can be added to all the data values such that the smallest is greater than 0 (say, such that the smallest value is 1). (If all the data values are negative, the data can instead be multiplied by -1, but note that in this situation, data suggesting skewness to the right would now become data suggesting skewness to the left.) To preserve the order of the original data in the transformed data, if the value of p is negative, the transformed data are multiplied by -1.0; e.g., for p = -1, the data are transformed as x --> -1.0/x. Taking logs or square roots tends to "pull in" values greater than 1 relative to values less than 1, which is useful in correcting skewness to the right.
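As a rough illustration of the power family described above, the sketch below applies x**p, takes the logarithm for p = 0, and flips the sign for negative p so that the original ordering is preserved. The function names `power_transform` and `shift_to_positive` are hypothetical, chosen for this example only, and the sample values are made up; NumPy is assumed to be available.

```python
import numpy as np


def shift_to_positive(x, smallest=1.0):
    """Add a constant so that the smallest value equals `smallest` (> 0)."""
    x = np.asarray(x, dtype=float)
    return x + (smallest - x.min())


def power_transform(x, p):
    """Power-family transformation that preserves the order of the data."""
    x = np.asarray(x, dtype=float)
    # Per the text, p = -0.5, 0, and 0.5 require strictly positive data.
    if p in (-0.5, 0, 0.5) and np.any(x <= 0):
        raise ValueError("data must all be positive for p = -0.5, 0, or 0.5")
    if p == 0:
        return np.log(x)
    transformed = x ** p
    # For negative p, multiply by -1 so larger x still maps to larger values.
    return -transformed if p < 0 else transformed


data = np.array([-3.0, 0.0, 2.0, 12.0])    # made-up values with mixed signs
shifted = shift_to_positive(data)          # smallest value becomes 1
logged = power_transform(shifted, 0)       # log transformation
neg_recip = power_transform(shifted, -1)   # x --> -1.0/x
```

Shifting first and transforming second keeps the two steps separate, so the added constant can be reported alongside the chosen value of p.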

Another common transformation is the antilogarithm (exp(x)), which has effects similar to but more extreme than squaring: "drawing out" values greater than 1 relative to values less than 1.

Generally speaking, transformations of X are used to correct for non-linearity, and transformations of Y to correct for nonconstant variance of Y or nonnormality of the error terms. A transformation of Y to correct nonconstant variance or nonnormality of the error terms may also increase linearity. Transforming Y may change the error distribution from normal to nonnormal if the error distribution was normal to begin with.

A transformation of Y involves changing the metric in which the fitted values are analyzed, which may make interpretation of the results difficult if the transformation is complicated. If you are unfamiliar with transformations, you may wish to consult a statistician before proceeding.

The graph of the X-Y data may suggest an appropriate transformation of an X variable if the plot shows nonlinearity but constant error variance (that is, the general shape of the plot is not linear, but the vertical deviation in the data values appears constant over the range of X values).

If the X-Y plot suggests an arc from lower left to upper right so that data points either very low or very high in X lie below the trend suggested by the data, while the data points with middling X values lie on or above that trend, taking square roots or logarithms of the X values may promote linearity.

If the X-Y plot suggests an arc from upper left to lower right so that data points either very low or very high in X lie above the trend suggested by the data, while the data points with middling X values lie on or below that trend, taking reciprocals or reciprocals of the antilogarithms of the X values may promote linearity.

If the X-Y plot suggests an arc from lower left to upper right so that data points either very low or very high in X lie above the trend suggested by the data, while the data points with middling X values lie on or below that trend, taking squares or antilogarithms of the X values may promote linearity.

If the X-Y plot suggests an arc from upper left to lower right so that data points either very low or very high in X lie below the trend suggested by the data, while the data points with middling X values lie on or above that trend, taking squares or antilogarithms of the X values may promote linearity.
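One way to check whether a candidate transformation of X actually straightens the plot is to compare the R-square of the straight-line fit before and after transforming. The sketch below does this for made-up, noise-free data in which Y rises along an arc whose ends lie below the straight-line trend (the first of the four cases above). The helper `r_squared` is a hypothetical name, and the data are an assumption for illustration only.

```python
import numpy as np


def r_squared(x, y):
    """R-square of the simple least squares line of y on x."""
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (intercept + slope * x)
    return 1.0 - residuals.var() / y.var()


# Made-up, noise-free data: an arc rising from lower left to upper right,
# with the low and high X values lying below a straight-line trend.
x = np.linspace(1.0, 100.0, 50)
y = 2.0 + 3.0 * np.log(x)

# R-square of the straight-line fit for each candidate metric of X.
r2 = {
    "raw": r_squared(x, y),
    "sqrt": r_squared(np.sqrt(x), y),
    "log": r_squared(np.log(x), y),
}
```

With real data the comparison will be less clear-cut, and the plot of residuals against the transformed X should still be examined rather than relying on R-square alone.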

The choice of a transformation of Y may be suggested by examining the plot of residuals against fitted values. If this appears linear, but the variance of the residuals increases as fitted Y increases, suggesting a wedge or megaphone shape, then taking square roots, logarithms, or reciprocals of the Y values may promote homogeneity of variance.

If the plot of residuals against fitted values is a convex arc from lower left to upper right, and the variance of the residuals increases as fitted Y increases, then taking square roots of the Y values may promote homogeneity of variance.

If the plot of residuals against fitted values is a concave arc from upper left to lower right, and the variance of the residuals decreases as fitted Y increases, then taking logarithms of the Y values may promote homogeneity of variance.

When a transformation of Y is indicated, a simultaneous transformation of X variable(s) may also improve linearity of the fit with the transformed Y.
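The effect of a Y transformation on the spread of the residuals can be checked numerically as well as graphically. The sketch below uses simulated data with multiplicative errors (an assumption made purely for illustration) and compares the residual standard deviation among observations with large fitted values to that among observations with small fitted values, before and after taking logarithms of Y. The helper name `spread_ratio` is hypothetical.

```python
import numpy as np


def spread_ratio(x, y):
    """Fit y on x by least squares, then return the ratio of the residual
    standard deviation for the larger fitted values to that for the smaller
    fitted values. A ratio well above 1 suggests a wedge shape."""
    slope, intercept = np.polyfit(x, y, 1)
    fitted = intercept + slope * x
    residuals = y - fitted
    upper = fitted > np.median(fitted)
    return residuals[upper].std() / residuals[~upper].std()


# Simulated data (assumed for illustration): errors multiply Y, so the
# spread of Y grows with its mean on the original scale.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 200)
y = np.exp(1.0 + 0.3 * x + rng.normal(0.0, 0.3, size=x.size))

ratio_raw = spread_ratio(x, y)          # original metric
ratio_log = spread_ratio(x, np.log(y))  # after taking logarithms of Y
```

A ratio near 1 after transformation is consistent with roughly constant variance, although a crude two-group comparison like this is no substitute for inspecting the residual plot itself.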

• Weighted least squares linear regression:
• If the plot of the residuals against fitted Y suggests heteroscedasticity (a wedge or megaphone shape instead of a featureless cloud of points), then a weighted linear regression may provide more precise estimates for the coefficients and intercept. The weights should be chosen to be proportional to the reciprocal of the variance. For example, if the variance is approximately proportional to the fitted Y, then weights inversely proportional to the fitted Y would be appropriate. These weights could be calculated by fitting an unweighted least squares linear regression, then using the reciprocals of the fitted values from that regression as the weights for a weighted least squares linear regression. Alternatively, the weights could be chosen empirically as the reciprocals of the original Y values. Although weighted least squares linear regression may deal with nonconstant variance in Y, it is sensitive to outliers just as unweighted least squares linear regression is.
• Alternative regression methods:
• One alternative method is to calculate the fit so as to minimize the sum of the absolute values of the residuals (instead of minimizing the sum of their squared values). Most alternative methods to least squares involve iteration to converge to the final fit, which can make them computationally intensive. And although alternative methods may be more robust or resistant than the least squares fit to departures from normality or to outliers, they are not necessarily immune. Unless it involves some form of weighting or trimming values, an alternative linear regression method will not address the problem of inequality of variances. Any alternative method for linear regression will assume that the Y observations are mutually independent, that the residuals have the same variance and are centered about 0, and that the linear model is in fact the correct one. If the Y values do indeed come from populations with normal distributions, with the Y variable having constant variance, and the linear model is correct, then the least squares estimates of the coefficients are unbiased and have the smallest variance among all unbiased estimates of the coefficients.
• Removing outliers:
• A common method of dealing with apparent outliers or high-leverage or high-influence data points in a regression situation is to remove those observations and then refit the regression to the remaining points. If the regression function is not substantially changed by the removal, then the fit to the remaining points will be improved without misrepresenting the data. However, if the outliers are due to a nonnormal distribution for the Y sample population, or to the underlying model being nonlinear, more can be learned by fitting a better model to the entire data (as by a nonlinear model, a linear model with additional X variables, a model with transformed X or Y, or an alternative method of fitting the multiple linear model) than by ignoring valid data values. And while removing a point that has a large residual may lead to a smaller residual variance for the new fitted linear function, it will not necessarily lead to a greater R-square value for the new fitted model, or to a smaller P value for the F test of overall fit.
• Mechanical methods:
• A common method of dealing with a large number of X variables is to use a stepwise regression routine or other mechanical method to identify the "best" set of X variables to use. Such methods assume that you have identified all the reasonable candidate X variables, and only need to choose among them. One method is simply to perform all possible linear regressions, which may be feasible if the number of candidate X variables is small. (For k X variables, there are (2**k)-1 regressions, assuming that at least one X variable will be used. For 4 X variables, this would be 15 possible regressions.) The regression equations with the largest adjusted R-square values and small PRESS values can then be examined further to see which seem the most reasonable. If there are too many X variables to examine each possible regression, then stepwise regression is often used. This is a mechanical method that adds, deletes, or both adds and deletes X variables one at a time to arrive at a "best" regression equation. At each step, the decision to add or drop an X variable is based on a test of whether that variable will or does make a statistically significant contribution to the model. Stepwise regression identifies a single regression instead of several possible candidates. The particular set of X variables suggested by a mechanical method, especially a stepwise multiple regression, may be very dependent on the specific data values observed for X and Y. Such models should always be validated. First, make sure that the model makes sense theoretically, and is comparable to any results from fitting to other data sets. Then, try the model out on new data to see if it still holds. One validation method is to divide the data set into two parts, using one to fit the equation, and the other to decide whether it is a reasonable model. Validation is vital when using a stepwise procedure.
• Methods aimed at dealing with multicollinearity:
• One simple method of dealing with multicollinearity is to add more data observations, aiming at covering a wider range of values in the X variables. This may or may not be feasible. Sometimes it is clear that two or more X variables are measuring quantities that theoretically should be closely related (such as HDL and total cholesterol, or area and volume), or that are each closely related to a variable that you did not or could not measure directly (many variables may be closely related to age, for example). In such cases, a more useful model may use only one of the group of such related X variables, so that the fitted coefficients will be less variable. In general, the fewest possible X variables that include the available information about Y should be included in the model, especially if that helps make the number of data observations at least 6 to 10 times the number of X variables. If the ratio of the total number of coefficients (including the intercept) to the total number of data points is greater than 0.4, it will often be difficult to fit a reliable model. More formal methods for dealing with multicollinearity include ridge regression, Bayesian regression, and regression with principal components. See Belsley et al. for more details. Multicollinearity may not be so serious a problem if the purpose of fitting the regression equation is predicting Y in the range of the X variables, rather than truly modeling the linear relationship between X and Y and estimating the values of the individual coefficients.
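Returning to the weighted least squares procedure described earlier in this list, the two-stage weighting (fit an unweighted regression, then reuse the reciprocals of its fitted values as weights) can be sketched as follows. This is a minimal illustration rather than a recommended implementation: the function name `two_stage_wls` is hypothetical, the simulated data are made up, and the sketch assumes the variance is roughly proportional to the fitted Y and that all fitted values are positive.

```python
import numpy as np


def two_stage_wls(X, y):
    """Return (ols_coef, wls_coef), each ordered as [intercept, slopes...].

    X holds the X variables (one column per variable, no intercept column).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])  # prepend intercept column

    # Stage 1: unweighted least squares.
    ols_coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ ols_coef
    if np.any(fitted <= 0):
        raise ValueError("reciprocal weights need positive fitted values")

    # Stage 2: weights are the reciprocals of the stage-1 fitted values.
    # Scaling each row by sqrt(weight) turns the weighted problem into an
    # ordinary least squares problem.
    root_w = np.sqrt(1.0 / fitted)
    wls_coef, *_ = np.linalg.lstsq(design * root_w[:, None], y * root_w,
                                   rcond=None)
    return ols_coef, wls_coef


# Simulated example (assumed): variance of Y proportional to its mean.
rng = np.random.default_rng(1)
x = np.linspace(1.0, 10.0, 200)
mean_y = 2.0 + 3.0 * x
y = mean_y + np.sqrt(mean_y) * rng.normal(size=x.size)

ols_fit, wls_fit = two_stage_wls(x, y)
```

As the text notes, the weighted fit remains sensitive to outliers, so the residual plots should be re-examined after reweighting.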