Examining multiple linear regression results to detect assumption violations

All the following results are provided as part of a multiple linear regression analysis.

Results for residuals:

• Graphs of the fitted values against X:
• If the assumption of equal variances for the Y values is correct, the plot of fitted Y against each X should suggest a band across the graph with roughly equal vertical width for all values of X. (That is, the shape of the graph should suggest a tilted cigar, not a wedge or a megaphone.) A wedge-shaped fan pattern like the profile of a megaphone, with a noticeable flare either to the right or to the left, suggests that the variance of the values increases in the direction the fan pattern widens (usually as X increases); this in turn suggests that a transformation of the X or Y values, or a weighted least squares linear regression, may be appropriate.

Points that are far from the others may be outliers in the data, or may suggest a nonnormal population distribution for Y. If an outlier is a high-leverage point, it may pull the fitted function toward it and perhaps away from the main body of the data, and so may not appear as an outlier in the plot of fitted Y against X. Alternatively, a high-leverage point may make other points appear to be outliers by drawing the fitted function toward itself.

You may be able to gain additional insight by examining plots of the observed Y values against individual X variables before you perform the regression. Four different scenarios can arise in such plots:

1. A linear relationship between X and Y seems reasonable.

2. The points seem to follow a curve, not a straight line; a linear relationship between X and Y does not appear to be appropriate for these data. A transformation may create a data set for which a linear fit is appropriate, or a nonlinear model may provide a better fit.

3. The majority of the points seem to follow a linear trend, but there is an outlier that may cause the fitted equation to lie so that it does not provide a good fit to the majority of the data points. An alternative regression method may provide a better fit. The outlying data point should also have its X and Y values double-checked, in case a recording error has been made.

4. The majority of the points lie on a vertical straight line, and only the presence of an outlier has created any variation in X. This situation may cause the fitted equation to pass through the one outlier, so that it will not show up as a large residual.

These examples demonstrate the importance of examining plots of the data whenever a regression is to be done.
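Scenario 3 can be made concrete with a small sketch (the data here are made up for the demonstration): a single corrupted response value pulls the least-squares slope well away from the trend of the other nine points.

```python
import numpy as np

# Ten points on a clean linear trend y = 2 + 3x, then one corrupted response.
x = np.arange(1.0, 11.0)
y = 2.0 + 3.0 * x
y_out = y.copy()
y_out[9] = 5.0                      # recording error in the last Y value

def slope(x, y):
    """Least-squares slope of y on x (with an intercept term)."""
    A = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[1]

print(slope(x, y))       # → 3.0 (exact fit to the clean data)
print(slope(x, y_out))   # noticeably smaller: the outlier drags the line down
```

Because the corrupted point also sits at the extreme of the X range, it has leverage as well as a large error, which is why its effect on the slope is so pronounced.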

• High-leverage points:
• A high-leverage point is one that exerts a great deal of influence on the path of the fitted equation. For the fitted equation, the centroid of the data (the point at the means of the Xs and the mean of Y) acts as a fulcrum, and the fitted function pivots toward high-leverage points, perhaps fitting the main body of the data poorly. A data point that is extreme in Y but lies near the center of the range of one of the X variables may not have much effect on that X variable's fitted coefficient, but by changing the estimate of the mean of Y, it may affect the fitted estimates of the coefficients for other X variables. If a point has high leverage, then removing it can have a substantial effect on the estimates of the coefficients and on the fitted values of Y, especially if the point also has a relatively large residual. An alternative fitting method other than least squares may be better in such a situation.

An observation with leverage greater than 2p/n, where p is the number of coefficients (including the intercept) and n the number of observations, is a high-leverage point, and is likely to be an outlier among the X values. (The average value of the leverages is p/n.) Other potential signs of high leverage for an observation are one observation having a much greater leverage value than all the others, or a leverage value greater than 0.5. Because points with high leverage pull the fitted equation toward them, they may have small residuals, and thus not stand out in a plot of residuals against fitted values.

A raw residual can be adjusted for the leverage of the corresponding observation in various ways, producing internally studentized residuals, deleted residuals, and externally studentized residuals, also known as studentized deleted residuals. In each case, points with high leverage will tend to have larger adjusted residuals than raw residuals. An observation with a studentized deleted residual greater than 2 in absolute value is likely to be an outlier in Y.

In cases of severe multicollinearity, it may not be possible to calculate some of the diagnostic measures of leverage or influence. These diagnostics also are not calculated if the fit is exact.
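The leverage rules of thumb above can be sketched in a few lines of numpy (the data set here is hypothetical, with one row made deliberately extreme in X):

```python
import numpy as np

# Design matrix with an intercept column; the last row is extreme in X.
X = np.array([[1.0, 2.1], [1.0, 2.9], [1.0, 3.2], [1.0, 4.0],
              [1.0, 3.5], [1.0, 10.0]])
n, p = X.shape                             # n observations, p coefficients

# Leverage is the diagonal of the hat matrix H = X (X'X)^{-1} X',
# i.e. h_i = x_i' (X'X)^{-1} x_i for each observation.
XtX_inv = np.linalg.inv(X.T @ X)
leverage = np.einsum('ij,jk,ik->i', X, XtX_inv, X)

print(leverage.sum())                      # leverages always sum to p
high = leverage > 2 * p / n                # 2p/n rule of thumb from the text
print(np.flatnonzero(high))                # → [5]: the extreme row is flagged
```

Note that the flagged point is extreme only in X; leverage is a property of the design, computed before any Y values are involved.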

• High-influence points:
• Besides leverage, there are other measures of whether a single observation has a great deal of influence on the path of the fitted equation. In general, the idea is to compare the fit with and without that observation, producing a measure of how much the fit is affected by removing that point. DFFITS measures how much the ith fitted value of Y changes when the ith point is removed from the data set. Large absolute values of DFFITS (greater than 1 for smaller data sets, or greater than twice the square root of p/n, where p is the number of coefficients including the intercept and n the number of data points) suggest that the corresponding data point is influential. Cook's distance measures the combined influence of the ith point on all the regression coefficients. It takes on greater values for data points with large residuals, large leverage values, or both. COVRATIO measures the change in the variance-covariance matrix of the coefficients with and without the ith point. It takes on greater values for data points with large leverage values, and tends to be small when the studentized deleted residual is large.

DFBETAS measure the influence of a data point on a particular coefficient. A large absolute value of DFBETAS(i,j) (greater than 1 for smaller data sets, or greater than twice the square root of 1/n, where n is the number of data points) suggests that the ith point influences the jth coefficient. If two or more influential points are near each other, then each may mask the effect of deleting the other(s), and then none of them may have a large value for these influence measures. You may be able to spot such clumps of points in graphs of Y, fitted Y, or residuals against individual X variables.
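All of these deletion diagnostics can be computed without actually refitting n times, using standard closed-form identities based on leverage. A minimal numpy sketch (the simulated data and the choice of perturbed point are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 30, 3                       # intercept plus two predictors
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)
y[0] += 5.0                        # perturb one point to make it influential

XtX_inv = np.linalg.inv(X.T @ X)
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)     # leverages
beta = XtX_inv @ X.T @ y
e = y - X @ beta                                # raw residuals
s2 = e @ e / (n - p)                            # mean squared error

# Externally studentized (studentized deleted) residuals
s2_del = ((n - p) * s2 - e**2 / (1 - h)) / (n - p - 1)
t = e / np.sqrt(s2_del * (1 - h))

dffits = t * np.sqrt(h / (1 - h))
r = e / np.sqrt(s2 * (1 - h))                   # internally studentized
cooks_d = r**2 * h / (p * (1 - h))

# DFBETAS: standardized change in each coefficient when point i is deleted,
# using beta - beta_(i) = (X'X)^{-1} x_i e_i / (1 - h_i)
delta_beta = (XtX_inv @ X.T) * (e / (1 - h))    # shape (p, n)
dfbetas = delta_beta.T / np.sqrt(s2_del[:, None] * np.diag(XtX_inv))

print(np.argmax(np.abs(dffits)))   # index of the most influential point
```

With these cutoffs, the perturbed first observation exceeds the 2·sqrt(p/n) DFFITS threshold while the untouched points do not. Packages such as statsmodels expose the same quantities ready-made (via `OLSInfluence`), which is preferable in practice.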

• Test of goodness of fit:
• If the linear model is in fact the correct one, then the overall F test for fit tests the null hypothesis that all the coefficients for the Xs are 0 (i.e., that knowledge of the X variables does not allow for better prediction of Y than knowledge of Y alone, since the fit with all the coefficients equal to 0 is fitted Y = mean of Y). However, if the number of data points is small, or the variation in the observed X is small (perhaps because the range of the observed X is restricted), or the residual variance is large, the test may not have enough power to detect a non-zero coefficient, leading to a nonsignificant test result. (If the variance of Y is large enough, determining any useful model may be impossible.) A failure of the test for fit to reject the null hypothesis of zero coefficients may also happen when the linear model is not appropriate. Conversely, a significant test result does not necessarily mean that the linear model is the correct one, only that fitting a multiple linear function provides a better estimate of Y than simply using the mean of Y.

The R-square statistic and the multiple correlation coefficient are descriptive measures of how strong the linear association is between the observed and fitted Y values, but they are not tests of goodness of fit per se. Other measures of fit, such as the adjusted R-square and the Akaike information criterion (AIC), are designed to take into account the number of X variables in the model. Because R-square can never decrease as new X variables are added, the adjusted R-square or AIC may give a better idea of how the strength of the association between the observed and fitted Y values has changed as X variables are added to or deleted from the model. The adjusted R-square may in fact decrease if a new X variable does not substantially increase the amount of variation in Y explained by the X variables.
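The overall F statistic, R-square, and adjusted R-square all come from the same sums-of-squares decomposition, which can be sketched directly (the small data set below is hypothetical):

```python
import numpy as np

# Hypothetical data: intercept plus two X variables, 8 observations.
X = np.column_stack([np.ones(8),
                     [1, 2, 3, 4, 5, 6, 7, 8],
                     [2, 1, 4, 3, 6, 5, 8, 7]])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.2, 8.8])
n, p = X.shape                      # p includes the intercept

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
sse = np.sum((y - fitted) ** 2)     # residual (error) sum of squares
sst = np.sum((y - y.mean()) ** 2)   # total sum of squares about the mean
ssr = sst - sse                     # regression sum of squares

r2 = 1 - sse / sst
adj_r2 = 1 - (sse / (n - p)) / (sst / (n - 1))
f_stat = (ssr / (p - 1)) / (sse / (n - p))  # tests all slope coefficients = 0

print(round(r2, 4), round(adj_r2, 4))
```

The adjusted R-square divides each sum of squares by its degrees of freedom, which is exactly why it can fall when an unhelpful X variable is added even though the plain R-square cannot.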
• Variance inflation factors:
• Variance inflation factors (VIF) measure how much the variance of each estimated coefficient is increased over the case of no correlation among the X variables. If no two X variables are correlated, then all the VIFs will be 1. If the average of the VIFs is much greater than 1, or the maximum VIF is greater than 10, then multicollinearity may be influencing the fitted coefficients. Although large VIFs can indicate the presence of multicollinearity, they cannot distinguish between multiple simultaneous cases of multicollinearity. Other informal signs of multicollinearity are:
• Regression coefficients change drastically when adding or deleting an X variable.
• A regression coefficient is negative when theoretically Y should increase with increasing values of that X variable, or the regression coefficient is positive when theoretically Y should decrease with increasing values of that X variable.
• None of the individual coefficients has a significant t statistic, but the overall F test for fit is significant.
• A regression coefficient has a nonsignificant t statistic, even though on theoretical grounds that X variable should provide substantial information about Y.
• High pairwise correlations between the X variables. (But three or more X variables can be multicollinear together without having high pairwise correlations.)
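The VIF computation itself is simple: regress each X variable on all the others and set VIF_j = 1 / (1 − R_j²). A sketch on hypothetical data, where the third variable is built as a near-linear combination of the first two (so pairwise correlations alone would understate the problem):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = 0.8 * x1 + 0.6 * x2 + rng.normal(scale=0.05, size=n)  # collinear
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF for column j: regress X[:, j] on the remaining columns."""
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])   # include an intercept
    coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
    resid = X[:, j] - A @ coef
    r2 = 1 - resid @ resid / np.sum((X[:, j] - X[:, j].mean()) ** 2)
    return 1.0 / (1.0 - r2)

vifs = [vif(X, j) for j in range(X.shape[1])]
print([round(v, 1) for v in vifs])   # well above the cutoff of 10
```

Because the collinearity involves three variables at once, every VIF here is inflated even though no single pairwise correlation fully explains it, matching the caveat in the last bullet above.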
• Normality test for residuals:
• If all the assumptions for the multiple linear regression hold, all the residuals should come from the same normal distribution with mean 0. Departures from normality can suggest the presence of outliers in the data, or of a nonnormal distribution of the population from which the Y values were drawn. The normality test will give an indication of whether the population from which the Y values were drawn appears to be normally distributed, but will not indicate the cause(s) of the nonnormality. The smaller the sample size, the less likely the normality test will be able to detect nonnormality. If the residuals do not appear to be close to following a normal distribution, then transforming the Y variable may be a reasonable alternative.
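The text does not name a specific normality test; the Shapiro-Wilk test in scipy is one common choice, illustrated here on simulated residuals (both samples are made up for the demonstration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
normal_resid = rng.normal(scale=0.5, size=200)       # consistent with the model
skewed_resid = rng.exponential(scale=0.5, size=200)  # clearly nonnormal

for name, resid in [("normal", normal_resid), ("skewed", skewed_resid)]:
    stat, p = stats.shapiro(resid - resid.mean())    # center before testing
    print(f"{name}: W = {stat:.3f}, p = {p:.4f}")
```

As the text notes, a small p-value says only that the residuals look nonnormal; it does not say whether the cause is an outlier, skewness, or heavy tails, so the test should be read alongside the histogram, boxplot, and normal probability plot described below.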
• Histogram for residuals:
• The histogram for residuals has a reference curve for a normal distribution with the same mean and variance as the residuals. This provides a reference for detecting gross nonnormality when there are many data points.
• Boxplot for residuals:
• Suspected outliers appear in a boxplot as individual points o or x outside the box. If these appear on both sides of the box, they also suggest the possibility of a heavy-tailed distribution. If they appear on only one side, they also suggest the possibility of a skewed distribution. Skewness is also suggested if the mean (+) does not lie on or near the central line of the boxplot, or if the central line of the boxplot does not evenly divide the box. Examples of these plots will help illustrate the various situations.
• Normal probability plot for residuals:
• For data sampled from a normal distribution, the normal probability plot (normal Q-Q plot) has the points all lying on or near the straight line drawn through the middle half of the points. Scattered points lying away from the line are suspected outliers. Examples of these plots will help illustrate the various situations.
• Residuals plotted against fitted values:
• If the fitted model under the assumption of equality of variance (homoscedasticity) is correct, the plot of residuals against fitted values should suggest a horizontal band across the graph.

A wedge-shaped fan pattern like the profile of a megaphone, with a noticeable flare either to the right or to the left, suggests that the variance of the residuals increases in the direction the fan pattern widens (usually as the fitted value increases); this in turn suggests that a transformation of the Y values, or a weighted least squares linear regression, may be appropriate. Outliers may appear as anomalous points in the graph (although an outlier may not be apparent in the residuals plot if it also has high leverage, drawing the fitted function toward it). Other systematic patterns in the residuals (like a linear trend) suggest either that there is another X variable that should be considered in analyzing the data, or that a transformation of X or Y is needed.

• Residuals plotted against X:
• If the assumption of equal variances for the Y is correct, the plot of residuals against each X should suggest a horizontal band across the graph with roughly equal vertical width for all values of X. (That is, the shape of the graph should suggest a cigar lying flat, not a wedge or a megaphone.)

A wedge-shaped fan pattern like the profile of a megaphone, with a noticeable flare either to the right or to the left, suggests that the variance of the residuals increases in the direction the fan pattern widens (usually as X increases); this in turn suggests that a transformation of the X or Y values, or a weighted least squares linear regression, may be appropriate.

Points that are far from the others may be outliers in the data, or may suggest a nonnormal population distribution for Y. If an outlier is a high-leverage point, it may pull the fitted function toward it and perhaps away from the main body of the data, and may not appear as an outlier in the plot of residuals against X. Alternatively, a high-leverage point may make other points appear to be outliers by drawing the fitted function toward itself. Systematic departures from the fitted function (e.g., all the points that are high or low in X have positive residuals while the points with middling values of X have negative residuals) may indicate that a transformation of X, a different linear model, or a nonlinear model may result in a better fit.
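The wedge pattern can also be given an informal numeric companion (this check is an illustration in the spirit of the Glejser test, not a procedure from the text): correlate the absolute residuals with X. A strong positive correlation suggests the spread of the residuals grows with X. Data below are simulated with noise that deliberately grows with X:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = np.sort(rng.uniform(1.0, 10.0, size=n))
y = 1.0 + 2.0 * x + rng.normal(size=n) * 0.5 * x   # noise grows with x

A = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ coef

# Correlation between |residual| and x: clearly positive here,
# echoing the megaphone shape a residuals-vs-X plot would show.
corr = np.corrcoef(x, np.abs(resid))[0, 1]
print(round(corr, 3))
```

A plot of `resid` against `x` for these data would show the megaphone flare opening to the right; the correlation simply summarizes that visual impression in one number.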