All the following results are provided as part of a linear regression analysis.
Results for fitted line:
Results for residuals:
- Normality test for residuals: detecting violation of normality assumption
- Histogram for residuals: detecting assumption violations graphically
- Boxplot for residuals: detecting assumption violations graphically
- Normal probability plot for residuals: detecting assumption violations graphically
- Residuals plotted against fitted values: detecting incorrectness of the linear regression model
- Residuals plotted against X: detecting incorrectness of the linear regression model
- Graph of the fitted line:
- If the assumption of equal variances for Y is correct, the plot of the observed Y values against X should suggest a band across the graph with roughly equal vertical width for all values of X. (That is, the shape of the graph should suggest a tilted cigar, not a wedge or a megaphone.) A wedge-shaped fan pattern like the profile of a megaphone, with a noticeable flare either to the right or to the left as shown in the picture, suggests that the variance of the values increases in the direction in which the fan pattern widens (usually as the sample mean increases); this in turn suggests that a transformation of the Y values, or a weighted least-squares linear regression, may be appropriate.
  If the fitted linear model is correct, the fitted line should run along the general linear trend suggested by the data points. Points that are far from the fitted line may be outliers in the data, or may suggest a nonnormal population distribution for Y. If an outlier is a high-leverage point, it may pull the fitted line toward it and perhaps away from the main body of the data. Systematic departures from the fitted line (e.g., all the points that are high or low in X lie above the line while the points with middling values of X lie near or below it) may indicate that a transformation of X, a different linear model, or a nonlinear model may result in a better fit.
  The four graphs shown below all have the same fitted slope (0.5) and intercept (3), the same fitted 95% confidence bounds for the fitted Y values, the same value of r (0.816) and R-square (0.667), and the same results for the overall F test for linear fit (F(1,9) = 18; P = 0.0022). However, only the first graph shows a fitted line that provides a good fit to the data. (The data are artificial, taken from F.J. Anscombe. 1973. Graphs in Statistical Analysis. American Statistician 27: 17-21.) The plots of the fitted line with the observed values illustrate four different scenarios:
  1. The straight-line fit seems reasonable.
  2. The points seem to follow a curve, not a straight line; a straight-line fit does not appear to be appropriate for these data. A transformation may create a data set for which a straight-line fit is appropriate, or a nonlinear model may provide a better fit.
  3. The majority of the points seem to follow a straight line, but it is not the fitted line; an outlier has caused the fitted line to lie where it does not provide a good linear fit to the majority of the data points. A nonparametric or other alternative regression method may provide a better fit. The outlying data point should also have its X and Y values double-checked, in case a recording error has been made.
  4. The majority of the points lie on a vertical straight line, and only the presence of an outlier has allowed a least-squares linear regression line, albeit a poor one, to be fitted at all (a vertical line has infinite slope and cannot be fitted by least squares). Note that the fitted line passes through the one outlier, so it will not turn up as a large residual.
  These examples demonstrate the importance of examining the plot of the fit whenever a regression is done.
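The summary statistics quoted above can be reproduced from the first of Anscombe's four data sets. A minimal sketch in plain Python, using the published data values from Anscombe (1973):

```python
# Anscombe's first data set (Anscombe 1973, data set I)
xs = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ys = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
sxx = sum((x - mx) ** 2 for x in xs)
syy = sum((y - my) ** 2 for y in ys)
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))

slope = sxy / sxx                    # fitted slope: about 0.5
intercept = my - slope * mx          # fitted intercept: about 3
r = sxy / (sxx * syy) ** 0.5         # Pearson correlation: about 0.816
```

Running the same computation on each of the other three data sets produces essentially the same slope, intercept, and r, even though only the first set is well described by a straight line.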
- High-leverage points:
- A high-leverage point is one that exerts a great deal of influence on the path of the fitted line. For the fitted line, the centroid of the data (the point at the mean of X and the mean of Y) acts as a fulcrum, and the fitted line pivots toward high-leverage points, perhaps fitting the main body of the data poorly. A data point that is extreme in Y but lies near the center of the data horizontally will not have much effect on the fitted slope, but by changing the estimate of the mean of Y, it may affect the fitted estimate of the intercept. If a point has high leverage, then removing it can have a substantial effect on the estimates of the slope and intercept, and on the fitted values of Y, especially if the point also has a relatively large residual. A nonparametric or other alternative regression method may be a better choice in such a situation.
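Leverage can be quantified: in simple linear regression the leverage of point i is h_i = 1/n + (x_i - mean of X)^2 / Sxx, so points far from the mean of X have high leverage. A sketch of that formula, using made-up X values (the data and the cutoff shown are illustrative, not this package's output):

```python
xs = [1.0, 2.0, 3.0, 4.0, 10.0]   # hypothetical data; 10.0 is far from the mean of X
n = len(xs)
mx = sum(xs) / n
sxx = sum((x - mx) ** 2 for x in xs)

# leverage of each point: 1/n + (x - mean)^2 / Sxx
leverage = [1 / n + (x - mx) ** 2 / sxx for x in xs]

# the leverages always sum to 2, the number of fitted parameters;
# a common rule of thumb flags points with h > 2 * 2 / n
flagged = [x for x, h in zip(xs, leverage) if h > 4 / n]   # flags 10.0 here
```

Removing a flagged point and refitting is a quick way to see how much it alone determines the slope and intercept.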
- Test of goodness of fit:
- If the linear model is in fact the correct one, then the overall F test for fit tests the null hypothesis that the slope is 0 (i.e., that knowledge of X does not allow for better prediction of Y than knowledge of Y alone, since the fit with slope 0 is fitted Y = mean of Y). However, if the number of data points is small, or the variation in the observed X is small (perhaps because the range of the observed X is restricted), or the residual variance is large, the test may not have enough power to detect a nonzero slope, leading to a nonsignificant test result. (If the variance of Y is large enough, as in the graph below, then determining any useful model may be impossible.)
  For a simple linear regression that is not forced through the origin, the F statistic for overall fit is the square of the t statistic for the fitted slope, and the overall test for fit is equivalent to the test that the slope is significantly different from 0. A failure of the test for fit to reject the null hypothesis of zero slope may also happen when the linear model is not appropriate. Conversely, a significant test result does not necessarily mean that the linear model is the correct one, only that fitting a sloping straight line provides a better estimate of Y than using the mean of Y (i.e., a straight line with slope 0). The examples of graphs of the fitted line show how very different data sets can give the same result for the F test of overall fit, even if the straight-line model is not appropriate.
  For a simple linear regression that is not forced through the origin, the R-square statistic is equal to the square of r, the Pearson estimate of the correlation between X and Y. The R-square statistic and the correlation coefficient are descriptive measures of how strong the linear association is between X and Y, but they are not tests of goodness of fit per se.
For a simple linear regression that is not forced through the origin, R-square is also equal to the square of the fitted slope times the ratio of the variance estimate for X to the variance estimate for Y.
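These identities are easy to verify numerically. The sketch below, with made-up data, computes R-square three ways (as SSR/SST, as the square of r, and as slope squared times Var(X)/Var(Y)) and confirms that the overall F statistic is the square of the t statistic for the slope:

```python
xs = [1, 2, 3, 4, 5, 6, 7, 8]                          # hypothetical data
ys = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.2, 8.8]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
sxx = sum((x - mx) ** 2 for x in xs)
syy = sum((y - my) ** 2 for y in ys)
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))

slope = sxy / sxx
ssr = slope * sxy                       # regression sum of squares
sse = syy - ssr                         # residual sum of squares

r2_anova = ssr / syy                    # R-square as SSR / SST
r2_corr = (sxy / (sxx * syy) ** 0.5) ** 2   # square of Pearson r
r2_slope = slope ** 2 * (sxx / syy)     # slope^2 * Var(X)/Var(Y); (n-1) cancels

mse = sse / (n - 2)
F = ssr / mse                           # overall F statistic, df = (1, n-2)
t = slope / (mse / sxx) ** 0.5          # t statistic for the fitted slope
```

All three R-square values agree exactly, and F equals t squared, for any data set (as long as the regression is not forced through the origin).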
- Normality test for residuals:
- If all the assumptions for the linear regression hold, all the residuals should come from the same normal distribution with mean 0. Departures from normality can suggest the presence of outliers in the data, or of a nonnormal distribution of the population from which the Y values were drawn. The normality test will give an indication of whether the population from which the Y values were drawn appears to be normally distributed, but will not indicate the cause(s) of the nonnormality. The smaller the sample size, the less likely it is that the normality test will detect nonnormality.
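The particular normality test used is not named here; as a rough hand-computed indicator, the sample skewness and excess kurtosis of the residuals (both near 0 for normal data) can be checked in plain Python. The residual values below are made up for illustration:

```python
resid = [-1.2, -0.8, -0.5, -0.1, 0.0, 0.2, 0.4, 0.7, 0.9, 1.1]  # hypothetical residuals
n = len(resid)
m = sum(resid) / n
s = (sum((e - m) ** 2 for e in resid) / n) ** 0.5   # (population) SD of residuals

skew = sum(((e - m) / s) ** 3 for e in resid) / n
excess_kurt = sum(((e - m) / s) ** 4 for e in resid) / n - 3
# values far from 0 (say |skew| > 1, or large |excess_kurt|) hint at nonnormality,
# though with small samples these moments are themselves very noisy
```

This is only a crude screen; a formal normality test, and the graphical diagnostics below, should still be consulted.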
- Histogram for residuals:
- The histogram for residuals has a reference normal distribution curve for a normal distribution with the same mean and variance as the residuals. This provides a reference for detecting gross nonnormality when there are many data points.
- Boxplot for residuals:
- Suspected outliers appear in a boxplot as individual points (plotted as 'o' or 'x') outside the box. If these appear on both sides of the box, they also suggest the possibility of a heavy-tailed distribution. If they appear on only one side, they also suggest the possibility of a skewed distribution. Skewness is also suggested if the mean (+) does not lie on or near the central line of the boxplot, or if the central line of the boxplot does not evenly divide the box. Examples of these plots will help illustrate the various situations.
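The usual boxplot convention flags a point as a suspected outlier when it lies more than 1.5 interquartile ranges beyond the box. A sketch of that rule in plain Python; the quartiles here use the median-of-halves convention, one of several common definitions, and the data are made up:

```python
def quartiles(data):
    # Q1 and Q3 as medians of the lower and upper halves (one common convention)
    s = sorted(data)
    n = len(s)
    half = n // 2
    lower, upper = s[:half], s[half + (n % 2):]

    def med(v):
        k = len(v) // 2
        return v[k] if len(v) % 2 else (v[k - 1] + v[k]) / 2

    return med(lower), med(upper)

data = [1, 2, 3, 4, 5, 100]            # hypothetical residuals with one extreme value
q1, q3 = quartiles(data)
iqr = q3 - q1
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in data if v < low_fence or v > high_fence]   # flags 100 here
```

Points between 1.5 and 3 IQRs from the box are sometimes distinguished from those beyond 3 IQRs, depending on the plotting convention.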
- Normal probability plot for residuals:
- For data sampled from a normal distribution, the normal probability plot (normal Q-Q plot) has the points all lying on or near the straight line drawn through the middle half of the points. Scattered points lying away from the line are suspected outliers. Examples of these plots will help illustrate the various situations.
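The coordinates of a normal Q-Q plot can be computed with the Python standard library: sort the residuals and pair the i-th order statistic with the standard-normal quantile at probability (i - 0.5)/n (one common plotting-position convention; the residual values below are made up):

```python
from statistics import NormalDist

resid = [0.3, -1.1, 0.8, -0.2, 1.6, -0.7, 0.1, -0.4]   # hypothetical residuals
n = len(resid)

sample_q = sorted(resid)
# theoretical standard-normal quantiles at plotting positions (i - 0.5) / n
theory_q = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

pairs = list(zip(theory_q, sample_q))   # points to plot; near a straight line if normal
```

If the points curve away from the line at one or both ends, the tails of the residual distribution are heavier or lighter than normal.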
- Residuals plotted against fitted values:
- If the fitted model under the assumption of equality of variance (homoscedasticity) is correct, the plot of residuals against fitted values should suggest a horizontal band across the graph. A wedge-shaped fan pattern like the profile of a megaphone, with a noticeable flare either to the right or to the left as shown in the picture, suggests that the variance of the values increases in the direction in which the fan pattern widens (usually as the fitted value increases); this in turn suggests that a transformation of the Y values, or a weighted least-squares linear regression, may be appropriate.
Outliers may appear as anomalous points in the graph (although an outlier may not be apparent in the residuals plot if it also has high leverage, drawing the fitted line toward it). Other systematic patterns in the residuals (such as a linear trend) suggest either that there is another X variable that should be considered in analyzing the data, or that a transformation of X or Y is needed.
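A crude numeric companion to the visual check for the megaphone pattern is the correlation between the absolute residuals and the fitted values: a clearly positive value suggests the residual spread grows with the fitted value. A sketch with made-up values whose spread increases:

```python
fitted = [1, 2, 3, 4, 5, 6, 7, 8]                       # hypothetical fitted values
resid = [0.1, -0.1, 0.3, -0.4, 0.7, -0.8, 1.3, -1.5]    # hypothetical; spread grows

def pearson(u, v):
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5

abs_resid = [abs(e) for e in resid]
flare = pearson(fitted, abs_resid)    # clearly positive here: wedge-shaped pattern
```

A value near zero is consistent with the desired horizontal band; this rough check is no substitute for actually looking at the plot, as the Anscombe examples above demonstrate.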
- Residuals plotted against X:
- In the case of simple linear regression, the fitted values are a linear transformation of X, so the plot of the residuals against X and the plot of residuals against fitted values are identical except for scale and translation. Thus, the information provided by the plot of residuals against X is the same as that provided by residuals plotted against fitted values.