If the assumption of equal variances for the Y is correct, the plot of the
observed Y values against X should suggest a band across the graph with
roughly equal vertical width for all values of X. (That is, the shape of the
graph should suggest a tilted cigar and not a wedge or a megaphone.)
A wedge-shaped fan pattern like the profile of a megaphone, with a
noticeable flare either to the right or to the left as shown in the picture,
suggests that the variance in the values increases in the direction the fan
pattern widens (usually as the mean of Y increases). This in turn suggests
that a transformation of the Y values, or a weighted least squares linear
regression, may be appropriate.
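As a rough sketch of the weighted least squares option, the following Python
computes a weighted straight-line fit by hand. The function name, the example
data, and the choice of weights 1/X^2 (which assumes the standard deviation of
Y grows in proportion to X) are illustrative assumptions, not part of the
original discussion.

```python
def weighted_line_fit(x, y, w):
    """Return (intercept, slope) minimizing sum(w * (y - a - b*x)**2)."""
    sw = sum(w)
    xw = sum(wi * xi for wi, xi in zip(w, x)) / sw  # weighted mean of X
    yw = sum(wi * yi for wi, yi in zip(w, y)) / sw  # weighted mean of Y
    sxy = sum(wi * (xi - xw) * (yi - yw) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xw) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return yw - slope * xw, slope

# Hypothetical data whose scatter grows with X (a megaphone opening to the right).
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1 + 2 * xi + 0.3 * xi * (-1) ** i for i, xi in enumerate(x)]

# Assumed weights: variance of Y proportional to X**2, so weight = 1 / X**2.
weights = [1 / xi ** 2 for xi in x]

ols = weighted_line_fit(x, y, [1.0] * len(x))  # equal weights = ordinary least squares
wls = weighted_line_fit(x, y, weights)
print("OLS intercept, slope:", ols)
print("WLS intercept, slope:", wls)
```

With equal weights the function reduces to ordinary least squares, so the two
printed fits can be compared directly. In practice the weights would come from
knowledge of how the variance of Y changes, which this sketch simply assumes.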
If the fitted linear model is correct, the fitted line should run along the
general linear trend suggested by the data points. Points that are far from
the fitted line may be outliers
in the data, or may suggest a nonnormal population distribution
for Y. If an outlier is a high-leverage
point, it may pull the fitted line toward it and perhaps away from the
main body of the data.
Systematic departures from the fitted line (e.g., all the points that are
high or low in X lie above the line while the points with middling values of X
lie near or below it) may indicate that a transformation
of X, a different
linear model, or a nonlinear
model may result in a better fit.
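A systematic departure of this kind can be seen in the signs of the residuals
when they are listed in order of increasing X. The sketch below fits a
straight line to hypothetical curved data (Y equal to X squared, chosen only
for illustration) and prints the sign of each residual.

```python
def fit_line(x, y):
    """Ordinary least-squares intercept and slope for a straight line."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

# Hypothetical curved data: Y is exactly X squared.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [xi ** 2 for xi in x]

intercept, slope = fit_line(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Sign of each residual, in order of increasing X.
signs = "".join("+" if r > 0 else "-" for r in residuals)
print(signs)  # prints ++----++
```

The points at both ends of the X range lie above the fitted line and the
middle points lie below it, which is the pattern that points toward a
transformation or a curved model.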
The four graphs shown below all have the same
fitted slope (0.5) and intercept (3), the same fitted 95% confidence bounds
for the fitted Y values, the same value of r (0.816) and R-square (0.667), and
the same results for the overall F test for linear fit (F(1,9) = 18; P =
0.0022). However, only the first graph shows a fitted line that provides a
good fit to the data. (The data are artificial, taken from F.J. Anscombe.
1973. Graphs in Statistical Analysis. American Statistician 27: 17-21.)
The plots of the fitted line with the observed values illustrate four
different scenarios:
1. The straight-line fit seems reasonable.
2. The points seem to follow a curve, not a straight line; a straight-line
fit does not appear to be appropriate for these data. A transformation
may create a data set for which a straight-line fit is appropriate, or a nonlinear
model may provide a better fit.
3. The majority of the points seem to follow a straight line, but it's not
the fitted line; an outlier has pulled the fitted line away so that it does
not provide a good linear fit to the majority of the data points. A
nonparametric or other alternative regression method may provide a better
fit. The outlying data point should also have its X and Y values
double-checked, in case a recording error has been made.
4. The majority of the points lie on a vertical straight line, and
only the presence of an outlier has allowed a least-squares linear regression
line, albeit a poor one, to be fitted at all (a vertical line has infinite
slope, and cannot be fitted by least squares). Note that the fitted line goes
through the one outlier, so that point will not show up with a large residual.
These examples demonstrate the importance of examining the plot of the fit
whenever a regression is done.
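The shared summary statistics quoted for the four data sets can be checked
directly. The sketch below recomputes the intercept, slope, r, and overall F
statistic for each set; the data values are those published in Anscombe
(1973), while the helper name `summarize` and the layout are illustrative.

```python
from math import sqrt

# Anscombe (1973) data; sets 1-3 share the same X values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
datasets = {
    1: (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarize(x, y):
    """Return (intercept, slope, r, F) for a least-squares straight-line fit."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    r = sxy / sqrt(sxx * syy)
    ss_regression = slope * sxy
    ss_residual = syy - ss_regression
    f_stat = ss_regression / (ss_residual / (n - 2))  # F on (1, n - 2) df
    return intercept, slope, r, f_stat

results = {k: summarize(x, y) for k, (x, y) in datasets.items()}
for k, (a, b, r, f) in results.items():
    print(f"set {k}: intercept={a:.2f} slope={b:.2f} r={r:.3f} F={f:.1f}")
```

All four sets print essentially the same line, which is why the numbers alone
cannot distinguish the four scenarios listed above.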
A high-leverage point is one that exerts a great deal of influence on the
path of the fitted line. For the fitted line, the centroid of the data (the
point at the mean of X and the mean of Y) acts as a fulcrum, and the
fitted line pivots toward high-leverage points, perhaps fitting the main body
of the data poorly. A data point that is extreme in Y but lies near the center
of the data horizontally will not have much effect on the fitted slope, but by
changing the estimate of the mean of Y, it may affect the fitted estimate of
the intercept. If a point has high leverage, then removing it can have a
substantial effect on the estimates of the slope and intercept, and on the
fitted values of Y, especially if the point also has a relatively large
residual. A nonparametric
or other alternative regression method may be a better method in such a
situation.
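Leverage can be put on a numerical scale. A minimal sketch, assuming the
standard leverage formula for a straight-line fit with an intercept,
h = 1/n + (x - mean of X)^2 / Sxx, applied to the X values of Anscombe's
fourth data set (the function name is illustrative):

```python
def leverages(x):
    """Leverage of each point in a simple linear regression with intercept:
    h_i = 1/n + (x_i - mean(x))**2 / sum((x_j - mean(x))**2)."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return [1 / n + (xi - xbar) ** 2 / sxx for xi in x]

# X values of Anscombe's fourth data set: ten points at X = 8, one at X = 19.
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
h = leverages(x4)
for xi, hi in zip(x4, h):
    print(f"x = {xi:2d}  leverage = {hi:.3f}")
```

The point at X = 19 has leverage 1, the largest possible value, so the fitted
line is forced through it; each of the other points has leverage 0.1. The
leverages add up to 2, the number of fitted parameters.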
If the linear model is in fact the correct one, then the overall F test
for fit tests the null
hypothesis that the slope is 0 (i.e., that knowledge of X does not allow
for better prediction of Y than knowledge of Y alone, since the fit with slope
0 is fitted Y = mean of Y). However, if the number of data points is small, or
the variation in observed X is small (perhaps because the range of the
observed X is restricted), or the residual variance is large, the test may not
have enough power to detect a non-zero slope, leading to a nonsignificant test
result. (If the variance of Y is large enough, as in the graph below, then
determining any useful model may be impossible.)
For a simple linear regression that is not forced through the
origin, the F statistic for overall fit is the square of the t statistic for
the fitted slope, and the overall test for fit is equivalent to the test that
the slope is significantly different from 0.
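The equivalence of the two statistics can be confirmed numerically. The sketch
below uses Anscombe's first data set as an illustrative choice and computes
the overall F statistic and the slope t statistic separately; the variable
names are assumptions made for this example.

```python
from math import sqrt

# Anscombe's first data set (illustrative choice).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

slope = sxy / sxx
ss_regression = slope * sxy          # sum of squares explained by the line
ss_residual = syy - ss_regression    # sum of squared residuals
ms_residual = ss_residual / (n - 2)  # residual variance estimate

f_stat = ss_regression / ms_residual      # overall F on (1, n - 2) df
t_stat = slope / sqrt(ms_residual / sxx)  # t statistic for H0: slope = 0

print(f"F = {f_stat:.2f}, t = {t_stat:.3f}, t^2 = {t_stat ** 2:.2f}")
```

The printed F and t^2 agree, apart from floating-point rounding.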
A failure of the test for fit to reject the null hypothesis of zero slope
may also happen when the linear model is not appropriate. Conversely, a
significant test result does not necessarily mean that the linear model is the
correct one, only that fitting a sloping straight line provides a better
estimate of Y than using the mean of Y (i.e., a straight line with slope 0).
The examples
of graphs of the fitted line show how very different data sets can give
the same result for the F test of overall fit, even if the straight-line model
is not appropriate.
For a simple linear regression that is not forced through the
origin, the R-square statistic is equal to the square of r, the Pearson
estimate of correlation between X and Y. The R-square statistic and the
correlation coefficient are descriptive measures of how strong the linear
association is between X and Y, but they are not tests of goodness of fit
per se. For a simple linear regression that is not forced
through the origin, R-square is also equal to the square of the fitted slope
times the ratio of the variance estimate for X to the variance estimate for Y.
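The relationship between R-square, the fitted slope, and the two variance
estimates follows in one line from the usual definitions. In the notation
assumed here, S_xx and S_yy are the sums of squared deviations of X and Y
about their means, S_xy is the sum of cross-products, b is the fitted slope,
and s_X^2 and s_Y^2 are the sample variances.

```latex
b = \frac{S_{xy}}{S_{xx}}, \qquad
r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}
\quad\Longrightarrow\quad
R^2 = r^2 = \frac{S_{xy}^2}{S_{xx}\,S_{yy}}
    = b^2\,\frac{S_{xx}}{S_{yy}}
    = b^2\,\frac{s_X^2}{s_Y^2}
```

For Anscombe's first data set, with a slope of about 0.5, a variance of X of
11, and a variance of Y of about 4.13, this gives roughly 0.25 x 11 / 4.13,
or about 0.67, in line with the R-square quoted for those data.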
If all the assumptions for the linear regression hold, all the residuals
should come from the same normal
distribution with mean 0. Departures from normality can suggest the
presence of outliers
in the data, or of a nonnormal distribution of the population
from which the Y values were drawn.
The normality test will give an indication of whether the population from
which the Y values were drawn appears to be normally distributed, but will not
indicate the cause(s) of the nonnormality. The smaller the sample size, the
less likely the normality test will be able to detect nonnormality.
The histogram
for residuals
has a reference normal
distribution curve for a normal distribution with the same mean and
variance as the residuals. This provides a reference for detecting gross
nonnormality when there are many data points.
Suspected outliers
appear in a boxplot
as individual points o or x outside the box. If these appear on
both sides of the box, they also suggest the possibility of a heavy-tailed
distribution. If they appear on only one side, they also suggest the
possibility of a skewed
distribution. Skewness is also suggested if the mean (+) does not lie
on or near the central line of the boxplot, or if the central line of the
boxplot does not evenly divide the box. Examples
of these plots will help illustrate the various situations.
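Which points a boxplot marks individually depends on the rule the software
uses. A minimal sketch, assuming the common convention of flagging values
more than 1.5 times the interquartile range beyond the quartiles (the rule,
the quartile method, and the example residuals are all assumptions here):

```python
from statistics import quantiles

def boxplot_outliers(values, k=1.5):
    """Return values beyond k * IQR from the quartiles (assumed fence rule)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical residuals with one suspiciously large value.
residuals = [-1.2, -0.8, -0.5, -0.1, 0.0, 0.2, 0.4, 0.9, 1.1, 6.0]
print(boxplot_outliers(residuals))  # prints [6.0]
```

A single flagged point on one side, as here, is the situation described above
as suggesting either an outlier or a skewed distribution.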
For data sampled from a normal distribution, the normal probability plot
(normal Q-Q plot) has the points all lying on or near the straight line
drawn through the middle half of the points. Scattered points lying away
from the line are suspected outliers.
Examples
of these plots will help illustrate the various situations.
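The coordinates of a normal Q-Q plot are straightforward to compute. The
sketch below pairs the sorted values with standard normal quantiles and uses
the correlation between the two as a rough measure of how straight the plot
is. The plotting positions (i + 0.5) / n, the use of a correlation as a
summary, and the example values are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist

def normal_qq_points(values):
    """Pair sorted values with standard normal quantiles.

    Plotting positions (i + 0.5) / n are an assumed convention.
    """
    n = len(values)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, sorted(values)))

def qq_correlation(values):
    """Correlation between theoretical and observed quantiles (near 1 = straight)."""
    pairs = normal_qq_points(values)
    n = len(pairs)
    tbar = sum(t for t, _ in pairs) / n
    vbar = sum(v for _, v in pairs) / n
    stt = sum((t - tbar) ** 2 for t, _ in pairs)
    svv = sum((v - vbar) ** 2 for _, v in pairs)
    stv = sum((t - tbar) * (v - vbar) for t, v in pairs)
    return stv / sqrt(stt * svv)

# Hypothetical residuals, with and without one extreme value.
clean = [-1.2, -0.8, -0.5, -0.1, 0.0, 0.2, 0.4, 0.9, 1.1, 1.6]
with_outlier = clean[:-1] + [6.0]

print(round(qq_correlation(clean), 3))
print(round(qq_correlation(with_outlier), 3))
```

The sample containing the extreme value gives a noticeably lower correlation,
reflecting the point that would sit well away from the line on the plot.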
A wedge-shaped fan pattern like the profile of a megaphone, with a
noticeable flare either to the right or to the left as shown in the picture,
suggests that the variance in the values increases in the direction the fan
pattern widens (usually as the fitted value increases). This in turn
suggests that a transformation of the Y values, or a weighted least squares
linear regression, may be appropriate.
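An informal numerical companion to the residual plot is to compare the spread
of the residuals at low and high fitted values. The sketch below is a rough
check under assumed, hypothetical data, not a formal test of equal variances;
the function name is illustrative.

```python
from statistics import pstdev

def spread_ratio(fitted, residuals):
    """Ratio of residual spread in the upper half of fitted values to the lower half."""
    ordered = [r for _, r in sorted(zip(fitted, residuals))]
    half = len(ordered) // 2
    lower, upper = ordered[:half], ordered[-half:]
    return pstdev(upper) / pstdev(lower)

# Hypothetical fitted values and residuals whose size grows with the fitted value.
fitted = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
fanning = [0.1 * f * (-1) ** i for i, f in enumerate(fitted)]

# Hypothetical residuals with the same spread everywhere.
steady = [0.5 * (-1) ** i for i in range(len(fitted))]

print(round(spread_ratio(fitted, fanning), 2))
print(round(spread_ratio(fitted, steady), 2))
```

A ratio well above or below 1 corresponds to the megaphone shape; a ratio near
1 corresponds to a band of roughly constant width.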
Outliers
may appear as anomalous points in the graph (although an outlier may not be
apparent in the residuals plot if it also has high leverage,
drawing the fitted line toward it).
Other systematic patterns in the residuals (like a linear trend) suggest
either that there is another X variable that should be considered in analyzing
the data, or that a transformation of X or Y is needed.
In the case of simple linear regression, the fitted values are a linear
transformation of X, so the plot of the residuals against X and the plot of
residuals against fitted values are identical except for scale and
translation. Thus, the information provided by the plot of residuals against X
is the same as that provided by residuals plotted against fitted
values.