If the assumption of equal variances for the Y is correct, the plot of
fitted Y against each X should suggest a band across the graph with roughly
equal vertical width for all values of X. (That is, the shape of the graph
should suggest a tilted cigar and not a wedge or a megaphone.)
A wedge-shaped fan pattern like the profile of a megaphone, with a
noticeable flare either to the right or to the left as shown in the picture,
suggests that the variance in the values increases in the direction the fan
pattern widens (usually as X increases). This in turn suggests that a
transformation of the X or Y values, or a weighted least squares linear
regression, may be appropriate.
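As a rough illustration of the weighted least squares option, the sketch below uses NumPy. The function name, the simulated data, and the choice of weights (1/x**2, which assumes the standard deviation of Y grows in proportion to X) are illustrative assumptions, not part of the original discussion.

```python
import numpy as np

def weighted_least_squares(X, y, weights):
    """Fit y = b0 + b1*x1 + ... by weighted least squares.

    X: (n, k) array of predictors, without an intercept column.
    weights: length-n array, ideally proportional to 1 / Var(y_i).
    """
    design = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    sw = np.sqrt(np.asarray(weights, dtype=float))
    # Scaling each row by sqrt(w_i) turns the weighted problem into an
    # ordinary least squares problem.
    coef, *_ = np.linalg.lstsq(design * sw[:, None],
                               np.asarray(y, dtype=float) * sw, rcond=None)
    return coef

# Hypothetical data whose spread grows with x (a "megaphone" pattern).
rng = np.random.default_rng(0)
x = np.linspace(1.0, 10.0, 50)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5 * x)

# Assumption: SD of y is proportional to x, so weight each point by 1 / x**2.
coef = weighted_least_squares(x[:, None], y, 1.0 / x**2)
print(coef)  # intercept and slope estimates
```

In practice the weights have to come from knowledge of the measurement process or from an estimate of how the variance changes; a poor choice of weights can be worse than none.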
Points that are far from the others may be outliers
in the data, or may suggest a nonnormal population distribution
for Y. If an outlier is a high-leverage
point, it may pull the fitted function toward it and perhaps away from the
main body of the data, and may not appear as an outlier in the plot of fitted
Y against X. Alternatively, a high-leverage point may make other points
appear to be outliers by drawing the fitted function toward itself.
You may be able to gain additional insight from examining plots of the
observed Y values against individual X variables before you perform the
regression. The plots below illustrate four different scenarios for plots of
the observed Y against individual X:
1. A linear relationship between X and Y seems reasonable.
2. The points seem to follow a curve, not a straight line; a linear
relationship between X and Y does not appear to be appropriate for these data.
A transformation may create a data set for which a linear fit is appropriate,
or a nonlinear model may provide a better fit.
3. The majority of the points seem to follow a linear trend, but there is
an outlier
which may cause the fitted equation to lie such that it does not provide a
good fit to the majority of the data points. An alternative
regression method may provide a better fit. The outlying data point should
also have its X and Y values double-checked, in case a recording error has been
made.
4. The majority of the points lie on a vertical straight line, and
only the presence of an outlier has created any variation in X. This situation
may cause the fitted equation to go through the one outlier, so that it will
not turn up as a large residual.
These examples demonstrate the importance of examining plots of the data
whenever a regression is to be done.
A high-leverage
point is one that exerts a great deal of influence on the path of the fitted
equation. For the fitted equation, the centroid
of the data (the point at means of the Xs and mean of Y) acts as a fulcrum,
and the fitted function pivots toward high leverage points, perhaps fitting
the main body of the data poorly. A data point that is extreme in Y, but lies
near the center of the range of one of the X variables, may not have much
effect on that X variable's fitted coefficient, but by changing the estimate
of the mean of Y, it may affect the fitted estimate of the coefficients for
other X variables. If a point has high leverage, then removing it can have a
substantial effect on the estimates of the coefficients, and on the fitted
values of Y, especially if the point also has a relatively large residual. A
fitting method other than least squares may be a better choice in such a
situation.
An observation with leverage greater than 2p/n, where p is the number of
coefficients (including the intercept) and n is the number of observations, is
considered a high-leverage point, and is likely to be an outlier in its X
values. (The average value of leverage is p/n.) Other potential signs of high
leverage for an observation are a leverage value much greater than that of all
the other observations, or a leverage greater than 0.5.
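The 2p/n rule can be sketched as follows. Leverages are the diagonal entries of the hat matrix; the helper name and the small data set (nine ordinary points plus one extreme X value) are hypothetical.

```python
import numpy as np

def leverages(X):
    """Diagonal of the hat matrix H = X (X'X)^-1 X'.

    X is the n-by-p design matrix, including the intercept column.
    """
    Q, _ = np.linalg.qr(X)          # reduced QR, so H = Q Q'
    return np.sum(Q**2, axis=1)

# Hypothetical data: nine ordinary points and one with an extreme X value.
x = np.array([1.0, 2, 3, 4, 5, 6, 7, 8, 9, 30])
X = np.column_stack([np.ones_like(x), x])

h = leverages(X)
n, p = X.shape
flagged = np.flatnonzero(h > 2 * p / n)   # indices exceeding the 2p/n cutoff
print(h.round(3), flagged)
```

The leverages always sum to p, which is why their average is p/n.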
Because points with high leverage pull the fitted equation toward them,
they may have small residuals, and thus not stand out in a plot of residuals
against fitted values. A raw residual can be adjusted for the leverage for
the corresponding observation in various ways, producing internally
studentized residuals, deleted residuals, and externally
studentized residuals, also known as studentized deleted
residuals. In each case, points with high leverage will tend to have
larger adjusted residuals than raw residuals. An observation with a
studentized deleted residual greater than 2 in absolute value is likely to be
an outlier
in Y.
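A minimal sketch of the three adjusted residuals, using the standard closed-form expressions so that no refitting is needed. The function name and the data (a straight line with one point shifted upward in Y) are illustrative assumptions.

```python
import numpy as np

def studentized_residuals(X, y):
    """Return (internal, deleted, external) residuals for a least squares fit.

    X is the n-by-p design matrix, including the intercept column.
    """
    n, p = X.shape
    Q, _ = np.linalg.qr(X)             # reduced QR, so the hat matrix is Q Q'
    h = np.sum(Q**2, axis=1)           # leverages
    e = y - Q @ (Q.T @ y)              # raw residuals
    s2 = e @ e / (n - p)               # residual variance estimate
    internal = e / np.sqrt(s2 * (1 - h))
    deleted = e / (1 - h)              # y_i minus the prediction from a fit without point i
    external = internal * np.sqrt((n - p - 1) / (n - p - internal**2))
    return internal, deleted, external

# Hypothetical data: a line plus small noise, with the fifth point shifted up.
x = np.arange(1.0, 11.0)
noise = np.array([0.1, -0.2, 0.15, -0.1, 3.0, 0.05, -0.15, 0.2, -0.05, 0.1])
y = 1.0 + 2.0 * x + noise
X = np.column_stack([np.ones_like(x), x])

internal, deleted, external = studentized_residuals(X, y)
suspects = np.flatnonzero(np.abs(external) > 2)   # |studentized deleted residual| > 2
print(suspects)
```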
In cases of severe multicollinearity, it may not be possible to calculate
some of the diagnostic measures of leverage or influence.
These diagnostics also are not calculated if the fit is exact.
Besides leverage,
there are other measures of whether a single observation has a great deal of
influence on the path of the fitted equation. In general, the idea is to
compare the fit with and without that observation, and produce a measure of
how much the fit is affected by removing that point.
DFFITS measures how much the fitted value of Y for the ith point changes when
that point is removed from the data set. Large absolute values of DFFITS
(greater than 1 for smaller data sets, or greater than twice the square root of
p/n, where p is the number of coefficients including the intercept and n is the
number of data points) suggest that the corresponding data point is
influential.
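One way to compute DFFITS without refitting n times is through the leverages and the leave-one-out residual variance. The sketch below assumes NumPy; the function name and the data (the last point has an extreme X and lies well off the trend of the others) are hypothetical.

```python
import numpy as np

def dffits(X, y):
    """DFFITS for each observation; X includes the intercept column."""
    n, p = X.shape
    Q, _ = np.linalg.qr(X)
    h = np.sum(Q**2, axis=1)                      # leverages
    e = y - Q @ (Q.T @ y)                         # raw residuals
    # Residual variance with point i left out, obtained without refitting.
    s2_del = (e @ e - e**2 / (1 - h)) / (n - p - 1)
    t = e / np.sqrt(s2_del * (1 - h))             # studentized deleted residuals
    return t * np.sqrt(h / (1 - h))

# Hypothetical data: the last point is far out in X and off the trend.
x = np.array([1.0, 2, 3, 4, 5, 6, 7, 8, 9, 20])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 6.8, 8.1, 9.0, 12.0])
X = np.column_stack([np.ones_like(x), x])

d = dffits(X, y)
n, p = X.shape
influential = np.flatnonzero(np.abs(d) > 2 * np.sqrt(p / n))
print(influential)
```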
Cook's distance measures the combined influence of the ith
point on all the regression coefficients. It takes on greater values for data
points with large residuals, large leverage values, or both.
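Cook's distance can be written directly in terms of the residual and leverage of each point, which makes the "large residual, large leverage, or both" behavior visible in the formula. This is a sketch with a hypothetical function name and data set.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for each observation; X includes the intercept column."""
    n, p = X.shape
    Q, _ = np.linalg.qr(X)
    h = np.sum(Q**2, axis=1)           # leverages
    e = y - Q @ (Q.T @ y)              # raw residuals
    s2 = e @ e / (n - p)               # residual variance estimate
    # Grows with the squared residual and with h / (1 - h)^2.
    return e**2 * h / (p * s2 * (1 - h) ** 2)

# Hypothetical data: the last point is far out in X and off the trend.
x = np.array([1.0, 2, 3, 4, 5, 6, 7, 8, 9, 20])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 6.8, 8.1, 9.0, 12.0])
X = np.column_stack([np.ones_like(x), x])

D = cooks_distance(X, y)
print(D.round(3), int(np.argmax(D)))
```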
COVRATIO measures the change in the variance-covariance matrix of the
estimated coefficients with and without the ith point. It takes on greater
values for data points with large leverage values, and tends to be small when
the studentized deleted residual is large.
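A sketch of COVRATIO as the ratio of determinants of the coefficient covariance matrices without and with each point, using a closed form that avoids refitting. The function name and data are hypothetical: index 4 is an outlier in Y at moderate X, and index 9 has an extreme X but lies close to the line.

```python
import numpy as np

def covratio(X, y):
    """COVRATIO for each observation; X includes the intercept column."""
    n, p = X.shape
    Q, _ = np.linalg.qr(X)
    h = np.sum(Q**2, axis=1)                       # leverages
    e = y - Q @ (Q.T @ y)                          # raw residuals
    sse = e @ e
    s2 = sse / (n - p)                             # variance with all points
    s2_del = (sse - e**2 / (1 - h)) / (n - p - 1)  # variance with point i removed
    # det(s2_del * inv(X_(i)'X_(i))) / det(s2 * inv(X'X))
    return (s2_del / s2) ** p / (1 - h)

x = np.array([1.0, 2, 3, 4, 5, 6, 7, 8, 9, 20])
noise = np.array([0.1, -0.2, 0.15, -0.1, 3.0, 0.05, -0.15, 0.2, -0.05, 0.1])
y = 1.0 + 2.0 * x + noise
X = np.column_stack([np.ones_like(x), x])

cr = covratio(X, y)
print(cr.round(3))
```

In this example the high-leverage point that agrees with the line has a ratio above 1, while the Y outlier has a ratio well below 1.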
DFBETAS measures the influence of a data point on a particular
coefficient. A large absolute value of DFBETAS(i,j) (greater than 1 for
smaller data sets, or greater than twice the square root of 1/n, where n is the
number of data points) suggests that the ith point influences the
jth coefficient.
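The sketch below computes DFBETAS by literally refitting without each point, which mirrors the "compare the fit with and without that observation" idea. It is a brute-force illustration with a hypothetical function name and data set; statistical packages typically use closed-form shortcuts instead.

```python
import numpy as np

def dfbetas(X, y):
    """Leave-one-out DFBETAS; entry (i, j) is the scaled change in coefficient j
    when observation i is removed. X includes the intercept column."""
    n, p = X.shape
    xtx_inv_diag = np.diag(np.linalg.inv(X.T @ X))
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    out = np.empty((n, p))
    for i in range(n):
        Xi, yi = np.delete(X, i, axis=0), np.delete(y, i)
        bi, *_ = np.linalg.lstsq(Xi, yi, rcond=None)
        resid = yi - Xi @ bi
        s_del = np.sqrt(resid @ resid / (n - 1 - p))   # residual SD without point i
        out[i] = (b - bi) / (s_del * np.sqrt(xtx_inv_diag))
    return out

# Hypothetical data: the last point is far out in X and off the trend.
x = np.array([1.0, 2, 3, 4, 5, 6, 7, 8, 9, 20])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 6.0, 6.8, 8.1, 9.0, 12.0])
X = np.column_stack([np.ones_like(x), x])

table = dfbetas(X, y)
n = len(y)
flagged_rows = np.flatnonzero(np.any(np.abs(table) > 2 / np.sqrt(n), axis=1))
print(flagged_rows)
```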
If two or more influential points are near each other, then each may mask
the effect of deleting the other(s), and then none of them may have a large
value for these influence measures. You may be able to spot such clumps of
points in graphs of Y, fitted Y, or residuals vs individual X.
If the linear model is in fact the correct one, then the overall F test
for fit tests the null
hypothesis that all the coefficients for the Xs are 0 (i.e., that
knowledge of the X variables does not allow for better prediction of Y than
knowledge of Y alone, since the fit with all their coefficients equal to 0 is
fitted Y = mean of Y). However, if the number of data points is small, or the
variation in observed X is small (perhaps because the range of the observed X
is restricted), or the residual variance is large, the test may not have
enough power to detect a non-zero coefficient, leading to a nonsignificant
test result. (If the variance of Y is large enough, as in the graph below,
then determining any useful model may be impossible.)
[Figure not available: linreg_obscure.gif]
A failure of the test for fit to reject the null hypothesis of zero
coefficients may also happen when the linear model is not appropriate.
Conversely, a significant test result does not necessarily mean that the
linear model is the correct one, only that fitting a multiple linear function
provides a better estimate of Y than simply using the mean of Y.
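The effect of residual variance on the overall F test can be sketched numerically. The function below computes only the F statistic and its degrees of freedom; obtaining a p-value would require the F distribution with (p - 1, n - p) degrees of freedom from a statistics library or table. The function name, the noise pattern, and the two noise scales are illustrative assumptions.

```python
import numpy as np

def overall_f(X, y):
    """F statistic for H0: all slope coefficients are 0.

    X is the n-by-p design matrix, including the intercept column.
    """
    n, p = X.shape
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = np.sum((y - X @ b) ** 2)       # residual sum of squares
    sst = np.sum((y - y.mean()) ** 2)    # sum of squares about the mean of Y
    f_stat = ((sst - sse) / (p - 1)) / (sse / (n - p))
    return f_stat, (p - 1, n - p)

# Same underlying line, two hypothetical noise levels.
x = np.arange(1.0, 11.0)
pattern = np.array([0.5, -1.0, 0.8, -0.3, 1.2, -0.9, 0.1, -0.6, 0.7, -0.5])
X = np.column_stack([np.ones_like(x), x])
y_small = 1.0 + 0.5 * x + 0.2 * pattern   # small residual variance
y_big = 1.0 + 0.5 * x + 5.0 * pattern     # large residual variance

f_small, df = overall_f(X, y_small)
f_big, _ = overall_f(X, y_big)
print(f_small, f_big, df)
```

With the larger noise scale the F statistic drops sharply even though the true slope is unchanged, which is the loss of power described above.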
The R-square statistic and the multiple correlation coefficient are
descriptive measures of how strong the linear association is between the
observed and fitted Y values, but they are not tests of goodness of fit per
se. Other measures of fit such as the adjusted R-square and Akaike
information criterion (AIC) are designed to take into account the number of X
variables in the model. Because R-square can never decrease as new X variables
are added, the adjusted R-square or AIC may give a better idea of how the
strength of the association between the observed and fitted Y values has
changed as X variables are added to or deleted from the model. The adjusted
R-square may in fact decrease if a new X variable does not substantially
increase the amount of variation in Y explained by the X variables.
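A small sketch comparing these measures before and after adding an X variable. The AIC shown is the common Gaussian form n*log(SSE/n) + 2p, which omits an additive constant; software packages differ in the exact convention. The function name, data, and the extra predictor z are hypothetical.

```python
import math
import numpy as np

def fit_summaries(X, y):
    """Return (R-square, adjusted R-square, AIC) for a least squares fit.

    X is the n-by-p design matrix, including the intercept column.
    """
    n, p = X.shape
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = float(np.sum((y - X @ b) ** 2))
    sst = float(np.sum((y - y.mean()) ** 2))
    r2 = 1 - sse / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p)   # penalizes the number of coefficients
    aic = n * math.log(sse / n) + 2 * p         # Gaussian AIC up to a constant
    return r2, adj_r2, aic

x = np.arange(1.0, 11.0)
pattern = np.array([0.5, -1.0, 0.8, -0.3, 1.2, -0.9, 0.1, -0.6, 0.7, -0.5])
y = 1.0 + 0.5 * x + pattern
z = np.array([3.0, 1, 4, 1, 5, 9, 2, 6, 5, 3])   # hypothetical extra predictor

X_small = np.column_stack([np.ones_like(x), x])
X_big = np.column_stack([np.ones_like(x), x, z])

r2_small, adj_small, aic_small = fit_summaries(X_small, y)
r2_big, adj_big, aic_big = fit_summaries(X_big, y)
print(r2_small, r2_big)      # R-square does not decrease when z is added
print(adj_small, adj_big)    # adjusted R-square may move in either direction
print(aic_small, aic_big)    # lower AIC is preferred
```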
Variance inflation factors (VIF) measure how much the variances of
the estimated coefficients are increased over the case of no correlation among
the X variables. If no two X variables are correlated, then all the VIFs will
be 1. If the average of the VIFs is much greater than 1, or the maximum VIF is
greater than 10, then multicollinearity may be influencing the fitted
coefficients. Although large VIFs can indicate the presence of
multicollinearity, they cannot distinguish between more than one simultaneous
case of multicollinearity.
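VIFs can be sketched with NumPy using the identity that they equal the diagonal of the inverse of the correlation matrix of the X variables, which is equivalent to 1 / (1 - R_j^2) from regressing each X on the others. The function name and the three hypothetical predictors are assumptions for illustration.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF for each column of X (predictors only, no intercept column)."""
    corr = np.corrcoef(X, rowvar=False)      # correlation matrix of the predictors
    return np.diag(np.linalg.inv(corr))

# Hypothetical predictors: x2 is nearly a copy of x1; x3 is unrelated to x1.
x1 = np.array([1.0, 2, 3, 4, 5, 6, 7, 8])
x2 = x1 + 0.2 * np.array([1.0, -1, 1, -1, 1, -1, 1, -1])
x3 = np.array([1.0, -1, -1, 1, 1, -1, -1, 1])
X = np.column_stack([x1, x2, x3])

vif = variance_inflation_factors(X)
print(vif.round(1), vif.mean(), vif.max() > 10)
```

If the X variables are exactly collinear the correlation matrix is singular and the inverse does not exist, so this sketch would fail or return unreliable values in that case.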
Other informal signs of multicollinearity are:
1. Regression coefficients change drastically when adding or deleting an X
variable.
2. A regression coefficient is negative when theoretically Y should
increase with increasing values of that X variable, or the regression
coefficient is positive when theoretically Y should decrease with increasing
values of that X variable.
3. None of the individual coefficients has a significant t statistic, but
the overall F test for fit is significant.
4. A regression coefficient has a nonsignificant t statistic, even though
on theoretical grounds that X variable should provide substantial
information about Y.
5. High pairwise correlations between the X variables. (But three or more X
variables can be multicollinear together without having high pairwise
correlations.)
If all the assumptions for the multiple linear regression hold, all the residuals
should come from the same normal
distribution with mean 0. Departures from normality can suggest the
presence of outliers
in the data, or of a nonnormal distribution of the population
from which the Y values were drawn.
The normality test will give an indication of whether the population from
which the Y values were drawn appears to be normally distributed, but will not
indicate the cause(s) of the nonnormality. The smaller the sample size, the
less likely the normality test will be able to detect nonnormality.
If the residuals do not appear to be close to following a normal
distribution, then transforming
the Y variable may be a reasonable alternative.
The histogram of the residuals includes a reference curve for a normal
distribution with the same mean and variance as the residuals. This provides a
reference for detecting gross nonnormality when there are many data points.
Suspected outliers
appear in a boxplot
as individual points o or x outside the box. If these appear on
both sides of the box, they also suggest the possibility of a heavy-tailed
distribution. If they appear on only one side, they also suggest the
possibility of a skewed
distribution. Skewness is also suggested if the mean (+) does not lie
on or near the central line of the boxplot, or if the central line of the
boxplot does not evenly divide the box. Examples
of these plots will help illustrate the various situations.
For data sampled from a normal distribution, the normal
probability plot (normal Q-Q plot) has the points all lying on or near
the straight line drawn through the middle half of the points. Scattered
points lying away from the line are suspected outliers.
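The coordinates of a normal Q-Q plot can be sketched with the Python standard library alone. The plotting positions (k - 0.5)/n used here are one common convention among several, and the function name and residual values are hypothetical.

```python
from statistics import NormalDist

def normal_qq_points(residuals):
    """Pair each ordered residual with a standard normal quantile."""
    n = len(residuals)
    std_normal = NormalDist()            # mean 0, standard deviation 1
    ordered = sorted(residuals)
    # Plotting positions (k - 0.5) / n for k = 1..n; other conventions exist.
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))

# Hypothetical residuals; the last value is far from the rest.
residuals = [0.3, -1.2, 0.8, -0.4, 0.1, -0.7, 1.1, -0.2, 0.5, 4.0]
points = normal_qq_points(residuals)
for q, r in points:
    print(f"{q:7.3f} {r:6.2f}")
```

Plotting the second coordinate against the first gives the Q-Q plot; a point such as the largest residual here would sit well above a line drawn through the middle half of the points.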
A wedge-shaped fan pattern like the profile of a megaphone, with a
noticeable flare either to the right or to the left as shown in the picture,
suggests that the variance in the values increases in the direction the fan
pattern widens (usually as the fitted value increases). This in turn
suggests that a transformation of the Y values, or a weighted
least squares linear regression, may be appropriate.
Outliers may appear as anomalous points in the graph (although an outlier may
not be apparent in the residuals plot if it also has high leverage,
drawing the fitted function toward it).
Other systematic patterns in the residuals (like a linear trend) suggest
either that there is another X variable that should be considered in analyzing
the data, or that a transformation of X or Y is needed.
If the assumption of equal variances for the Y is correct, the plot of
residuals against each X should suggest a band across the graph with roughly
equal vertical width for all values of X. (That is, the shape of the graph
should suggest a tilted cigar and not a wedge or a megaphone.)
A wedge-shaped fan pattern like the profile of a megaphone, with a
noticeable flare either to the right or to the left as shown in the picture,
suggests that the variance in the values increases in the direction the fan
pattern widens (usually as X increases). This in turn
suggests that a transformation of the X or Y values, or a weighted
least squares linear regression, may be appropriate.
Points that are far from the others may be outliers
in the data, or may suggest a nonnormal population distribution
for Y. If an outlier is a high-leverage
point, it may pull the fitted function toward it and perhaps away from the
main body of the data, and may not appear as an outlier in the plot of
residuals against X. Alternatively, a high-leverage point may make
other points appear to be outliers by drawing the fitted function
toward itself.
Systematic departures from the fitted function (e.g., all the points that
are high or low in X have positive residuals while the points with middling
values of X have negative residuals) may indicate that a transformation
of X, a different
linear model, or a nonlinear
model may result in a better fit.