Does your data violate linear regression assumptions?
If the X or Y populations
from which data to be analyzed by linear regression were sampled
violate one or more of the linear regression assumptions, the results of the
analysis may be incorrect or misleading. For example, if the assumption of independence
is violated, then linear regression is not appropriate. If the assumption of normality
is violated, or outliers are present, then the linear regression goodness of fit
test may not be the most powerful or informative test available, and this could
mean the difference between detecting a linear fit or not. A nonparametric,
robust, or resistant regression method, a transformation, a weighted least
squares linear regression, or a nonlinear model may result in a better fit. If
the population variance for Y is not constant, a weighted least squares linear
regression or a transformation
of Y may provide a means of fitting a regression adjusted for the inequality of
the variances. Often, the impact of an assumption violation on the linear
regression result depends on the extent of the violation (such as how
inconstant the variance of Y is, or how skewed the Y population distribution
is). Some small violations may have little practical effect on the analysis,
while other violations may render the linear regression result uselessly
incorrect or uninterpretable.
Apparent lack of independence
in the fitted Y values may be caused by the existence of an implicit X
variable in the data, an X variable that was not explicitly used in the linear
model. In this case, the best model may still be linear, but may not include
the original X variable. If there is a linear trend in the plot of the
regression residuals against the fitted values, then an implicit X variable
may be the cause. A plot of the residuals against the prospective new X
variable should reveal whether there is a systematic variation; if there is,
you may consider adding the new X variable to the linear model.
If an implicit X variable is not included in the fitted model, the fitted
estimates for the slope and intercept may be biased,
and not very meaningful, and the fitted Y values may not be accurate.
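As a rough illustration of the residual check described above, the following Python sketch fits a line to made-up data in which Y also depends on a second variable z that was left out of the model, and then correlates the residuals with z. The data, the variable z, and the helper names fit_line and correlation are assumptions made for this example only.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

def correlation(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    abar, bbar = mean(a), mean(b)
    sab = sum((ai - abar) * (bi - bbar) for ai, bi in zip(a, b))
    saa = sum((ai - abar) ** 2 for ai in a)
    sbb = sum((bi - bbar) ** 2 for bi in b)
    return sab / (saa * sbb) ** 0.5

# Made-up data: y depends on x and on a second variable z left out of the model.
x = [1, 2, 3, 4, 5, 6, 7, 8]
z = [3, 1, 4, 1, 5, 9, 2, 6]
y = [2.0 + 0.5 * xi + 1.5 * zi for xi, zi in zip(x, z)]

intercept, slope = fit_line(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# A strong correlation between the residuals and z suggests that z carries
# systematic variation the one-variable model has missed.
r_resid_z = correlation(residuals, z)
print(round(r_resid_z, 3))
```

In practice a plot of the residuals against z is more informative than a single correlation, since it can also reveal a nonlinear pattern.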
Another possible cause of apparent dependence between the Y observations is
the presence of an implicit block
effect. (The block effect can be considered another type of implicit X
variable, albeit a discrete one.) If a blocking variable is suspected, an analysis
of covariance can be performed, essentially dividing the data into
different regression lines based on the value of the blocking variable. If the
analysis of covariance shows a significant difference between the slopes in
the regression lines, there is evidence that the linear relationship between X
and Y varies with the value of the blocking factor.
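The descriptive first step of such an analysis can be sketched in Python: fit one line to all the data, then a separate line within each block, and compare the slopes. This is only an illustration on made-up data with hypothetical blocks "A" and "B"; it does not perform the significance test that a full analysis of covariance would provide.

```python
from collections import defaultdict
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

# Hypothetical (x, y, block) observations; the blocks "A" and "B" are made up.
data = [
    (1, 3.1, "A"), (2, 4.9, "A"), (3, 7.2, "A"), (4, 8.8, "A"), (5, 11.0, "A"),
    (1, 1.6, "B"), (2, 1.9, "B"), (3, 2.6, "B"), (4, 3.1, "B"), (5, 3.4, "B"),
]

# Fit one line to all the data, ignoring the blocking variable.
pooled_intercept, pooled_slope = fit_line(
    [d[0] for d in data], [d[1] for d in data]
)

# Fit a separate line within each block.
by_block = defaultdict(lambda: ([], []))
for xi, yi, block in data:
    by_block[block][0].append(xi)
    by_block[block][1].append(yi)
block_fits = {b: fit_line(xs, ys) for b, (xs, ys) in by_block.items()}

print("pooled slope:", round(pooled_slope, 3))
for b, (intercept, slope) in sorted(block_fits.items()):
    print(b, "intercept:", round(intercept, 3), "slope:", round(slope, 3))
```

Here the pooled slope falls between the two within-block slopes and describes neither block well.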
If multiple values of Y are collected at the same values of X, this can act
as another type of blocking, with the unique values of X acting as blocks.
These multiple Y measurements may be less variable than the overall variation
in Y, and, given their common value of X, they are not truly independent
of each other. If there are many replicated X values, and if the variation
between Y at replicated values is much smaller than the overall residual
variance, then the estimated variance of the slope may be too small,
making the test of whether the slope is 0 (and, equivalently, the test of the
goodness of linear fit) anticonservative (more likely than the stated
significance level to reject the null hypothesis, even when it is true). In
this case, an alternative method is
to replace each replicated X value by a single data point with the average Y
value, and then perform the regression analysis with the new data set. A
possible drawback to this method is that by reducing the number of data
points, the degrees of freedom associated with the residual error are reduced,
thus potentially reducing the power of the test.
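The averaging approach described above might look like the following Python sketch. The data are made up, with three replicate Y measurements at each X value, and fit_line is a helper written for this example.

```python
from collections import defaultdict
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

# Made-up data with three replicate Y measurements at each X value.
x = [1, 1, 1, 2, 2, 2, 3, 3, 3]
y = [2.1, 1.9, 2.3, 3.8, 4.1, 4.0, 6.2, 5.9, 6.1]

# Collapse each replicated X value to a single point with the average Y.
groups = defaultdict(list)
for xi, yi in zip(x, y):
    groups[xi].append(yi)
x_avg = sorted(groups)
y_avg = [mean(groups[xi]) for xi in x_avg]

full_fit = fit_line(x, y)
avg_fit = fit_line(x_avg, y_avg)

# Residual degrees of freedom are n - 2 for a line with slope and intercept.
df_full = len(x) - 2
df_avg = len(x_avg) - 2
print(full_fit, df_full)
print(avg_fit, df_avg)
```

With equal numbers of replicates at every X value, as here, the fitted line itself does not change; what changes is the residual degrees of freedom (7 versus 1 in this example), which is the loss of power mentioned above.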
Whether the Y values are independent
of each other is generally determined by the structure of the experiment from
which they arise. Y values collected over time may be serially correlated
(here time is the implicit factor). If the data are in a particular order,
consider the possibility of dependence. (If the row order of the data reflects
the order in which the data were collected, an index plot of the data [data
value plotted against row number] can reveal patterns that suggest possible
time effects.) For serially
correlated Y values, the estimates of the slope and intercept will be unbiased,
but the estimates of their variances will not be reliable.
If you are unsure whether your Y values are independent, you may wish to
consult a statistician or someone who is knowledgeable about the data
collection scheme you are using.
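One numeric complement to an index plot, not mentioned above but widely used, is the Durbin-Watson statistic computed from the residuals in collection order. The Python sketch below applies it to made-up data with a slow drift over time; the data and the helper fit_line are assumptions for illustration, and the statistic is only a rough screen for lag-1 serial correlation, not a substitute for knowing how the data were collected.

```python
import math
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

# Made-up time-ordered data: a linear trend in x plus a slow drift over time.
x = list(range(20))
y = [1.0 + 0.5 * t + 2.0 * math.sin(t / 3.0) for t in x]

intercept, slope = fit_line(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Durbin-Watson statistic: values near 2 suggest no lag-1 serial correlation,
# values well below 2 suggest positive serial correlation.
dw = sum((residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals)))
dw /= sum(e ** 2 for e in residuals)
print(round(dw, 3))
```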
Values may not be identically distributed because of the presence of outliers.
Outliers are anomalous values in the data. Outliers may have a strong
influence over the fitted slope and intercept, giving a poor fit to the bulk
of the data points. Outliers tend to increase the estimate of residual
variance, lowering the chance of rejecting the null
hypothesis. They may be due to recording errors, which may be correctable,
or they may be due to the Y values not all being sampled from the same
population. Apparent outliers may also be due to the Y values being from the
same, but nonnormal, population. Outliers may show up clearly in an X-Y
scatterplot of the data, as
points that do not lie near the general linear trend of the data. A point may
be an unusual value in either X or Y without necessarily being an outlier in
the scatterplot.
Once the regression line has been fitted, the boxplot and normal probability
plot (normal Q-Q plot) for residuals may suggest the presence of outliers in
the data. After the fit, outliers are usually detected by examining the
residuals or the high-leverage points.
The method of least squares involves minimizing the sum of the squared
vertical distances between each data point and the fitted line. Because of
this, the fitted line can be highly sensitive to outliers.
(In other words, least squares regression is not resistant
to outliers, and thus, neither is the fitted slope estimate.) A point
vertically removed from the other points can cause the fitted line to pass
close to it, instead of following the general linear trend of the rest of the
data, especially if the point is relatively far horizontally from the centroid
of the data (the point represented by the mean of X and the mean of Y). Such
points are said to have high leverage:
the centroid acts as a fulcrum, and the fitted line pivots toward
high-leverage points, perhaps fitting the main body of the data poorly. A data
point that is extreme in Y but lies near the center of the data horizontally
will not have much effect on the fitted slope, but by changing the estimate of
the mean of Y, it may affect the fitted estimate of the intercept. A nonparametric
or other alternative regression method may be a better method in such a
situation. If you find outliers in your data that are not due to correctable
errors, you may wish to consult a statistician as to how to proceed.
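The pivoting effect described above can be checked numerically. For a straight-line fit with an intercept, the leverage of point i is 1/n + (x_i - mean of X)^2 / (sum of squared deviations of X). The Python sketch below uses made-up data lying exactly on a line, then shifts one Y value at a time; the data and helper names are assumptions for this example.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

def leverages(x):
    """Leverage of each point in a straight-line fit with an intercept."""
    xbar = mean(x)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return [1 / len(x) + (xi - xbar) ** 2 / sxx for xi in x]

# Made-up data lying exactly on y = 2x; the last point is far out in X.
x = [1, 2, 3, 4, 5, 15]
y = [2.0 * xi for xi in x]
h = leverages(x)

# Shift Y by +10 at the far point (x = 15) and, separately, at x = 5,
# which happens to equal the mean of X.
y_far = y[:-1] + [y[-1] + 10]
y_mid = y[:4] + [y[4] + 10] + y[5:]

base_fit = fit_line(x, y)
far_fit = fit_line(x, y_far)
mid_fit = fit_line(x, y_mid)

print([round(hi, 3) for hi in h])
print("base:", base_fit)
print("shift at x=15:", far_fit)
print("shift at x=5: ", mid_fit)
```

The same shift in Y changes the slope when applied at the high-leverage point, but changes only the intercept when applied at the mean of X.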
The values in a sample may indeed be from the same population, but not
from a normal one. Signs of nonnormality are skewness (lack of symmetry) or
light-tailedness or heavy-tailedness. The boxplot, histogram, and normal
probability plot (normal Q-Q plot), along with the normality test, can
provide information on the normality of the population distribution. However,
if there are only a small number of data points, nonnormality can be hard to
detect. If there are a great many data points, the normality test may detect
statistically significant but trivial departures from normality that will have
no real effect on the linear regression's tests (since, for example, the
distribution of the t statistic for the test of the slope converges to the
standard normal distribution as the number of data points grows, by the
central limit theorem).
For data from a normal distribution, normal probability plots should
approximate straight lines, and boxplots should be symmetric (median and mean
together, in the middle of the box) with no outliers.
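Alongside the plots, skewness and tail weight can be summarized numerically. The Python sketch below computes moment-based sample skewness and excess kurtosis, two common summaries that are an addition here and not a formal normality test. It would normally be applied to the regression residuals; the two short lists are made-up values used only to show the calculation.

```python
from statistics import mean

def skewness_and_excess_kurtosis(values):
    """Moment-based sample skewness and excess kurtosis.

    Both are near 0 for large samples from a normal distribution.
    Positive skewness indicates a longer right tail; positive excess
    kurtosis indicates heavier tails than the normal.
    """
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    m4 = sum((v - m) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]
print(skewness_and_excess_kurtosis(symmetric))
print(skewness_and_excess_kurtosis(right_skewed))
```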
Except for substantial nonnormality that leads to outliers
in the X-Y data, if the number of data points is not too small, then the
linear regression statistic will not be much affected even if the population
distributions are skewed.
Unless the sample size is small (fewer than 10 data points), light-tailedness
or heavy-tailedness will have little effect on the linear regression.
Robust
statistical tests operate well across a wide variety of distributions. A test
can be robust for validity, meaning that it provides P values close to the
true ones in the presence of (slight) departures from its assumptions. It may
also be robust for efficiency, meaning that it maintains its statistical power
(the probability that a true violation of the null
hypothesis will be detected by the test) in the presence of those
departures. Linear regression is fairly robust for validity against
nonnormality, but it may not be the most powerful test available for a given
nonnormal
distribution, although it is the most powerful
test available when its assumptions are met. In the case of nonnormality,
a nonparametric regression method or a transformation of X may result in a
more powerful test.
If the variance of Y is not constant, then the error variance will
not be constant. The most common form of such heteroscedasticity in Y is that
the variance of Y increases as the mean of Y increases, for data with positive
X and Y.
Unless the heteroscedasticity of the Y is pronounced, its effect will not
be severe: the least squares estimates will still be unbiased,
and the estimates of the slope and intercept will either be normally
distributed if the errors are normally
distributed, or at least normally distributed asymptotically (as the
number of data points becomes large) if the errors are not normally
distributed. The estimates of the variances of the slope and intercept will be
inaccurate, but the inaccuracy is not likely to be substantial if the X values
are symmetric about their mean.
Heteroscedasticity of Y is usually detected informally by examining
the X-Y scatterplot of the data before performing the regression. If both
nonlinearity and unequal variances are present, employing a transformation
of Y may have the effect of simultaneously improving the linearity and
promoting equality of the variances. Otherwise, a weighted
least squares linear regression may be the preferred method of dealing
with nonconstant variance of Y.
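A weighted least squares fit of a straight line can be sketched in a few lines of Python. The weights should be proportional to the reciprocal of each point's variance; in this made-up example the standard deviation of Y is assumed to be proportional to x, so the weights are 1/x^2. That variance model, the data, and the helper weighted_fit are assumptions for illustration, and the right weights for real data depend on how the variance actually behaves.

```python
def weighted_fit(x, y, w):
    """Weighted least squares fit of y = intercept + slope * x."""
    sw = sum(w)
    xw = sum(wi * xi for wi, xi in zip(w, x)) / sw  # weighted mean of x
    yw = sum(wi * yi for wi, yi in zip(w, y)) / sw  # weighted mean of y
    sxx = sum(wi * (xi - xw) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xw) * (yi - yw) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return yw - slope * xw, slope

# Made-up data whose scatter grows with x.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.3, 7.6, 10.9, 11.0, 15.2, 14.1]

# Assumption for this example: the standard deviation of Y is proportional
# to x, so each point is weighted by 1 / x**2.
weights = [1 / xi ** 2 for xi in x]

ols_fit = weighted_fit(x, y, [1.0] * len(x))  # equal weights reduce to OLS
wls_fit = weighted_fit(x, y, weights)
print("unweighted:", ols_fit)
print("weighted:  ", wls_fit)
```

The weighted fit gives the tightly scattered points at small x more say in where the line goes than the noisier points at large x.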
If the linear model is not the correct one for the data, then the slope
and intercept estimates and the fitted values from the linear regression will
be biased,
and the fitted slope and intercept estimates will not be meaningful. Over a
restricted range of X or Y, nonlinear models may be well approximated by
linear models (this is in fact the basis of linear interpolation), but for
accurate prediction a model appropriate to the data should be selected. An examination
of the X-Y scatterplot may reveal whether the linear model is appropriate.
If there is a great deal of variation in Y, it may be difficult to decide what
the appropriate model is; in this case, the linear model may do as well as any
other, and has the virtue of simplicity.
The usual linear regression model assumes that the observed X values
are fixed, not random. If the X values are not under the control of the
experimenter (i.e., are observed but not set), and if they are subject to
random error that has the same variance for every observation, the
linear model is called the errors-in-variables model or the
structural model. The least squares fit will still give the best linear
predictor of Y, but the estimates of the slope and intercept will be biased
(will not have expected values equal to the true slope and intercept). In
particular, the slope estimate is shrunk toward 0 by a factor of approximately
(variance of the true X)/(variance of the true X + variance of the error in X).
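The bias can be seen in a small simulation. Under the classical measurement-error assumptions (error in X independent of the true X and of the error in Y), the fitted slope shrinks toward 0 by roughly var(true X) / (var(true X) + var(error in X)). The Python sketch below uses simulated data; the variances, sample size, and seed are arbitrary choices for illustration.

```python
import random
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

random.seed(0)
n = 20000
true_slope = 2.0

# Assumed setup: the true X has variance 1, and the recorded X adds
# independent measurement error that also has variance 1.
x_true = [random.gauss(0.0, 1.0) for _ in range(n)]
x_obs = [xi + random.gauss(0.0, 1.0) for xi in x_true]
y = [1.0 + true_slope * xi + random.gauss(0.0, 0.5) for xi in x_true]

_, slope_true_x = fit_line(x_true, y)
_, slope_obs_x = fit_line(x_obs, y)

# Classical attenuation factor: var(true X) / (var(true X) + var(error in X)).
attenuation = 1.0 / (1.0 + 1.0)
print(round(slope_true_x, 3), round(slope_obs_x, 3), true_slope * attenuation)
```

Regressing on the error-free X recovers a slope near 2, while regressing on the recorded X gives a slope near 2 times the attenuation factor.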
If the assumption of the linear model is correct, the plot of the observed
Y values against X should suggest a linear band across the graph with no
obvious departures from linearity. Outliers
may appear as anomalous points in the graph, often in the upper righthand or
lower lefthand corner of the graph. (A point may be an outlier in either X or
Y without necessarily being far from the general trend of the data.)
If the linear model is not correct, the shape of the general trend of the
X-Y plot may suggest the appropriate function to fit (e.g., a polynomial,
exponential, or logistic function). Alternatively, the plot may suggest a
reasonable transformation
to apply. For example, if the X-Y plot arcs from lower left to upper right so
that data points either very low or very high in X lie below the straight line
suggested by the data, while the data points with middling X values lie on or
above that straight line, taking square roots or logarithms of the X values
may promote linearity.
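The effect of such a transformation can be illustrated with made-up data that follow y = 3 + 2 ln(x) exactly, which produces the arc described above. The Python sketch compares the squared correlation of Y with X and with log X; the data and the helper r_squared are assumptions for this example.

```python
import math
from statistics import mean

def r_squared(x, y):
    """Squared correlation between x and y (R^2 of a straight-line fit)."""
    xbar, ybar = mean(x), mean(y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    return sxy ** 2 / (sxx * syy)

# Made-up data that arc from lower left to upper right: y = 3 + 2 ln(x).
x = [1, 2, 4, 8, 16, 32, 64]
y = [3.0 + 2.0 * math.log(xi) for xi in x]

r2_raw = r_squared(x, y)
r2_log = r_squared([math.log(xi) for xi in x], y)
print(round(r2_raw, 3), round(r2_log, 3))
```

With real, noisy data the improvement will be less clean, and the plot of Y against the transformed X should still be inspected.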
If the assumption of equal variances for the Y is correct, the plot of the
observed Y values against X should suggest a band across the graph with
roughly equal vertical width for all values of X. (That is, the shape of the
graph should suggest a tilted cigar and not a wedge or a megaphone.)
A fan pattern like the profile of a megaphone, with a noticeable flare
either to the right or to the left, suggests that the variance in the values
increases in the direction the fan pattern widens (usually as the sample mean
increases), and this in turn suggests that a transformation of the Y values
may be needed.
If the number of data points is small, assumption violations such as
nonnormality or heteroscedasticity are difficult to detect even when they are
present. With a small number of data points, linear regression also offers
less protection against violation of assumptions. With few data points, it may
be hard to determine
how well the fitted line matches the data, or whether a nonlinear function
would be more appropriate.
Even if none of the test assumptions are violated, a linear regression on a
small number of data points may not have sufficient power
to detect a significant difference between the slope and 0, even if the slope
is non-zero. The power depends on the residual error, the observed variation
in X, the selected significance (alpha-) level of the test, and the number of
data points. Power decreases as the residual variance increases, decreases as
the significance level is decreased (i.e., as the test is made more
stringent), increases as the variation in observed X increases, and increases
as the number of data points increases. If a statistical significance test
with a small number of data values produces a surprisingly non-significant P
value, then lack of power may be the reason. The best time to avoid such
problems is in the design stage of an experiment, when appropriate minimum
sample sizes can be determined, perhaps in consultation with a statistician,
before data collection begins.
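The dependence of power on these quantities follows from the form of the usual test statistic for the slope. In the notation introduced here (b_0 and b_1 for the fitted intercept and slope, s^2 for the residual variance estimate, n for the number of data points), the statistic is compared with a t distribution on n - 2 degrees of freedom:

```latex
t = \frac{b_1}{\operatorname{SE}(b_1)}, \qquad
\operatorname{SE}(b_1) = \frac{s}{\sqrt{S_{xx}}}, \qquad
S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad
s^2 = \frac{1}{n-2} \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2
```

A larger residual variance s^2 inflates the standard error and shrinks t, while more data points or a wider spread of X values increase S_{xx} and make a given nonzero slope easier to detect.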
The effects of nonconstant variance of Y can be particularly severe for a
linear regression when the line is forced through the origin: the estimate of
variance for the fitted slope may be much smaller than the actual variance,
making the test for the slope anticonservative (more likely to reject the null
hypothesis that the slope is 0 than the stated significance level indicates).
In general, unless there is a structural or theoretical reason to assume
that the intercept is 0, it's preferable to fit both the slope and intercept.
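To see what forcing the line through the origin can do when the true intercept is not 0, the Python sketch below fits both models to made-up data lying exactly on y = 5 + x. The no-intercept least squares slope is sum(x*y) / sum(x^2); the data and helper names are assumptions for this example.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope

def fit_through_origin(x, y):
    """Least squares slope for the no-intercept model y = slope * x."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi ** 2 for xi in x)

# Made-up data lying exactly on y = 5 + x, so the true intercept is not 0.
x = [1, 2, 3, 4, 5]
y = [5.0 + xi for xi in x]

intercept, slope = fit_line(x, y)
slope_origin = fit_through_origin(x, y)
print(intercept, slope, round(slope_origin, 3))
```

The two-parameter fit recovers the line exactly, while the through-origin fit compensates for the missing intercept with a slope more than twice the true value.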