Y may actually be best modelled by a linear
function that includes other variables in addition to the current set of X
variables, or a subset of the current set of X variables, or a subset of the
current set of X variables plus one or more new X variables. If a graph of the
residuals against the prospective X variable suggests a linear trend, then
adding the new X variable to the model may provide a better model.
A "new" X variable might be derived from one or more X variables already in
the equation, such as using the square of X1 along with X1 to handle curvature
in X1, or adding X1*X2 as a new variable to handle interaction
between X1 and X2.
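As a rough illustration of derived X variables, the sketch below (Python with NumPy, on synthetic data invented for this example) adds the square of X1 and the product X1*X2 as extra columns before fitting by least squares. The helper name fit_least_squares and the simulated coefficients are assumptions made for illustration only.

```python
import numpy as np

def fit_least_squares(columns, y):
    """Fit y = b0 + b1*c1 + ... by ordinary least squares and return the coefficients."""
    design = np.column_stack([np.ones(len(y))] + list(columns))
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

rng = np.random.default_rng(0)
x1 = rng.uniform(0, 5, 200)
x2 = rng.uniform(0, 5, 200)
# Synthetic response with curvature in x1 and an x1*x2 interaction (illustrative only).
y = 1.0 + 2.0 * x1 + 0.5 * x1**2 + 1.5 * x1 * x2 + rng.normal(0, 0.1, 200)

# Columns: x1, x2, the derived square of x1, and the derived interaction x1*x2.
coef = fit_least_squares([x1, x2, x1**2, x1 * x2], y)
print(dict(zip(["intercept", "x1", "x2", "x1^2", "x1*x2"], coef.round(3))))
```

The model remains linear in its coefficients even though it is no longer linear in X1, so the ordinary least squares machinery still applies.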
In a situation of multicollinearity,
a more useful model may actually involve removing one or more X variables,
perhaps also adding one or more new ones.
If there is a blocking
variable such that there is potentially a different linear regression within
each block, then some form of analysis
of covariance may be a better model. In situations where there are
multiple Y values measured at each combination of X values, this situation of
implicit blocking can be dealt with by using the average of the different Y
responses at each combination of X values and fitting the regression to this
reduced data set. A possible drawback to this method is that by reducing the
number of data points, the degrees of freedom associated with the residual
error are reduced, thus potentially reducing the power
of the test.
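A minimal sketch of the averaging step, assuming the data are held in NumPy arrays with one row of X per observation; the function name average_replicates and the toy data are hypothetical. It also returns the number of observations per combination, which makes the loss of data points visible.

```python
import numpy as np

def average_replicates(X, y):
    """Collapse rows sharing the same X values into one row with the mean of their Y values."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    unique_X, inverse = np.unique(X, axis=0, return_inverse=True)
    inverse = inverse.ravel()  # guard against NumPy versions that return a 2-D inverse
    counts = np.bincount(inverse)
    mean_y = np.bincount(inverse, weights=y) / counts
    return unique_X, mean_y, counts

# Toy data: two replicates at X=1 and at X=2, a single observation at X=3.
X = np.array([[1.0], [1.0], [2.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 2.0, 4.0, 5.0])
X_reduced, y_reduced, counts = average_replicates(X, y)
print(len(y), "observations reduced to", len(y_reduced))
```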
If Y is actually best modelled by a nonlinear
function of the X variables, especially if a nonlinear model is suggested
on theoretical grounds, then a nonlinear
regression using one or more X variables can be used to provide the best fit to the
X-Y data. The shape of the X-Y plot for an individual X variable may suggest
an appropriate function to use, such as a polynomial in X
or an exponential model.
Transformations
can also be used to deal with nonlinearity, but they involve changing the metric
(and possibly the normality) of X or Y. However, a nonlinear model
usually is more complex (more parameters) than a transformed linear model. If
there are many parameters to fit and not very many data points, the precision
of the fitted parameters for a more complex model may not be very good.
Transformations (a single function applied to each X data value for one or
more of the X variables, or to each Y data value) are applied to correct
problems of nonnormality
or unequal
variances, and sometimes multicollinearity.
For example, taking logarithms of sample values can reduce skewness
to the right. Transforming the Y values to remedy nonnormality often results
in correcting heteroscedasticity (unequal variances). Occasionally, both X and
Y variables are transformed.
Unless scientific theory suggests a specific transformation a
priori, transformations are usually chosen from the "power family" of
transformations, where each value is replaced by x**p, where p
is an integer or half-integer, usually one of:
-2 (reciprocal square)
-1 (reciprocal)
-0.5 (reciprocal square root)
0 (log transformation)
0.5 (square root)
1 (leaving the data untransformed)
2 (square)
For p = -0.5 (reciprocal square root), 0, or 0.5 (square root), the data
values must all be positive. To use these transformations when there are
negative and positive values, a constant can be added to all the data values
such that the smallest is greater than 0 (say, such that the smallest value is
1). (If all the data values are negative, the data can instead be multiplied
by -1, but note that in this situation, data suggesting skewness
to the right would now become data suggesting skewness to the left.) To
preserve the order of the original data in the transformed data, if the value
of p is negative, the transformed data are multiplied by -1.0; e.g., for p =
-1, the data are transformed as x --> -1.0/x. Taking logs or square roots
tends to "pull in" values greater than 1 relative to values less than 1, which
is useful in correcting skewness to the right.
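The power family described above can be sketched as follows. This is an illustrative implementation, not a standard library routine: the names power_transform and shift_to_positive are hypothetical, p = 0 is treated as the log transformation, and results for negative p are multiplied by -1 to preserve the order of the data.

```python
import numpy as np

def shift_to_positive(x):
    """Add a constant so that the smallest value becomes 1."""
    x = np.asarray(x, dtype=float)
    return x + (1.0 - x.min())

def power_transform(x, p):
    """Power-family transform: log for p == 0, sign-flipped for p < 0 to keep the order."""
    x = np.asarray(x, dtype=float)
    if p in (-0.5, 0, 0.5) and np.any(x <= 0):
        raise ValueError("p = -0.5, 0, or 0.5 requires strictly positive data")
    if p == 0:
        return np.log(x)
    transformed = x ** p
    return -transformed if p < 0 else transformed

# Right-skewed toy values containing a negative number, shifted before taking logs.
raw = np.array([-2.0, 0.0, 1.0, 3.0, 40.0])
logged = power_transform(shift_to_positive(raw), 0)
print(logged.round(3))
```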
Another common transformation is the antilogarithm (exp(x)), which has
effects similar to but more extreme than squaring: "drawing out" values
greater than 1 relative to values less than 1.
Generally speaking, transformations of X are used to correct for
non-linearity, and transformations of Y to correct for nonconstant variance of
Y or nonnormality of the error terms. A transformation of Y to correct
nonconstant variance or nonnormality of the error terms may also increase
linearity. Transforming Y may change the error distribution from normal to
nonnormal if the error distribution was normal to begin with.
A transformation of Y involves changing the metric in which the fitted
values are analyzed, which may make interpretation of the results difficult if
the transformation is complicated. If you are unfamiliar with transformations,
you may wish to consult a statistician before proceeding.
The graph of the X-Y data may suggest an appropriate transformation of an X
variable if the plot shows nonlinearity but constant error variance (that is,
the general shape of the plot is not linear, but the vertical deviation in the
data values appears constant over the range of X values).
If the X-Y plot suggests an arc from lower left to upper right so that data
points either very low or very high in X lie below the trend suggested by the
data, while the data points with middling X values lie on or above that trend,
taking square roots or logarithms of the X values may promote linearity.
If the X-Y plot suggests an arc from upper left to lower right so that data
points either very low or very high in X lie above the trend suggested by the
data, while the data points with middling X values lie on or below that trend,
taking reciprocals or reciprocals of the antilogarithms of the X values may
promote linearity.
If the X-Y plot suggests an arc from lower left to upper right so that data
points either very low or very high in X lie above the trend suggested by the
data, while the data points with middling X values lie on or below that trend,
taking squares or antilogarithms of the X values may promote linearity.
If the X-Y plot suggests an arc from upper left to lower right so that data
points either very low or very high in X lie below the trend suggested by the
data, while the data points with middling X values lie on or above that trend,
taking squares or antilogarithms of the X values may promote linearity.
The choice of a transformation of Y may be suggested by examining the plot
of residuals against fitted values. If this appears linear, but the variance
of the residuals increases as fitted Y increases, suggesting a wedge or
megaphone shape, then taking square roots, logarithms, or reciprocals of the Y
values may promote homogeneity of variance.
If the plot of residuals against fitted values is a convex arc from lower
left to upper right, and the variance of the residuals increases as fitted Y
increases, then taking square roots of the Y values may promote homogeneity of
variance.
If the plot of residuals against fitted values is a concave arc from upper
left to lower right, and the variance of the residuals decreases as fitted Y
increases, then taking logarithms of the Y values may promote homogeneity of
variance.
When a transformation of Y is indicated, a simultaneous transformation of X
variable(s) may also improve linearity of the fit with the transformed Y.
If the plot of the residuals against fitted Y suggests heteroscedasticity
(a wedge or megaphone shape instead of a featureless cloud of points), then a
weighted linear regression may provide more precise estimates for the
coefficients and intercept. The weights should be chosen to be proportional to
the reciprocal of the variance. For example, if the variance is approximately
proportional to the fitted Y, then weights inversely proportional to the
fitted Y would be appropriate--these weights could be calculated by fitting an
unweighted least squares linear regression, then using the reciprocals of the
fitted values from the unweighted least squares linear regression as the
weights for a weighted least squares linear regression. Alternatively, the
weights could be chosen empirically as the reciprocals of the original Y
values.
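The two-stage weighting procedure described above might look like the following sketch. The data are simulated so that the variance of Y is proportional to its mean; the function name weighted_least_squares is hypothetical, and the weights are applied by scaling each row by the square root of its weight, which is one common way to compute a weighted fit.

```python
import numpy as np

def weighted_least_squares(design, y, weights):
    """Minimize sum(weights * residual**2) by rescaling rows with sqrt(weights)."""
    root_w = np.sqrt(np.asarray(weights, dtype=float))
    coef, *_ = np.linalg.lstsq(design * root_w[:, None], y * root_w, rcond=None)
    return coef

rng = np.random.default_rng(1)
x = rng.uniform(1, 10, 300)
mean_y = 2.0 + 3.0 * x
# Simulated heteroscedastic data: error variance proportional to the mean of Y.
y = mean_y + rng.normal(0, 1, 300) * np.sqrt(mean_y)
design = np.column_stack([np.ones_like(x), x])

# Stage 1: unweighted fit to obtain fitted values.
ols_coef, *_ = np.linalg.lstsq(design, y, rcond=None)
fitted = design @ ols_coef
if np.any(fitted <= 0):
    raise ValueError("reciprocal-of-fitted weights need positive fitted values")

# Stage 2: refit with weights inversely proportional to the fitted values.
wls_coef = weighted_least_squares(design, y, 1.0 / fitted)
print("unweighted:", ols_coef.round(3), "weighted:", wls_coef.round(3))
```

Weights of this kind are only meaningful when the fitted values are all positive, hence the explicit check before the second stage.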
Although weighted least squares linear regression may deal with nonconstant
variance in Y, it is sensitive to outliers
just as unweighted least squares linear regression is.
One alternative method is to calculate the fit so as to minimize the sum
of the absolute values of the residuals (instead of minimizing the sum of
their squared values).
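One way to approximate a least-absolute-residuals fit is iteratively reweighted least squares, sketched below. This is only one of several possible algorithms (linear programming is another), and the function name, iteration count, and eps floor are assumptions chosen for illustration rather than tuned values.

```python
import numpy as np

def least_absolute_deviations(design, y, n_iter=50, eps=1e-8):
    """Approximate the minimizer of sum(|residual|) by iteratively reweighted least squares."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)  # start from the least squares fit
    for _ in range(n_iter):
        resid = y - design @ coef
        # A weight of 1/|residual| turns the squared-error sum into an absolute-error sum;
        # eps keeps the weight finite when a residual is (nearly) zero.
        root_w = 1.0 / np.sqrt(np.maximum(np.abs(resid), eps))
        coef, *_ = np.linalg.lstsq(design * root_w[:, None], y * root_w, rcond=None)
    return coef

# Toy data lying exactly on y = 1 + 2x, except for one gross outlier at the last point.
x = np.arange(10.0)
y = 1.0 + 2.0 * x
y[-1] += 50.0
design = np.column_stack([np.ones_like(x), x])

ols_coef, *_ = np.linalg.lstsq(design, y, rcond=None)
lad_coef = least_absolute_deviations(design, y)
print("least squares slope:", ols_coef[1].round(3), "LAD slope:", lad_coef[1].round(3))
```

In this toy case the single outlier pulls the least squares slope well away from 2, while the absolute-value fit stays close to the line through the other nine points.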
Most alternative methods to least squares involve iteration to converge to
the final fit, which can make them computationally intensive. And although
alternative methods may be more robust
or resistant
than the least squares fit to departures from normality or to outliers, they
are not necessarily immune.
Unless it involves some form of weighting or trimming values, an
alternative linear regression method will not address the problem of inequality
of variances. Any alternative method for linear regression will assume
that the Y observations are mutually independent, that the residuals have the
same variance and are centered about 0, and that the linear model is in fact
the correct one.
If the Y values do indeed come from populations with normal distributions,
with the Y variable having constant variance, and the linear model is correct,
then the least squares estimates of the coefficients are unbiased
and have the smallest variance among all unbiased estimates of the
coefficients.
A common method of dealing with apparent outliers or high-leverage or
high-influence data points in a regression situation is to remove those
observations and then refit the regression to the remaining points. If the
regression function is not substantially changed by the removal, then the fit
to the remaining points will be improved without misrepresenting the data.
However, if the outliers are due to a nonnormal distribution for the Y sample
population, or to the underlying model being nonlinear, more can be learned by
fitting a better model to the entire data (as by a nonlinear model, a linear
model with additional X variables, a model with transformed X or Y, or an
alternative method of fitting the multiple linear model) than by ignoring
valid data values. And while removing a point that has a large residual may
lead to a smaller residual variance for the new fitted linear function, it
will not necessarily lead to a greater R-square value for the new fitted
model, or to a smaller P value for the F test of overall fit.
A common method of dealing with a large number of X variables is to use a
stepwise regression routine or other mechanical method to identify the "best"
set of X variables to use. Such methods assume that you have identified all
the reasonable candidate X variables, and only need to choose among them.
One method is simply to perform all possible linear regressions, which may
be feasible if the number of candidate X variables is small. (For k X
variables, there are (2**k)-1 regressions, assuming that at least one X
variable will be used. For 4 X variables, this would be 15 possible
regressions.) The regression equations with the largest adjusted R-square
values and smallest PRESS values can then be examined further to see which seem
the most reasonable.
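An all-possible-regressions search can be sketched as below for a small number of candidate X variables. The function names and the simulated data are hypothetical. PRESS is computed here from the ordinary residuals and the leverages, as the sum of (residual / (1 - leverage))**2, which avoids refitting the model once per deleted observation.

```python
import itertools
import numpy as np

def subset_summary(X, y, subset):
    """Return (adjusted R-square, PRESS) for the regression on the given columns of X."""
    design = np.column_stack([np.ones(len(y)), X[:, list(subset)]])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    n, p = design.shape
    sse = resid @ resid
    sst = np.sum((y - y.mean()) ** 2)
    adj_r2 = 1.0 - (sse / (n - p)) / (sst / (n - 1))
    q, _ = np.linalg.qr(design)
    leverage = np.sum(q ** 2, axis=1)  # diagonal of the hat matrix
    press = np.sum((resid / (1.0 - leverage)) ** 2)
    return adj_r2, press

def all_subsets(X, y):
    """Fit every non-empty subset of the columns of X: (2**k)-1 regressions for k columns."""
    k = X.shape[1]
    results = []
    for size in range(1, k + 1):
        for subset in itertools.combinations(range(k), size):
            results.append((subset, *subset_summary(X, y, subset)))
    return results

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))
# Simulated response that truly depends only on columns 0 and 2.
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 2] + rng.normal(0, 0.5, 100)

results = all_subsets(X, y)
best = min(results, key=lambda r: r[2])  # smallest PRESS
print(len(results), "regressions; smallest PRESS for columns", best[0])
```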
If there are too many X variables to examine each possible regression, then
stepwise regression is often used. This is a mechanical method that adds,
deletes, or both adds and deletes X variables one at a time to arrive at a
"best" regression equation. At each step, the decision to add or drop an X
variable is based on a test of whether that variable will or does make a
statistically significant contribution to the model. Stepwise regression
identifies a single regression instead of several possible candidates.
The particular set of X variables suggested by a mechanical method,
especially a stepwise multiple regression, may be very dependent on the
specific data values observed for X and Y. Such models should always be
validated. First, make sure that the model makes sense theoretically, and is
comparable to any results from fitting to other data sets. Then, try the model
out on new data to see if it still holds. One validation method is to divide
the data set into two parts, using one to fit the equation, and the other to
decide whether it is a reasonable model. Validation is vital when using a
stepwise procedure.
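The split-sample validation mentioned above can be sketched as follows, assuming a random half-and-half split. The function name split_sample_validation, the split fraction, and the use of root mean squared error as the yardstick are illustrative choices, not the only reasonable ones.

```python
import numpy as np

def split_sample_validation(X, y, fit_fraction=0.5, seed=0):
    """Fit on a random part of the data and report RMSE on both parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    n_fit = int(len(y) * fit_fraction)
    fit_idx, check_idx = order[:n_fit], order[n_fit:]
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design[fit_idx], y[fit_idx], rcond=None)

    def rmse(idx):
        return float(np.sqrt(np.mean((y[idx] - design[idx] @ coef) ** 2)))

    return coef, rmse(fit_idx), rmse(check_idx)

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(0, 0.5, 200)  # simulated data

coef, rmse_fit, rmse_check = split_sample_validation(X, y)
print("RMSE on fitting half:", round(rmse_fit, 3), "on held-out half:", round(rmse_check, 3))
```

A held-out error much larger than the fitting error would suggest that the selected model depends too heavily on the particular data used to choose it.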
One simple method of dealing with multicollinearity is to add more data
observations, aiming at covering a wider range of values in the X variables.
This may or may not be feasible.
Sometimes it is clear that two or more X variables are measuring quantities
that theoretically should be closely related (such as HDL and total
cholesterol, or area and volume), or that are each closely related to a
variable that you did not or could not measure directly (many variables may be
closely related to age, for example). In such cases, a more useful model may
use only one of the group of such related X variables, so that the fitted
coefficients will be less variable. In general, the fewest possible X
variables that include the available information about Y should be included in
the model, especially if that helps make the number of data observations at
least 6 to 10 times the number of X variables. If the ratio of the total
number of coefficients (including the intercept) to the total number of data
points is greater than 0.4, it will often be difficult to fit a reliable
model.
More formal methods for dealing with multicollinearity include ridge
regression, Bayesian regression, and regression with principal components. See
Belsley et al. for more details.
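Of the formal methods listed, ridge regression is the simplest to sketch. The version below standardizes the X variables, centers Y, and adds a penalty to the diagonal of the cross-product matrix. The function name ridge_coefficients, the penalty value, and the simulated nearly collinear data are all assumptions for illustration; choosing the penalty in practice needs more care than is shown here.

```python
import numpy as np

def ridge_coefficients(X, y, penalty):
    """Ridge coefficients for standardized X and centered y (no intercept term)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    y_centered = y - y.mean()
    k = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + penalty * np.eye(k), Z.T @ y_centered)

rng = np.random.default_rng(3)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.05, size=100)  # x2 is nearly a copy of x1
X = np.column_stack([x1, x2])
y = 1.0 + x1 + x2 + rng.normal(scale=0.5, size=100)

unpenalized = ridge_coefficients(X, y, penalty=0.0)  # same as least squares on standardized X
ridge = ridge_coefficients(X, y, penalty=10.0)
print("penalty 0:", unpenalized.round(3), "penalty 10:", ridge.round(3))
```

With two nearly collinear X variables, the penalty mainly shrinks the poorly determined difference between the two coefficients, while their well-determined sum changes comparatively little.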
Multicollinearity may not be so serious a problem if the purpose of fitting
the regression equation is predicting Y in the range of the X variables,
rather than truly modeling the linear relationship between X and Y and
estimating the values of the individual coefficients.