Date: Wed, 28 Sep 2005 22:08:00 -0300
Reply-To: Hector Maletta <firstname.lastname@example.org>
Sender: "SPSSX(r) Discussion" <SPSSX-L@LISTSERV.UGA.EDU>
From: Hector Maletta <email@example.com>
Subject: Re: data screening help
Content-Type: text/plain; charset="US-ASCII"
Regarding one point made by Richard Ristow, I have included a brief comment below.
> >You said that "the assumption is normality of the residuals, not of the
> >body of the data." I'm not too familiar with residuals (not yet anyway),
> >but aren't residuals usually inspected post-analysis via multiple
> >regression? If so, this suggests that I can start my main analyses now
> >and then screen for normality later. But if I did this, what would the
> >implications be of finding residuals that suggested non-normality?
To this Richard responded:
> I'm getting out of my depth here. Try, say, Hector Maletta directly.
[Thanks, Richard, for your high expectations about my depth].
> Briefly, as I wrote, the methods are mostly pretty robust
> against modest deviations from normality. I certainly
> wouldn't worry simply because the skewness or kurtosis
> statistics can be shown to be non-zero. Do worry about long
> 'tails' away from the center of the distribution - "outliers."
That's it, mostly, and well within your depth apparently. Given the
procedures to estimate regression equations, it is highly likely that for
all levels of Y the observed values of the DV are distributed above and
below the regression line, i.e. above and below the expected value of Y.
That distribution, ideally, should be normal at all values of Y. This may be
violated in many ways. As Richard said, small deviations from normality are
not very important. More severe deviations may appear, for instance, as
A) For certain values of Y, typically some range of high values, observed
values lie mostly above, or mostly below, the regression line.
B) For certain values of Y, there is a large number of observed values on one
side of the regression line, and some or very few outliers far away on the
other side.
If this happens only at the extremes of the Y distribution, i.e. for very
high or very low values, where few cases appear anyway, do not worry much.
If that happens in the middle part of the Y range, start worrying.
This has nothing to do with the distribution of Y or the distribution of the
IV by themselves. They may or may not be normally distributed.
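To make patterns A) and B) concrete, here is a minimal sketch in Python (standard library only) of one way to look for them. It is an illustration, not a standard procedure: the data are simulated, the helper names fit_line and residual_summary are made up for this example, and splitting the cases into three bins of predicted Y is an arbitrary choice. For each bin it reports the mean residual and the share of cases above the line (pattern A would show up as a share far from one half) and the largest absolute residual (pattern B would show up as a few very large residuals alongside a lopsided share).

```python
import random
import statistics


def fit_line(x, y):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def residual_summary(x, y, n_bins=3):
    """Sort cases by predicted Y, cut them into bins, summarize residuals per bin."""
    a, b = fit_line(x, y)
    # Each pair is (predicted value, residual = observed minus predicted).
    pairs = sorted((a + b * xi, yi - (a + b * xi)) for xi, yi in zip(x, y))
    size = len(pairs) // n_bins
    rows = []
    for k in range(n_bins):
        # The last bin takes any leftover cases.
        chunk = pairs[k * size:] if k == n_bins - 1 else pairs[k * size:(k + 1) * size]
        res = [r for _, r in chunk]
        rows.append({
            "mean_residual": statistics.fmean(res),
            "share_above_line": sum(r > 0 for r in res) / len(res),
            "largest_abs_residual": max(abs(r) for r in res),
        })
    return rows


# Simulated, well-behaved data (an assumption of this sketch): normal errors
# with the same spread at every level of the predictor.
rng = random.Random(0)
x = [rng.uniform(0, 10) for _ in range(600)]
y = [2.0 + 3.0 * xi + rng.gauss(0, 1.5) for xi in x]

summary = residual_summary(x, y)
for row in summary:
    print(row)
```

With well-behaved data like these, each bin should show a mean residual near zero and roughly half of the cases above the line. A bin in the middle of the range that departs clearly from this is the situation to worry about.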
Besides the requirement that residuals are [approximately] normally
distributed about the regression line, regression requires also that the
variance of the residual distribution is the same for all values of Y.
Differences in this variance across values of Y are called
heteroskedasticity (equal variances is homoskedasticity). The average residual
(in absolute value) should be approximately the same for all values of Y to
avoid heteroskedasticity. Suppose the DV is income; this means that your
average error in predicting income should be more or less the same for small
and for large incomes. In the case of incomes this is not likely when you
deal in dollars (you would err by more when the income is larger), and that
is one reason to transform income into the logarithm of income.
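The income example can be sketched with simulated data. Everything here is an assumption made for illustration: the predictor (years of education), the coefficients, and the multiplicative error are invented, and spread_ratio is a made-up helper that compares the average absolute residual in the upper half of predicted values with that in the lower half. A ratio near 1 is what equal variances would look like; a ratio well above 1 means you err by more when predicted income is larger.

```python
import math
import random
import statistics


def fit_line(x, y):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def spread_ratio(x, y):
    """Mean |residual| in the upper half of predicted values over the lower half."""
    a, b = fit_line(x, y)
    pairs = sorted((a + b * xi, abs(yi - (a + b * xi))) for xi, yi in zip(x, y))
    half = len(pairs) // 2
    low = statistics.fmean([r for _, r in pairs[:half]])
    high = statistics.fmean([r for _, r in pairs[half:]])
    return high / low


# Hypothetical data: errors are proportional to income, i.e. additive in logs.
rng = random.Random(1)
education = [rng.uniform(8, 20) for _ in range(2000)]
log_income = [9.0 + 0.1 * e + rng.gauss(0, 0.4) for e in education]
income = [math.exp(v) for v in log_income]

ratio_dollars = spread_ratio(education, income)   # model in dollars
ratio_log = spread_ratio(education, log_income)   # model in log dollars

print("dollars:", round(ratio_dollars, 2))
print("log dollars:", round(ratio_log, 2))
```

In this setup the dollar model should show clearly larger errors at high predicted incomes, while the log model should show about the same average error in both halves.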
The other reason for using log income is not statistical but economic: in
economic terms, people's behavior responds to proportional increases in
income rather than to absolute increases
(10% more income would have similar effects no matter if you earn $10,000 or
$50,000, whereas an increase of $1,000 would have different effects
depending on your income level). In other cases this is not necessarily
true. For instance, the extra energy you need in your heater or air
conditioner in order to increase room temperature by 10 degrees is
approximately the same no matter if initial temperature is 30 or 70 degrees
F. The effect on your budget of increasing temperature by 10 degrees is
(largely) independent of initial temperature, and so you should construct
your model with the absolute amount of energy, not the logarithm.
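The arithmetic behind the proportional-versus-absolute argument is short enough to check directly. This is only a sketch using the two income levels from the example above; log_change is a name made up here. On the log scale a 10% raise is the same step at any income level, while a flat $1,000 raise is a smaller step the more you earn.

```python
import math


def log_change(before, after):
    """Change in log income when income goes from `before` to `after`."""
    return math.log(after) - math.log(before)


for income in (10_000, 50_000):
    ten_percent = log_change(income, income * 1.10)   # proportional increase
    flat_raise = log_change(income, income + 1_000)   # absolute increase
    print(income, round(ten_percent, 4), round(flat_raise, 4))
```

For the heating example the opposite holds: the quantity of interest moves in equal absolute steps, so taking logs would distort it.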