```
Date: Wed, 28 Sep 2005 22:08:00 -0300
Reply-To: Hector Maletta
Sender: "SPSSX(r) Discussion"
From: Hector Maletta
Subject: Re: data screening help
Comments: To: Richard Ristow
In-Reply-To: <5.1.0.14.2.20050928184604.060841a0@pop.mindspring.com>
Content-Type: text/plain; charset="US-ASCII"

Kathryn,
Regarding one point made by Richard Ristow I have included a brief
comment below. You wrote:

> >You said that "the assumption is normality of the residuals, not of
> >the body of the data." I'm not too familiar with residuals (not yet
> >anyway), but aren't residuals usually inspected post-analysis via
> >multiple regression? If so, this suggests that I can start my main
> >analyses now and then screen for normality later. But if I did this,
> >what would the implications be of finding residuals that suggested
> >non-normality?

To this Richard responded:

> I'm getting out of my depth here. Try, say, Hector Maletta directly.

[Thanks, Richard, for your high expectations about my depth].

> Briefly, as I wrote, the methods are mostly pretty robust against
> modest deviations from normality. I certainly wouldn't worry simply
> because the skewness or kurtosis statistics can be shown to be
> non-zero. Do worry about long 'tails' away from the center of the
> distribution - "outliers."

That's it, mostly, and well within your depth apparently.

Given the procedures to estimate regression equations, it is highly
likely that for all levels of Y the observed values of the DV are
distributed above and below the regression line, i.e. above and below
the expected value of Y. That distribution, ideally, should be normal
at all values of Y. This may be violated in many ways. As Richard said,
small deviations from normality are not very important. More severe
deviations may appear, for instance, as follows:

A) For certain values of Y, typically some range of high values,
   observed values lie mostly above, or mostly below, the regression
   line.
B) For certain values of Y, there is a large number of observed values
   on one side of the regression line, and a few outliers far away on
   the other side.

If this happens only at the extremes of the Y distribution, i.e. for
very high or very low values, where few cases appear anyway, do not
worry much. If it happens in the middle part of the Y range, start
worrying.

This has nothing to do with the distribution of Y or the distribution
of the IV by themselves. They may or may not be normally distributed.

Besides the requirement that residuals be [approximately] normally
distributed about the regression line, regression also requires that
the variance of the residuals be the same for all values of Y. Unequal
variance at different values of Y is called heteroskedasticity (equal
variance is homoskedasticity). To avoid heteroskedasticity, the average
residual (in absolute value) should be approximately the same for all
values of Y.

Suppose the DV is income; this means that your average error in
predicting income should be more or less the same for small and for
large incomes. In the case of incomes this is not likely when you deal
in dollars (you would err by more when the income is larger), and that
is one reason to transform income into the logarithm of income. The
other reason for using log income is not statistical but economic: in
economic terms the effect of income on the behavior of people depends
on proportional increases in income rather than on absolute increases
(10% more income would have similar effects whether you earn $10,000
or $50,000, whereas an increase of $1,000 would have different effects
depending on your income level).

In other cases this is not necessarily true. For instance, the extra
energy you need in your heater or air conditioner in order to increase
room temperature by 10 degrees is approximately the same whether the
initial temperature is 30 or 70 degrees F.
The effect on your budget of increasing temperature by 10 degrees is
(largely) independent of initial temperature, and so you should
construct your model with the absolute amount of energy, not the
logarithm.

Hector
```
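
The income example in the message can be illustrated with a small simulation. This is only a sketch under stated assumptions, not anything from the original thread: the data are simulated, the predictor `years`, the coefficients (9.5, 0.08) and the error spread (0.4) are invented for illustration, and `ols_fit` and `spread_ratio` are hypothetical helper names. It fits a straight line to income in dollars and to log income, then compares the average absolute residual in the upper half of the predicted values with that in the lower half. A ratio near 1 is what equal variance would look like; a ratio well above 1 suggests heteroskedasticity.

```python
import math
import random


def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def spread_ratio(x, y):
    """Mean |residual| for the upper half of predicted values,
    divided by the mean |residual| for the lower half."""
    a, b = ols_fit(x, y)
    # Pair each predicted value with its absolute residual, sorted by prediction.
    pairs = sorted((a + b * xi, abs(yi - (a + b * xi))) for xi, yi in zip(x, y))
    half = len(pairs) // 2
    low = sum(r for _, r in pairs[:half]) / half
    high = sum(r for _, r in pairs[half:]) / (len(pairs) - half)
    return high / low


# Simulated data (assumption): errors are proportional to income,
# i.e. additive and of constant spread on the log scale.
rng = random.Random(0)
years = [rng.uniform(0, 20) for _ in range(2000)]
income = [math.exp(9.5 + 0.08 * s + rng.gauss(0, 0.4)) for s in years]
log_income = [math.log(v) for v in income]

ratio_dollars = spread_ratio(years, income)
ratio_log = spread_ratio(years, log_income)
print(f"upper/lower mean |residual|, dollars:    {ratio_dollars:.2f}")
print(f"upper/lower mean |residual|, log income: {ratio_log:.2f}")
```

With data built this way, the dollar model should show clearly larger errors at high predicted incomes, while the log model should show roughly equal errors in both halves. On real data the same comparison is usually made by plotting residuals against predicted values rather than by a single ratio.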
