Date: Thu, 26 Jul 2007 09:47:33 -0700
Reply-To: David L Cassell <davidlcassell@MSN.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: David L Cassell <davidlcassell@MSN.COM>
Subject: Re: Regression Skewed data!
Content-Type: text/plain; format=flowed
> It is true that normality of residuals is probably not an issue if
>you have a large sample. However, to reduce influence of outliers and
>help satisfy the assumption of homoscedasticity (consistent error
>variance is pretty important for precision estimates) you may consider
>a natural log transformation. Weighted regression is another
>alternative. There is 1 more assumption not mentioned previously, and
>that is independent errors or observations - that's what mixed models
>are for. happy trails. SH
There are several *different* components involved here.
We assume things about outliers, leverage points, heteroskedasticity, etc.
in order to get 'good' (or best) estimates of the parameters and their
variances. We don't need the normality yet, except for assessing that
'best linear unbiased estimator' stuff. So we have to get that stuff
[Actually we have to get things like multi-collinearity done first, so that
we have stable point estimates.]
When we want to do the hypothesis testing and confidence intervals, *then*
we are down to normality of residuals. And I really do want (approximate)
normality of residuals for the hypothesis tests, since we're making strong
assumptions for some of those tests. If I don't have normality, I may need
to perform the analyses using bootstrapping or randomization tests or
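
As a rough illustration of the bootstrapping option mentioned above, here is a minimal sketch of a case-resampling (pairs) bootstrap for a regression slope, using a percentile interval. Everything in it is an assumption made for illustration: the simulated data, the sample size, the skewed error distribution, and the helper names `ols_slope` and `bootstrap_slope_ci` are all hypothetical, and a percentile interval is only one of several bootstrap intervals one might choose.

```python
import numpy as np

def ols_slope(x, y):
    """Least-squares slope of y on x, with an intercept in the model."""
    X = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

def bootstrap_slope_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the slope, resampling (x, y) pairs."""
    rng = np.random.default_rng(seed)
    n = len(x)
    slopes = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # draw n cases with replacement
        slopes[b] = ols_slope(x[idx], y[idx])
    lo, hi = np.quantile(slopes, [alpha / 2, 1 - alpha / 2])
    return lo, hi

# Hypothetical data: a linear model with skewed, mean-zero errors.
rng = np.random.default_rng(1)
n = 200
x = rng.uniform(0.0, 10.0, size=n)
e = rng.exponential(1.0, size=n) - 1.0     # right-skewed, centered at zero
y = 2.0 + 0.5 * x + e

slope_hat = ols_slope(x, y)
ci_lo, ci_hi = bootstrap_slope_ci(x, y)
print(f"slope = {slope_hat:.3f}, 95% percentile interval = ({ci_lo:.3f}, {ci_hi:.3f})")
```

Resampling whole cases, rather than residuals, avoids leaning on a constant-variance assumption, which is one reason it may be a reasonable default when the residuals look badly behaved.
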
But one of the annoying problems with 'skewed' data is that a skewed Y may
not mean anything about the distribution of the residuals. It may also mean
that we have outlier problems rather than 'non-normal residuals', or there
might be contaminating distributions, or lots of other things.
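
The point that a skewed Y need not say anything about the residuals can be checked with a small simulation. The sketch below is only an illustration under assumed settings: a right-skewed predictor, normal errors, and made-up coefficients. The `skewness` helper is a hypothetical name for the usual third standardized moment.

```python
import numpy as np

def skewness(v):
    """Sample skewness: third central moment over the 1.5 power of the variance."""
    v = np.asarray(v, dtype=float)
    d = v - v.mean()
    return (d ** 3).mean() / (d ** 2).mean() ** 1.5

rng = np.random.default_rng(0)            # arbitrary seed
n = 5000
x = rng.exponential(scale=2.0, size=n)    # strongly right-skewed predictor
e = rng.normal(0.0, 1.0, size=n)          # errors are normal by construction
y = 1.0 + 3.0 * x + e                     # Y inherits its skew from x

# Ordinary least squares fit of y on x with an intercept.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

skew_y = skewness(y)
skew_resid = skewness(resid)
print(f"skewness of Y: {skew_y:.2f}")
print(f"skewness of residuals: {skew_resid:.2f}")
```

Here Y is clearly skewed while the residuals are not, so a histogram of Y alone would be misleading as a check on the error distribution.
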
I find that I *rarely* can find a transform that solves all problems
for me. People say "I learned to take logs in grad school", but if taking a
log fixes your behavior for Y, it may also mess up everything else we have to
Now if taking logs gives you a meaningful linear model, then that is
different. Subject-matter issues should take precedence. We can always
handle stat details later. But we have to have meaningful interpretations
when we are done, or what's the point of starting with the stats anyway?
David L. Cassell
3115 NW Norwood Pl.
Corvallis OR 97330