Date: Wed, 28 Sep 2005 15:35:50 -0400
Reply-To: Richard Ristow <wrristow@mindspring.com>
Sender: "SPSSX(r) Discussion" <SPSSX-L@LISTSERV.UGA.EDU>
From: Richard Ristow <wrristow@mindspring.com>
Subject: Re: data screening help
In-Reply-To: <BAY101-F136305890B2F5CB5E91478A68D0@phx.gbl>
At 01:25 PM 9/28/2005, Kathryn Gardner wrote:
>Sorry, you're right in that I meant - among continuous variables
>screening for outliers depends on whether data are grouped. So my
>question is "I am using analyses that will involve both the use of
>grouped (ANOVA) and ungrouped (Regression) data, so in light of this,
>how should I screen my data?" According to Tabachnick and Fidell,
>grouped data means screening separately within each group, while
>ungrouped means screening among all cases at once.
Certainly, grouped rather than ungrouped. Among other things, if the
grouping variable is an independent variable in your analysis (as in an
ANOVA), then looking within the individual groups means you're looking
at the residuals. And, as I (and many others) have stated, the
assumption is normality of the residuals, not of the body of the data.
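To make the point about residuals concrete, here is a minimal Python
sketch. The group names and values are made up for illustration and are
not from this thread. In a one-way ANOVA the residual for a case is its
value minus its own group's mean, so screening within groups is the same
as screening the residuals, while the pooled data mostly reflect the gap
between the groups.

```python
import statistics

# Hypothetical data (assumption): one outcome measured in two groups
# whose means differ by about 10 units.
groups = {
    "control":   [9.8, 10.1, 10.4, 9.6, 10.0, 10.3, 9.9, 9.9],
    "treatment": [19.7, 20.2, 20.5, 19.6, 20.0, 20.1, 19.8, 20.1],
}

def residuals_by_group(groups):
    """Subtract each group's own mean from its values.

    For a one-way ANOVA these differences are the model residuals.
    """
    out = {}
    for name, values in groups.items():
        mean = statistics.mean(values)
        out[name] = [v - mean for v in values]
    return out

# All cases lumped together, ignoring the grouping variable.
pooled = [v for values in groups.values() for v in values]

# All residuals, i.e. each case measured from its own group mean.
resid = [r for rs in residuals_by_group(groups).values() for r in rs]

# The pooled spread is dominated by the difference between group means;
# the residual spread is what the normality assumption is about.
print("SD of pooled data:", round(statistics.stdev(pooled), 2))
print("SD of residuals:  ", round(statistics.stdev(resid), 2))
```

A case that looks extreme in the pooled data may be entirely ordinary
within its own group, and the reverse can also happen.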
>I am aware of the debate surrounding transforming data and deleting
>outliers, and do actually agree that variables should not be
>transformed if they are real unusual values…
>
>5)… but I thought that a normal distribution is one requirement for
>using tests such as ANOVA, MR, FA and correlation. I have also been
>following the book “Using Multivariate Statistics” by Tabachnick and
>Fidell, where the advice is to deal with outliers by transforming or
>deleting them, and to transform data to address skew and
>kurtosis. So…in a nutshell you are suggesting that it’s best not to
>transform at all.
I'm afraid I am suggesting that. Put another way: as I said in one of the
postings I quoted, if you think transforming is all right, you think
that your scale of measurement doesn't mean anything, as a scale. That
is, you think the order of values is meaningful, but the difference of
values on your numerical scale is not, since you're willing to change
that to make data "work better." I would argue that, if you believe
that is right, you also believe your data is of ordinal level only, and
should do non-parametric analysis. That, incidentally, eliminates
particular sensitivity to outliers.
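A small sketch of that argument, in plain Python with made-up numbers
(the data and the choice of a log transformation are assumptions for
illustration only). A monotone transformation keeps the order of values
but changes their spacing, so a statistic that uses the spacing (Pearson's
r) changes, while one that uses only the order (Spearman's rank
correlation) does not.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical data (assumption): the last y value is a large outlier.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 2.9, 4.8, 4.2, 6.1, 7.3, 7.9, 9.8, 9.2, 60.0]
log_y = [math.log(v) for v in y]  # monotone: order kept, spacing changed

print("Pearson,  raw y:", round(pearson(x, y), 3))
print("Pearson,  log y:", round(pearson(x, log_y), 3))
print("Spearman, raw y:", round(spearman(x, y), 3))
print("Spearman, log y:", round(spearman(x, log_y), 3))
```

The rank-based result is the same before and after transforming, and the
outlier counts only as "the largest value", however large it is.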
>If I don’t transform my data or delete outliers etc, this means that I
>have about 8 variables with skewness and kurtosis, univariate and
>multivariate outliers and non-linearity etc.
First, I assume you mean 8 dependent variables. I'm not aware of any
distributional requirements for independent variables.
Excuse me if I'm out of my depth: could you say what a multivariate
outlier is?
And when you say non-linearity, in what sense do you mean it? If you do
think that a quantity affects the outcome, or is affected by it,
non-linearly, and you have an argument for what the shape of that
non-linear effect is, by all means transform accordingly. (See the
example of income.)
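As a hypothetical illustration of transforming for a substantive reason:
suppose you have grounds to believe an outcome changes by a fixed amount
each time income doubles. The figures and the doubling rule below are
assumptions invented for this sketch; they are not taken from the earlier
income example. Under that belief the relationship is linear in
log(income), so the log is the scale your argument calls for.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data (assumption): the outcome rises by exactly 1.5 units
# every time income doubles, with no noise, to keep the sketch simple.
income = [10_000, 20_000, 40_000, 80_000, 160_000, 320_000]
outcome = [2.0 + 1.5 * math.log2(v / 10_000) for v in income]

# Transformation chosen from the assumed mechanism, not from the
# shape of the observed distribution.
log_income = [math.log(v) for v in income]

r_raw = pearson(income, outcome)
r_log = pearson(log_income, outcome)

print(f"r with raw income: {r_raw:.3f}")
print(f"r with log income: {r_log:.3f}")
```

The difference from transforming to "fix" skew is the direction of the
reasoning: here the transformation follows from a stated belief about the
mechanism and would be chosen before looking at the distribution.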
>My data (outliers) are actually genuine unusual scores. So in light of
>our discussion so far then, my other questions are: a) if my data is
>skewed with kurtosis and outliers etc, am I best to simply leave this
>as it is? If so… b) …am I OK to perform analyses such as ANOVA, MR, FA
>and Pearson correlation on this data? c) should I not at least deal
>with “really extreme” outliers?
As a maybe naive way of looking at it, see the discussion of "cloud and
outlier" distributions that was part of the last post.
In essence: what mechanism seems to be underlying a distribution that
generates the majority of values within a limited range, and a small
minority far outside that range?
I can't solve the problem. Given what you're seeing, one might
hypothesize two underlying mechanisms: a 'normal' one that generates
variation within the range where most of your observations lie, and a
'special' one that operates in only a small minority of cases, but
generates very large values. You might, then, trim your outliers, and
say explicitly that you're looking only for the 'normal' mechanism. But
in most real situations, extreme values matter. It's good if you can get
some idea of the circumstances under which the 'special' mechanism is
invoked.
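One way to act on the two-mechanism idea, sketched in plain Python. The
data, the cutoff rule (distance from the median in units of the median
absolute deviation), and the multiplier of 5 are all assumptions chosen
for illustration; nothing here is a recommended threshold. The point is
to set the extreme cases aside explicitly and keep them for inspection,
not to drop them silently.

```python
import statistics

def split_cloud_and_special(values, k=5.0):
    """Split values into a main 'cloud' and a 'special' set of extremes.

    A value is called special if it lies more than k median absolute
    deviations from the median. Both k and the rule are assumptions.
    """
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    cloud = [v for v in values if abs(v - med) <= k * mad]
    special = [v for v in values if abs(v - med) > k * mad]
    return cloud, special

# Hypothetical scores (assumption): most lie in a narrow range,
# two lie far outside it.
scores = [12, 14, 13, 15, 11, 14, 13, 12, 95, 13, 14, 110]

cloud, special = split_cloud_and_special(scores)

print("Mean of all cases:  ", round(statistics.mean(scores), 2))
print("Mean of cloud only: ", round(statistics.mean(cloud), 2))
print("Cases set aside:    ", special)
```

If you analyze the cloud alone, say so explicitly, and report the cases
set aside, since they are the ones that may tell you when the 'special'
mechanism comes into play.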
And, I'm afraid, that's as far as I can go with my statistical
knowledge, and without subject-specific information.
Again, good luck,
Richard Ristow