LISTSERV at the University of Georgia
Date:         Wed, 28 Sep 2005 15:35:50 -0400
Reply-To:     Richard Ristow <wrristow@mindspring.com>
Sender:       "SPSSX(r) Discussion" <SPSSX-L@LISTSERV.UGA.EDU>
From:         Richard Ristow <wrristow@mindspring.com>
Subject:      Re: data screening help
Comments: To: Kathryn Gardner <kjgardner10@hotmail.com>
In-Reply-To:  <BAY101-F136305890B2F5CB5E91478A68D0@phx.gbl>
Content-Type: text/plain; charset="iso-8859-1"; format=flowed

At 01:25 PM 9/28/2005, Kathryn Gardner wrote:

>Sorry, you're right in that I meant - among continuous variables
>screening for outliers depends on whether data are grouped. So my
>question is "I am using analyses that will involve both the use of
>grouped (ANOVA) and ungrouped (Regression) data, so in light of this,
>how should I screen my data? According to Tabachnick and Fidell,
>grouped data means screening separately within each group, while
>ungrouped means screening among all cases at once.

Certainly, grouped rather than ungrouped. Among other things, if the grouping variable is an independent variable in your analysis (as it is in an ANOVA), then looking at the individual groups is looking at the residuals. And, as I (and many others) have stated, the assumption is normality of the residuals, not of the body of the data.
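
To make the point about residuals concrete, here is a minimal Python sketch. It is not part of the original exchange: the group names and scores are invented for illustration. It assumes a one-way ANOVA, where the fitted value for each case is its group mean, so the residual is the score minus that mean. Screening within groups then amounts to screening these residuals.

```python
from statistics import mean

# Hypothetical scores for two groups; the values are made up for illustration.
scores = {
    "control": [10.0, 12.0, 11.0, 13.0],
    "treated": [20.0, 22.0, 21.0, 25.0],
}

def anova_residuals(groups):
    """Return each score minus its own group mean (the one-way ANOVA residual)."""
    residuals = {}
    for name, values in groups.items():
        group_mean = mean(values)
        residuals[name] = [v - group_mean for v in values]
    return residuals

residuals = anova_residuals(scores)
for name, res in residuals.items():
    print(name, res)
```

Pooled together, the raw scores fall into two separate clumps simply because the group means differ. The residuals are centered on zero within each group, and it is their distribution that the normality assumption concerns.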

>I am aware of the debate surrounding transforming data and deleting
>outliers, and do actually agree that variables should not be
>transformed if they are real unusual values…
>
>5)… but I thought that a normal distribution is one requirement for
>using tests such as ANOVA, MR, FA and correlation. I have also been
>following the book “Using Multivariate Statistics” by Tabachnick and
>Fidell, where the advice is to deal with outliers by transforming or
>deleting them, and to transform data to address skew and
>kurtosis. So…in a nutshell you are suggesting that it’s best not to
>transform at all.

I'm afraid I am suggesting that. Put another way: as I said in one of the postings I quoted, if you think transforming is all right, you think that your scale of measurement doesn't mean anything, as a scale. That is, you think the order of values is meaningful, but the difference of values on your numerical scale is not, since you're willing to change that to make data "work better." I would argue that, if you believe that is right, you also believe your data is of ordinal level only, and should do non-parametric analysis. That, incidentally, eliminates particular sensitivity to outliers.
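
As a small illustration of the ordinal argument, here is a Python sketch. It is not from the original posts: the numbers are invented, and the `ranks` helper is a simplified stand-in that assumes no tied values. It shows that a monotone transformation (here, the logarithm) leaves the ranks untouched, and that stretching the largest value into an extreme outlier also leaves the ranks untouched while moving the mean considerably.

```python
import math
from statistics import mean

def ranks(values):
    """Rank values from 1 (smallest) upward; assumes no ties, for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

# Hypothetical data: the second list replaces the largest value with an outlier.
original = [3.0, 5.0, 4.0, 6.0, 7.0]
with_outlier = [3.0, 5.0, 4.0, 6.0, 70.0]
logged = [math.log(v) for v in original]

same_ranks_after_log = ranks(logged) == ranks(original)
same_ranks_with_outlier = ranks(with_outlier) == ranks(original)

print(same_ranks_after_log, same_ranks_with_outlier)
print(mean(original), mean(with_outlier))
```

Rank-based (non-parametric) procedures work from the ranks alone, which is why they are indifferent both to the transformation and to how far out the extreme value sits.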

>If I don’t transform my data or delete outliers etc, this means that I
>have about 8 variables with skewness and kurtosis, univariate and
>multivariate outliers and non-linearity etc.

First, I assume you mean 8 dependent variables. I'm not aware of any distributional requirements for independent variables.

Excuse me if I'm out of my depth: could you say what a multivariate outlier is?

And when you say non-linearity, in what sense do you mean it? If you do think that a quantity affects the outcome, or is affected, non-linearly, and you have an argument for what the shape of the non-linear effect is, by all means transform accordingly. (See the example of income.)

>My data (outliers) are actually genuine unusual scores. So in light of
>our discussion so far then, my other questions are: a) if my data is
>skewed with kurtosis and outliers etc, am I best to simply leave this
>as it is? If so… b) …am I OK to perform analyses such as ANOVA, MR, FA
>and Pearson correlation on this data? c) should I not at least deal
>with “really extreme” outliers?

As a maybe naive way of looking at it, see the discussion of "cloud and outlier" distributions that was part of the last post.

In essence: what mechanism seems to be underlying a distribution that generates the majority of values within a limited range, and a small minority far outside that range?

I can't solve the problem. Given what you're seeing, one might hypothesize two underlying mechanisms: a 'normal' one that generates variation within the range where most of your observations lie, and a 'special' one that operates in only a small minority of cases, but generates very large values. You might, then, trim your outliers, and say explicitly that you're looking only for the 'normal' mechanism. But in most real situations, extreme values matter. It's good if you can get some idea under what circumstances the 'special' mechanism is invoked.
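
One way to make the "cloud and outlier" split explicit is sketched below in Python. This is an illustration only, not a method proposed in this thread: the data are invented, and the rule used (a modified z-score based on the median and the median absolute deviation, with a cutoff of 3.5) is one common convention, chosen here as an assumption. The point is that the flagged cases are set aside and reported rather than silently dropped.

```python
from statistics import median

def split_cloud_and_outliers(values, cutoff=3.5):
    """Split values into a 'cloud' and 'outliers' using a modified z-score.

    The modified z-score is 0.6745 * |x - median| / MAD, where MAD is the
    median absolute deviation. The cutoff of 3.5 is an assumed convention.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # No spread around the median: nothing can be flagged by this rule.
        return list(values), []
    cloud, outliers = [], []
    for v in values:
        score = 0.6745 * abs(v - med) / mad
        (outliers if score > cutoff else cloud).append(v)
    return cloud, outliers

# Hypothetical scores: most lie in a narrow range, one lies far outside it.
data = [4.0, 5.0, 5.5, 6.0, 4.5, 5.0, 48.0]
cloud, outliers = split_cloud_and_outliers(data)
print("cloud:", cloud)
print("outliers:", outliers)
```

If the analysis is then run on the cloud alone, the write-up should say so, and the set-aside cases are worth examining on their own for clues about when the 'special' mechanism operates.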

And, I'm afraid, that's as far as I can go with my statistical knowledge, and without subject-specific information.

Again, good luck,
Richard Ristow

