Date: Wed, 28 Sep 2005 18:25:29 +0100
Reply-To: Kathryn Gardner <kjgardner10@hotmail.com>
Sender: "SPSSX(r) Discussion" <SPSSXL@LISTSERV.UGA.EDU>
From: Kathryn Gardner <kjgardner10@hotmail.com>
Subject: Re: data screening help
In-Reply-To: <5.1.0.14.2.20050928114210.05d5e3d0@pop.mindspring.com>
Content-Type: text/plain; format=flowed
Hi Richard,
Thank you for your comprehensive reply. A lot of info to digest there. I'll
address your comments in turn:
2) I phrased things badly here and got my terminology mixed up. Sorry, you're
right: I meant that, among continuous variables, screening for outliers
depends on whether the data are grouped. So my question is: "I am using
analyses that involve both grouped (ANOVA) and ungrouped (regression)
data, so in light of this, how should I screen my data?" According to
Tabachnick and Fidell, grouped data means screening separately within each
group, while ungrouped data means screening all cases at once.
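
To make the distinction concrete, here is a minimal Python sketch of the two
screening approaches described above. It is only an illustration: the function
names, the use of z-scores, and the default 3.29 cutoff (|z| at roughly
p < .001, two-tailed) are assumptions, not a procedure taken from Tabachnick
and Fidell or from SPSS.

```python
from statistics import mean, stdev


def z_scores(values):
    """Standardize values against their own mean and sample SD."""
    m, s = mean(values), stdev(values)
    if s == 0:
        # No spread at all: nothing can be unusual, so report zeros.
        return [0.0 for _ in values]
    return [(v - m) / s for v in values]


def flag_ungrouped(scores, cutoff=3.29):
    """Indices of cases with |z| > cutoff, screening all cases at once."""
    return [i for i, z in enumerate(z_scores(scores)) if abs(z) > cutoff]


def flag_grouped(scores, groups, cutoff=3.29):
    """Indices of cases with |z| > cutoff within their own group."""
    flagged = []
    for g in sorted(set(groups)):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        zs = z_scores([scores[i] for i in idx])
        flagged.extend(i for i, z in zip(idx, zs) if abs(z) > cutoff)
    return sorted(flagged)
```

A score can be unremarkable in the pooled sample yet unusual within its own
group (or the reverse), so the two functions can flag different cases for the
same data.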
I am aware of the debate surrounding transforming data and deleting
outliers, and do actually agree that variables should not be transformed if
they are real unusual values…
5)… but I thought that a normal distribution is one requirement for using
tests such as ANOVA, MR, FA and correlation. I have also been following the
book “Using Multivariate Statistics” by Tabachnick and Fidell, where the
advice is to deal with outliers by transforming or deleting them, and to
transform data to address skew and kurtosis. So…in a nutshell you are
suggesting that it’s best not to transform at all. If I don’t transform my
data or delete outliers etc, this means that I have about 8 variables with
skewness and kurtosis, univariate and multivariate outliers and
nonlinearity etc. As we’ve already discussed/established, my data
(outliers) are actually genuine unusual scores. So in light of our
discussion so far then, my other questions are:
a) if my data is skewed with kurtosis and outliers etc, am I best to simply
leave this as it is? If so…
b) …am I OK to perform analyses such as ANOVA, MR, FA and Pearson correlation
on this data?
c) should I not at least deal with “really extreme” outliers?
Thanks
Kathryn
>From: Richard Ristow <wrristow@mindspring.com>
>To: Kathryn Gardner <KJGARDNER10@HOTMAIL.COM>,SPSSXL@LISTSERV.UGA.EDU
>Subject: Re: data screening help
>Date: Wed, 28 Sep 2005 12:14:05 -0400
>
>At 10:32 AM 9/28/2005, Kathryn Gardner wrote:
>
>>>I have a number of questions relating to data screening (i.e.,
>>>outlier,normality, linearity checks) that I am hoping people can help me
>>>out with
>
>We've discussed this subject before, and there are some pretty strong
>opinions. Let me start. This has come up enough that I find I can quote a
>number of my own old postings. I think you'll hear from Hector Maletta and
>Art Kendall, as well. For clarity, I'm double-quoting what you've written,
>single-quoting any replies that are quoted from previous posts.
>
>>>2) The procedure for detecting outliers depends on whether data is
>>>continuous or categorical. If it is continuous this means data screening
>>>the sample as a whole, if categorical this means screening by group. I am
>>>using analyses that will involve both the use of continuous and
>>>categorical data, so how should I screen my data? I've been screening as
>>>a whole up until now. Besides, if I decided to screen using groups, at
>>>what point do I decide not to split the data into groups, i.e., I could
>>>split my data according to gender, age, ethnicity, education, occupation,
>>>country etc etc.
>
>First, when you say, "procedure for detecting outliers depends on whether
>data is continuous or categorical," it sounds like you don't mean quite
>what you say. Screening categorical data for outliers is rarely meaningful.
>It sounds like you mean screening for outliers in continuous variables
>within categorical groups.
>
>Second, "outlier" is now questioned as a notion. In brief, some "outliers"
>can confidently be identified as erroneous data; they should be corrected
>from the source, or treated as missing. But other "outliers" are
>legitimate, though unusual, observed values of the quantity. Removing them
>can distort your analysis altogether. Quoting myself(1),
>
>>"Outliers" get a good deal of discussion, in statistics. There are several
>>cases:
>>
>>. Values outside the range that's *a priori* possible, or far outside
>>reasonable experience. These may be assumed to be data errors, and treated
>>as missing values (correctly, as the true value is not known) if it's not
>>feasible to correct them from the original source.
>>
>>. "Cloud and outlier" distributions: Most of the observations are
>>concentrated in a range of modest size (the 'cloud'). There are a few with
>>much larger values (the 'outliers'), and it is reasonable to believe that
>>the outliers are real, unusual values. Here, a regression fit including
>>all data is reasonable, BUT: because the outliers have so much leverage,
>>you're likely to get a model that's predictive or explanatory for what
>>distinguishes 'cloud' points from the outliers, but is much less relevant
>>within the 'cloud'. In this case, you're wise to fit models with and
>>without outliers, and interpret carefully the differences between them.
>>Simply throwing away 'outliers' that can't be dismissed as errors, is
>>currently frowned upon: you may give yourself a badly distorted view of
>>your population. However, eliminating the 'outliers' because you're
>>studying a part of the range (the 'cloud') is defensible. That appears to
>>be what you're doing. It's just fine, subject to the criticism that it
>>doesn't tell you much about the extremes of the income range.
>>
>>. Distributions that cover a wide dynamic range, with cases observed over
>>the whole range. Here, you want to look for a model that covers the whole
>>range. This is where a log transform may be appropriate, if it's
>>defensible on theoretical grounds. A linear fit using the United States
>>income distribution would probably end up explaining the largest incomes,
>>and effectively lumping a lot of the range as "more or less zero, within
>>measurement error."
>
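
The advice above to fit models with and without the outliers can be sketched
as follows. This is a rough Python illustration, not anyone's actual analysis:
the helper names, the toy data, and the rule used to mark a case as an
'outlier' (x > 10) are all hypothetical.

```python
def ols_fit(xs, ys):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


def fit_with_and_without(xs, ys, is_outlier):
    """Fit once on all cases and once on the 'cloud' only, for comparison."""
    cloud = [(x, y) for x, y in zip(xs, ys) if not is_outlier(x, y)]
    cloud_x, cloud_y = zip(*cloud)
    return {"all": ols_fit(xs, ys), "cloud_only": ols_fit(cloud_x, cloud_y)}


# Toy data (hypothetical): five 'cloud' points with no trend, one far-out point.
xs = [1, 2, 3, 4, 5, 50]
ys = [2, 1, 2, 1, 2, 40]
fits = fit_with_and_without(xs, ys, is_outlier=lambda x, y: x > 10)
print("all cases:  intercept, slope =", fits["all"])
print("cloud only: intercept, slope =", fits["cloud_only"])
```

With these toy numbers the full-data slope is about 0.81, driven almost
entirely by the single high-leverage point, while the cloud-only slope is
essentially zero. Neither fit is "the right one"; the difference between them
is what needs interpreting.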
>>>3) Related to the above Q, does the idea that screening for outliers
>>>depends on whether data is continuous or categorical apply to all data
>>>screening procedures i.e., normality analyses?
>
>Again, are you talking not about screening categorical data for
>outliers, but screening continuous data within categories? If not, I'm
>missing your point badly.
>
>>>4) I have screened my data according to subscales rather than full scale
>>>scores e.g., checked the normality of each individual subscale on each
>>>questionnaire (some questionnaires don't produce full scale scores). I
>>>don't know whether this is standard practice, but to me it makes sense to
>>>screen by subscale. I do however, have a variable that does produce
>>>subscales, but I have had to use the full scale score in my data
>>>screening because I can't split it into subscales until I've factor
>>>analysed it. Is this OK?
>
>So far, so good, but
>
>>>5) I've been using logarithm & square root transformations etc to reduce
>>>skew and kurtosis, but these transformations don't appear to be effective
>>>in improving normality when there is only high or low kurtosis (i.e.,
>>>when skew is OK). Any suggestions?
>
>You'll probably hear this from a number of us (Hector Maletta has posted it
>a good many times), but briefly: It's probably unwise to do either.
>
>First, very few procedures depend on the data being normally distributed.
>Some assume that the *residuals after estimation* are normally distributed,
>but these are usually pretty forgiving. Forgiving, except that very large
>values have very large effects on the estimate. (I didn't say that outliers
>don't matter; I said that the solution isn't to trim values that are
>unusual only in being large.)
>
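
One way to see the point about residuals is a small simulation. The sketch
below is an assumption-laden illustration (simulated data, a hypothetical
linear model y = 2 + 3x + noise, hand-rolled skewness and least-squares
helpers): a strongly skewed predictor produces a strongly skewed outcome, yet
the residuals after fitting stay close to symmetric.

```python
import random


def skewness(values):
    """Population skewness: mean cubed deviation divided by SD cubed."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def ols_residuals(xs, ys):
    """Residuals from the least-squares line of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


def simulate(n=2000, seed=0):
    """Return (skewness of y, skewness of residuals) for simulated data."""
    rng = random.Random(seed)
    xs = [rng.expovariate(1.0) for _ in range(n)]           # right-skewed predictor
    ys = [2.0 + 3.0 * x + rng.gauss(0.0, 1.0) for x in xs]  # normal errors
    return skewness(ys), skewness(ols_residuals(xs, ys))


skew_y, skew_resid = simulate()
print("skewness of y:        ", round(skew_y, 2))
print("skewness of residuals:", round(skew_resid, 2))
```

Screening the raw variable here would suggest a "problem" that the model
itself does not have; it is the residuals, and any very large ones in
particular, that are worth inspecting.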
>As for nonlinear transforms of data, from myself, again(2),
>
>>I think it's doubtful policy to transform ANY of your variables to make
>>them look normally distributed.
>>
>>First, the assumptions of most methods (you say "predictive model" -
>>multiple regression?) don't include normal distribution of the variables.
>>They often assume normal distribution of the RESIDUALS, but in practice
>>aren't very sensitive to violations of this assumption (unless very large
>>residuals occur).
>>
>>If you transform a variable to make it look normal, you're throwing away
>>the scale information; you're "stretching" the scale of the variable
>>arbitrarily in different parts of its range. If you are willing to do
>>this, you're implicitly saying it's an ordinal-level variable, not
>>scale-level, and you probably should be using nonparametric, i.e.
>>ordinal, analysis methods.
>>
>>You wrote, "we only need to transform dependent variable, if it is too
>>skewed". Again, I don't think you should do this, skewed or not. What you
>>should consider is a dependent variable that shows a relatively few values
>>that are very large compared to most others. Those values, often called
>>'outliers', will have a large, perhaps dominant, effect on the resulting
>>model. Depending on the underlying mechanism, that may be the most useful
>>answer; or, it may be more useful to fit a model omitting the 'outliers'.
>>I won't start talking about the appropriate circumstances for each
>>strategy, but in neither case is transforming the variable to "normality"
>>a good idea.
>
>In general, that is, it's currently recommended to apply nonlinear
>transformations to data only when that is believed to match the actual
>effect of the values. As a common example, income is often log-transformed,
>on the belief that a $1,000 increase in income has a very different effect
>on the behaviour of a person whose income is originally $10,000, and one
>whose income is originally $100,000; but a 10% increase should have a
>similar effect in both cases.
>
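
The income example can be checked with a few lines of arithmetic. In this
sketch (the function name is illustrative, and the natural log is an
assumption; any base gives the same comparison), a change on the log scale
depends only on the ratio of new to old income, so a 10% rise looks the same
at $10,000 and at $100,000, while a flat $1,000 rise does not.

```python
import math


def log_change(income, new_income):
    """Change in natural-log income; depends only on new_income / income."""
    return math.log(new_income) - math.log(income)


# A flat $1,000 increase: very different sizes on the log scale.
print(log_change(10_000, 11_000))    # about 0.095
print(log_change(100_000, 101_000))  # about 0.010

# A 10% increase: the same size on the log scale at both income levels.
print(log_change(10_000, 11_000))    # about 0.095
print(log_change(100_000, 110_000))  # about 0.095
```

This is the sense in which the transform is chosen to match the believed
effect of the variable rather than to make its histogram look normal.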
>Whew! Am I coming down too hard? But this is a subject to approach with
>great care, and there are now strong arguments against lightly removing
>outliers, or transforming to reduce skew, kurtosis, etc.
>
>Good luck!
>Richard Ristow
>...............................
>Cited postings:
>(1) Re: lg10 versus ln, Mon, 16 May 2005 01:07:31 -0400
>(2) Re: Normalize Everything?, Thu, 14 Oct 2004 19:21:31 -0400
>
