Date: Wed, 28 Sep 2005 12:14:05 -0400
Reply-To: Richard Ristow <email@example.com>
Sender: "SPSSX(r) Discussion" <SPSSX-L@LISTSERV.UGA.EDU>
From: Richard Ristow <firstname.lastname@example.org>
Subject: Re: data screening help
At 10:32 AM 9/28/2005, Kathryn Gardner wrote:
>>I have a number of questions relating to data screening (i.e.,
>>outlier, normality, linearity checks) that I am hoping people can help
>>me out with
We've discussed this subject before, and there are some pretty strong
opinions. Let me start. This has come up enough that I find I can quote
a number of my own old postings. I think you'll hear from Hector
Maletta and Art Kendall, as well. For clarity, I'm double-quoting what
you've written, single-quoting any replies that are quoted from old
postings.
>>2) The procedure for detecting outliers depends on whether data is
>>continuous or categorical. If it is continuous this means data
>>screening the sample as a whole, if categorical this means screening
>>by group. I am using analyses that will involve both the use of
>>continuous and categorical data, so how should I screen my data? I've
>>been screening as a whole up until now. Besides, if I decided to
>>screen using groups, at what point do I decide not to split the data
>>into groups i.e., I could split my data according to gender, age,
>>ethnicity, education, occupation, country etc etc.
First, when you say, "procedure for detecting outliers depends on
whether data is continuous or categorical," it sounds like you don't
mean quite what you say. Screening categorical data for outliers is
rarely meaningful. It sounds like you mean screening for outliers in
continuous variables within categorical groups.
Second, "outlier" is now questioned as a notion. In brief, some
"outliers" can confidently be identified as erroneous data; they should
be corrected from the source, or treated as missing. But other
"outliers" are legitimate, though unusual, observed values of the
quantity. Removing them can distort your analysis altogether. Quoting
myself(1),
>"Outliers" get a good deal of discussion, in statistics. There are
>several kinds:
>. Values outside the range that's *a priori* possible, or far outside
>reasonable experience. These may be assumed to be data errors, and
>treated as missing values (correctly, as the true value is not known)
>if it's not feasible to correct them from the original source.
>. "Cloud and outlier" distributions: Most of the observations are
>concentrated in a range of modest size (the 'cloud'). There are a few
>with much larger values (the 'outliers'), and it is reasonable to
>believe that the outliers are real, unusual values. Here, a regression
>fit including all data is reasonable, BUT: because the outliers have
>so much leverage, you're likely to get a model that's predictive or
>explanatory for what distinguishes 'cloud' points from the outliers,
>but is much less relevant within the 'cloud'. In this case, you're
>wise to fit models with and without outliers, and interpret carefully
>the differences between them. Simply throwing away 'outliers' that
>can't be dismissed as errors, is currently frowned upon: you may give
>yourself a badly distorted view of your population. However,
>eliminating the 'outliers' because you're studying a part of the range
>(the 'cloud') is defensible. That appears to be what you're doing. It's
>just fine, subject to the criticism that it doesn't tell you much
>about the extremes of the income range.
>. Distributions that cover a wide dynamic range, with cases observed
>over the whole range. Here, you want to look for a model that covers
>the whole range. This is where a log transform may be appropriate, if
>it's defensible on theoretical grounds. A linear fit using the United
>States income distribution would probably end up explaining the
>largest incomes, and effectively lumping a lot of the range as "more
>or less zero, within measurement error."
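To make the "fit with and without outliers" advice concrete, here is a minimal sketch in Python. The data are invented purely for illustration (they come from no real study), and fit_line is a hypothetical helper that does an ordinary least-squares fit of a straight line. The point is only that a couple of large, legitimate values can set the sign of the slope, while the fit within the 'cloud' says something quite different.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Invented data: a 'cloud' with a slight downward trend,
# plus two much larger values that are assumed to be real.
cloud_x = [1, 2, 3, 4, 5, 6, 7, 8]
cloud_y = [5.0, 4.8, 5.1, 4.7, 4.9, 4.6, 4.8, 4.5]
outlier_x = [40, 50]
outlier_y = [30.0, 38.0]

slope_all, _ = fit_line(cloud_x + outlier_x, cloud_y + outlier_y)
slope_cloud, _ = fit_line(cloud_x, cloud_y)

print("slope, all cases:  ", round(slope_all, 3))
print("slope, cloud only: ", round(slope_cloud, 3))
```

Neither slope is "the right one". The all-cases fit mostly describes what separates the large values from the cloud; the cloud-only fit describes the cloud and says nothing about the extremes. Reporting both, and interpreting the difference, is the point.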
>>3) Related to the above Q, does the idea that screening for outliers
>>depends on whether data is continuous or categorical apply to all
>>data screening procedures i.e., normality analyses?
Again, are you talking not about screening categorical data for
outliers, but screening continuous data within categories? If not, I'm
missing your point badly.
>>4) I have screened my data according to subscales rather than full
>>scale scores e.g., checked the normality of each individual subscale
>>on each questionnaire (some questionnaires don't produce full scale
>>scores). I don't know whether this is standard practice, but to me it
>>makes sense to screen by subscale. I do however, have a variable that
>>does produce subscales, but I have had to use the full scale score in
>>my data screening because I can't split it into subscales until I've
>>factor analysed it. Is this OK?
So far, so good, but
>>5) I've been using logarithm & square root transformations etc to
>>reduce skew and kurtosis, but these transformations don't appear to
>>be effective in improving normality when there is only high or low
>>kurtosis (i.e., when skew is OK). Any suggestions?
You'll probably hear this from a number of us (Hector Maletta has
posted it a good many times), but briefly: It's probably unwise to do
First, very few procedures depend on the data being normally
distributed. Some assume that the *residuals after estimation* are
normally distributed, but these are usually pretty forgiving.
Forgiving, except that very large values have very large effects on the
estimate. (I didn't say that outliers don't matter; I said that the
solution isn't to trim values that are unusual only in being large.)
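A small simulation may make the distinction between the variables and the residuals clearer. This is only a sketch under assumptions of my choosing: the predictor is drawn from an exponential distribution (strongly skewed), the true relationship is linear, and the errors are normal. The names skewness and residuals are hypothetical helpers, not anything from SPSS.

```python
import random

def skewness(values):
    """Moment-based sample skewness (0 for a symmetric distribution)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def residuals(xs, ys):
    """Residuals from an ordinary least-squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]

random.seed(0)  # fixed seed so the run is repeatable
xs = [random.expovariate(1.0) for _ in range(2000)]       # skewed predictor
ys = [2.0 + 3.0 * x + random.gauss(0.0, 1.0) for x in xs]  # normal errors

skew_y = skewness(ys)
skew_resid = skewness(residuals(xs, ys))

print("skewness of dependent variable:", round(skew_y, 2))
print("skewness of residuals:         ", round(skew_resid, 2))
```

Here the dependent variable is clearly skewed, yet the residuals are close to symmetric, so the usual regression assumption is met without transforming anything. Checking the raw variables for normality would have suggested a problem that isn't there.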
As for non-linear transforms of data, from myself, again(2),
>I think it's doubtful policy to transform ANY of your variables to
>make them look normally distributed.
>First, the assumptions of most methods (you say "predictive model" --
>multiple regression?) don't include normal distribution of the
>variables. They often assume normal distribution of the RESIDUALS, but
>in practice aren't very sensitive to violations of this assumption
>(unless very large residuals occur).
>If you transform a variable to make it look normal, you're throwing
>away the scale information; you're "stretching" the scale of the
>variable arbitrarily in different parts of its range. If you are
>willing to do this, you're implicitly saying it's an ordinal-level
>variable, not scale-level, and you probably should be using
>non-parametric, i.e. ordinal, analysis methods.
>You wrote, "we only need to transform dependent variable, if it is too
>skewed". Again, I don't think you should do this, skewed or not. What
>you should consider is a dependent variable that shows relatively
>few values that are very large compared to most others. Those values,
>often called 'outliers', will have a large, perhaps dominant, effect
>on the resulting model. Depending on the underlying mechanism, that
>may be the most useful answer; or, it may be more useful to fit a
>model omitting the 'outliers'. I won't start talking about the
>appropriate circumstances for each strategy, but in neither case is
>transforming the variable to "normality" a good idea.
In general, that is, it's currently recommended to apply non-linear
transformations to data only when that is believed to match the actual
effect of the values. As a common example, income is often
log-transformed, on the belief that a $1,000 increase in income has a
very different effect on the behaviour of a person whose income is
originally $10,000, and one whose income is originally $100,000; but a
10% increase should have a similar effect in both cases.
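The arithmetic behind the income example can be checked directly. This is a minimal sketch; log_change is a hypothetical helper that returns the change in the natural log of income for a given raise, and the dollar figures are the ones used above.

```python
import math

def log_change(income, raise_amount):
    """Change in ln(income) when income rises by raise_amount."""
    return math.log((income + raise_amount) / income)

# A flat $1,000 raise is a very different step on the log scale...
flat_low = log_change(10_000, 1_000)
flat_high = log_change(100_000, 1_000)

# ...while a 10% raise is the same step at either income.
pct_low = log_change(10_000, 1_000)     # 10% of $10,000
pct_high = log_change(100_000, 10_000)  # 10% of $100,000

print("flat $1,000 raise:", round(flat_low, 4), "vs", round(flat_high, 4))
print("10% raise:        ", round(pct_low, 4), "vs", round(pct_high, 4))
```

So the log transform is justified here by a belief about how income acts (proportional changes matter), not by what it does to the shape of the histogram.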
Whew! Am I coming down too hard? But this is a subject to approach with
great care, and there are now strong arguments against lightly removing
outliers, or transforming to reduce skew, kurtosis, etc.
(1) Re: lg10 versus ln, Mon, 16 May 2005 01:07:31 -0400
(2) Re: Normalize Everything?, Thu, 14 Oct 2004 19:21:31 -0400