LISTSERV at the University of Georgia
Date:         Wed, 28 Sep 2005 12:14:05 -0400
Reply-To:     Richard Ristow <wrristow@mindspring.com>
Sender:       "SPSSX(r) Discussion" <SPSSX-L@LISTSERV.UGA.EDU>
From:         Richard Ristow <wrristow@mindspring.com>
Subject:      Re: data screening help
Comments: To: Kathryn Gardner <KJGARDNER10@HOTMAIL.COM>
In-Reply-To:  <200509281432.j8SDbrjO009603@malibu.cc.uga.edu>
Content-Type: text/plain; charset="us-ascii"; format=flowed

At 10:32 AM 9/28/2005, Kathryn Gardner wrote:

>>I have a number of questions relating to data screening (i.e.,
>>outlier, normality, linearity checks) that I am hoping people can help
>>me out with

We've discussed this subject before, and there are some pretty strong opinions. Let me start. This has come up enough that I find I can quote a number of my own old postings. I think you'll hear from Hector Maletta and Art Kendall, as well. For clarity, I'm double-quoting what you've written, single-quoting any replies that are quoted from previous posts.

>>2) The procedure for detecting outliers depends on whether data is
>>continuous or categorical. If it is continuous this means data
>>screening the sample as a whole, if categorical this means screening
>>by group. I am using analyses that will involve both the use of
>>continuous and categorical data, so how should I screen my data? I've
>>been screening as a whole up until now. Besides, if I decided to
>>screen using groups, at what point do I decide not to split the data
>>into groups i.e., I could split my data according to gender, age,
>>ethnicity, education, occupation, country etc etc.

First, when you say, "procedure for detecting outliers depends on whether data is continuous or categorical," it sounds like you don't mean quite what you say. Screening categorical data for outliers is rarely meaningful. It sounds like you mean screening for outliers in continuous variables within categorical groups.

Second, "outlier" is now questioned as a notion. In brief, some "outliers" can confidently be identified as erroneous data; they should be corrected from the source, or treated as missing. But other "outliers" are legitimate, though unusual, observed values of the quantity. Removing them can distort your analysis altogether. Quoting myself(1),

>"Outliers" get a good deal of discussion, in statistics. There are
>several cases:
>
>. Values outside the range that's *a priori* possible, or far outside
>reasonable experience. These may be assumed to be data errors, and
>treated as missing values (correctly, as the true value is not known)
>if it's not feasible to correct them from the original source.
>
>. "Cloud and outlier" distributions: Most of the observations are
>concentrated in a range of modest size (the 'cloud'). There are a few
>with much larger values (the 'outliers'), and it is reasonable to
>believe that the outliers are real, unusual values. Here, a regression
>fit including all data is reasonable, BUT: because the outliers have
>so much leverage, you're likely to get a model that's predictive or
>explanatory for what distinguishes 'cloud' points from the outliers,
>but is much less relevant within the 'cloud'. In this case, you're
>wise to fit models with and without outliers, and interpret carefully
>the differences between them. Simply throwing away 'outliers' that
>can't be dismissed as errors, is currently frowned upon: you may give
>yourself a badly distorted view of your population. However,
>eliminating the 'outliers' because you're studying a part of the range
>(the 'cloud') is defensible. That appears to be what you're doing. It's
>just fine, subject to the criticism that it doesn't tell you much
>about the extremes of the income range.
>
>. Distributions that cover a wide dynamic range, with cases observed
>over the whole range. Here, you want to look for a model that covers
>the whole range. This is where a log transform may be appropriate, if
>it's defensible on theoretical grounds. A linear fit using the United
>States income distribution would probably end up explaining the
>largest incomes, and effectively lumping a lot of the range as "more
>or less zero, within measurement error."
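
The "fit models with and without outliers" advice in the quoted posting can be illustrated with a minimal sketch. It is written in Python with NumPy rather than SPSS syntax, the data are simulated, and every name in it (fit_line, x_cloud, x_out, and so on) is hypothetical. It is one way to run the comparison, not a prescribed procedure.

```python
import numpy as np

def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    slope, intercept = np.polyfit(x, y, 1)
    return slope, intercept

# Simulated data (an assumption for illustration only):
# a 'cloud' of 50 cases with no trend, plus 2 real but extreme cases.
rng = np.random.default_rng(0)
x_cloud = rng.uniform(0, 10, 50)
y_cloud = 5 + rng.normal(0, 1, 50)
x_out = np.array([95.0, 100.0])
y_out = np.array([100.0, 105.0])

x_all = np.concatenate([x_cloud, x_out])
y_all = np.concatenate([y_cloud, y_out])

# Fit once with every case, once with the cloud alone.
slope_all, _ = fit_line(x_all, y_all)
slope_cloud, _ = fit_line(x_cloud, y_cloud)

print(f"slope, all cases:  {slope_all:.2f}")
print(f"slope, cloud only: {slope_cloud:.2f}")
```

With these simulated values the all-cases slope is driven almost entirely by the two extreme points, while the cloud-only slope stays near zero. Neither fit is "the right one"; the difference between them is what needs interpreting.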

>>3) Related to the above Q, does the idea that screening for outliers
>>depends on whether data is continuous or categorical apply to all
>>data screening procedures i.e., normality analyses?

Again, are you talking not about screening categorical data for outliers, but about screening continuous data within categories? If not, I'm missing your point badly.

>>4) I have screened my data according to subscales rather than full
>>scale scores e.g., checked the normality of each individual subscale
>>on each questionnaire (some questionnaires don't produce full scale
>>scores). I don't know whether this is standard practice, but to me it
>>makes sense to screen by subscale. I do however, have a variable that
>>does produce subscales, but I have had to use the full scale score in
>>my data screening because I can't split it into subscales until I've
>>factor analysed it. Is this OK?

So far, so good, but

>>5) I've been using logarithm & square root transformations etc to
>>reduce skew and kurtosis, but these transformations don't appear to
>>be effective in improving normality when there is only high or low
>>kurtosis (i.e., when skew is OK). Any suggestions?

You'll probably hear this from a number of us (Hector Maletta has posted it a good many times), but briefly: It's probably unwise to do either.

First, very few procedures depend on the data being normally distributed. Some assume that the *residuals after estimation* are normally distributed, but these are usually pretty forgiving. Forgiving, except that very large values have very large effects on the estimate. (I didn't say that outliers don't matter; I said that the solution isn't to trim values that are unusual only in being large.)
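
The distinction between the distribution of the variables and the distribution of the residuals can be sketched as follows. This is an illustration in Python with NumPy under made-up assumptions: the predictor is simulated as strongly skewed, the noise as normal, and the helper name skewness is hypothetical.

```python
import numpy as np

def skewness(values):
    """Sample skewness: third central moment over the variance to the power 1.5."""
    v = np.asarray(values, dtype=float)
    d = v - v.mean()
    return (d ** 3).mean() / (d ** 2).mean() ** 1.5

# Simulated data (an assumption): a skewed predictor, normally distributed noise.
rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=2000)
y = 2.0 + 3.0 * x + rng.normal(0.0, 1.0, size=2000)

# Fit a straight line and look at what is left over.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

print(f"skewness of y:         {skewness(y):.2f}")
print(f"skewness of residuals: {skewness(residuals):.2f}")
```

In this setup the dependent variable is clearly skewed, yet the residuals are close to symmetric, so nothing here calls for transforming y.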

As for non-linear transforms of data, from myself, again(2),

>I think it's doubtful policy to transform ANY of your variables to
>make them look normally distributed.
>
>First, the assumptions of most methods (you say "predictive model" --
>multiple regression?) don't include normal distribution of the
>variables. They often assume normal distribution of the RESIDUALS, but
>in practice aren't very sensitive to violations of this assumption
>(unless very large residuals occur).
>
>If you transform a variable to make it look normal, you're throwing
>away the scale information; you're "stretching" the scale of the
>variable arbitrarily in different parts of its range. If you are
>willing to do this, you're implicitly saying it's an ordinal-level
>variable, not scale-level, and you probably should be using
>non-parametric, i.e. ordinal, analysis methods.
>
>You wrote, "we only need to transform dependent variable, if it is too
>skewed". Again, I don't think you should do this, skewed or not. What
>you should consider is a dependent variable that shows a relatively
>few values that are very large compared to most others. Those values,
>often called 'outliers', will have a large, perhaps dominant, effect
>on the resulting model. Depending on the underlying mechanism, that
>may be the most useful answer; or, it may be more useful to fit a
>model omitting the 'outliers'. I won't start talking about the
>appropriate circumstances for each strategy, but in neither case is
>transforming the variable to "normality" a good idea.

In general, that is, it's currently recommended to apply non-linear transformations to data only when that is believed to match the actual effect of the values. As a common example, income is often log-transformed, on the belief that a $1,000 increase in income has a very different effect on the behaviour of a person whose income is originally $10,000 than on one whose income is originally $100,000; but a 10% increase should have a similar effect in both cases.
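
The arithmetic behind the income example can be checked directly. A small Python sketch, using only the figures given above; the function name log_change is hypothetical.

```python
import math

def log_change(income, new_income):
    """Change in log(income) when income moves to new_income."""
    return math.log(new_income) - math.log(income)

for income in (10_000, 100_000):
    fixed = log_change(income, income + 1_000)    # a $1,000 raise
    relative = log_change(income, income * 1.10)  # a 10% raise
    print(f"income {income:>7}: +$1,000 -> {fixed:.4f}, +10% -> {relative:.4f}")
```

On the log scale the $1,000 raise is a step of about 0.095 at $10,000 but only about 0.010 at $100,000, while the 10% raise is the same step (about 0.095) at both incomes. That is the sense in which the log transform is meant to match the actual effect of the values.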

Whew! Am I coming down too hard? But this is a subject to approach with great care, and there are now strong arguments against lightly removing outliers, or transforming to reduce skew, kurtosis, etc.

Good luck!
Richard Ristow
...............................
Cited postings:
(1) Re: lg10 versus ln, Mon, 16 May 2005 01:07:31 -0400
(2) Re: Normalize Everything?, Thu, 14 Oct 2004 19:21:31 -0400

