Date: Mon, 28 Aug 2006 09:20:18 -0400
Sender: "SPSSX(r) Discussion" <SPSSX-L@LISTSERV.UGA.EDU>
From: Art Kendall <Art@DrKendall.org>
Organization: Social Research Consultants
Subject: Re: outliers??
Content-Type: text/plain; charset=UTF-8; format=flowed
Very well put. Outlier status depends on a case selection model, a
measurement operationalization model, and an analytic model. The concept
of an "inlier" is also critical to understanding.
SPSS is to be commended for including the anomaly detection (AD)
procedure and Find Duplicate Cases (FDC). Including these helps to
reinforce the idea that data need to be cleaned, checked, and
explored. AD and FDC facilitate quality assurance.
Although it is possible to work around this and compare files that are
supposed to be double-keyed copies of each other, it is not as
straightforward as it should be. I strongly urge SPSS to implement a
single syntax command that compares 2 files and reports differences 1)
in the dictionary and 2) in the data values.
Double keying is a venerable QA procedure. When SPSS was run on card
images in 1972, it was routine to compare the input data cards and the
output from WRITE FILEINFO using routines from the operating system.
Whereas FDC looks for situations where there is duplication that should
not be there, the procedure I am urging looks for situations where two
files should agree but do not.
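The file comparison urged above can be approximated outside SPSS. Below is a minimal, hypothetical sketch in Python (the function name and file layout are invented for illustration; this is not the SPSS syntax command being requested). It reports both dictionary-level and cell-level disagreements between two independently keyed files:

```python
def compare_entries(file_a, file_b):
    """Report disagreements between two independently keyed files.

    Each file is a list of dicts, one dict per case (row).
    Returns a list of (case, variable, value_in_a, value_in_b) tuples.
    """
    # "Dictionary" check: both files must define the same variables.
    vars_a, vars_b = set(file_a[0]), set(file_b[0])
    if vars_a != vars_b:
        raise ValueError(f"dictionary mismatch: {sorted(vars_a ^ vars_b)}")
    # Data check: flag every cell where the two keyings disagree.
    mismatches = []
    for case, (row_a, row_b) in enumerate(zip(file_a, file_b)):
        for var in row_a:
            if row_a[var] != row_b[var]:
                mismatches.append((case, var, row_a[var], row_b[var]))
    return mismatches
```

Every tuple returned is a keying discrepancy to resolve against the paper source.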
Social Research Consultants
Peck, Jon wrote:
>To add to all this good advice,
>In many cases whether something is an outlier or not depends on a model. It may be an extreme value not explained by the model. It may be much more complicated than a univariate extreme. The new Anomaly Detection procedure in SPSS can help to find these in a multivariate framework, although it is still up to you to decide what to do about it.
>In a context such as regression, it is good to look at the leverage statistics to see whether potential outliers actually affect your results much or not.
>Finally, consider the process assumed to be generating the data. It is commonly observed that stock market prices follow a random walk model, which means that the variance is not finite. Such a fat-tailed distribution will intrinsically have more outliers than we are probably accustomed to seeing, but that is part of the phenomenon to model.
>Note that outlier and unusual are not quite the same thing. You might have a very lonely and suspicious value buried in the middle of your data in a sparse region. Is that an outlier? It might be equally suspicious.
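For simple regression, the leverage Jon recommends looking at has a closed form, h_i = 1/n + (x_i - xbar)^2 / sum((x - xbar)^2). A small Python sketch of just that arithmetic, with invented data (the thread's context is SPSS, which can save such casewise diagnostics):

```python
def leverages(xs):
    """Leverage of each case in a simple regression with intercept:
    h_i = 1/n + (x_i - mean)^2 / sum((x - mean)^2)."""
    n = len(xs)
    mx = sum(xs) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    return [1 / n + (x - mx) ** 2 / sxx for x in xs]

# One extreme predictor value (invented data) takes almost all the leverage.
h = leverages([1, 2, 3, 4, 100])
```

The leverages sum to 2 (the number of fitted parameters), so one case holding nearly all of it means the fit is effectively that one case.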
>From: SPSSX(r) Discussion [mailto:SPSSX-L@LISTSERV.UGA.EDU] On Behalf Of Art Kendall
>Sent: Sunday, August 27, 2006 8:06 AM
>Subject: Re: [SPSSX-L] outliers??
>I left the previous two excellent responses in this message.
>I have pasted one of my soapbox statements below, which gives another
>perspective.
>Social Research Consultants
>"Outliers" is a very problematic concept. There are a wide variety of
>meanings ascribed to the term.
>Based on consulting on statistics and methodology for over 30 years, I
>believe the usual explanation when there are suspicious values is a
>failure of the quality assurance procedure. I think of a potential
>outlier as a surprising or suspicious value for a variable (including
>residuals).
>In my experience, in the vast majority of instances, they indicate data
>gathering or data entry errors, i.e., insufficient attention in quality
>assurance in data gathering or data entry. In my experience, rechecking
>QA typically eliminates over 80% of suspicious data values. This is one
>reason I advocate thorough exploration of a set of data before doing
>the analysis. By thorough exploration I mean things like frequencies,
>multi-way crosstabs, scatterplots, box plots, rechecking scale keys, etc.
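As a toy illustration of that exploration step (the post describes it in SPSS terms; this sketch uses Python with invented values), a frequency table and a two-way crosstab are enough to surface a suspicious value:

```python
from collections import Counter

ages    = [34, 41, 34, 29, 112]   # 112: a real centenarian, or 12 mis-keyed?
marital = ["married", "widowed", "married", "single", "widowed"]

# Frequencies: a value appearing once at the extreme deserves a second look.
freq = Counter(ages)

# Two-way crosstab of age group by marital status; impossible combinations
# (a "child"/"widowed" cell, say) would jump out here.
crosstab = Counter((("child" if a < 18 else "adult"), m)
                   for a, m in zip(ages, marital))
```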
>Derived variables, such as residuals and rates, should be subjected to
>the same thorough examination and understanding-seeking as raw
>variables. This identifies suspicious values.
>Unusual values may be "real". They should not be simply tossed. In
>cluster analysis, sometimes there are singleton clusters, e.g., Los
>Angeles county is distinct from other counties in the western states.
>Sometimes there are 500 lb persons. There might be a rose growing in a
>cornfield. There may be strong interaction (synergy) effects.
>The first thing to do about outliers is to prevent them by careful
>quality assurance procedures in data gathering and handling.
>A thorough search for suspect data values and potentially treating them
>as outliers in analysis is an important part of data quality assurance.
>Values for a variable are suspect and in need of further review when
>they are unusual given the subject matter area, outside the legitimate
>range of the response scale, show up as isolated points on scattergrams, have
>subjectively extreme residuals, when the data shows very high order
>interaction on ANOVA analyses, when they result in a case being
>extremely influential in a regression, etc. Recall that researchers
>consider Murphy a Pollyanna.
>The detection of odd/peculiar/suspicious values late in the data
>analysis process is one reason to assure that you can go all the way
>back and redo the process. Keeping all of the data gathering
>instruments, and preserving the syntax for all data transformations, are
>important parts of going back and checking on "outliers". The
>occurrence of many outliers suggests the data entry was sloppy. There
>are likely to be incorrectly entered values that are not "outliers".
>Although it is painful, another round of data entry and verification may
>be in order.
>Correcting the data.
>Sometimes you can actually go back to redo the measurements. (Is there
>really a 500 pound 9 year old?). You should always have all the paper
>from which data were transcribed.
>On the rare occasions when there are very good reasons, you might modify
>the value for a particular case, e.g., a percent correct entered as 1000%.
>Modifying the data.
>Values of variables should be trimmed or recoded to "missing" only when
>there is a clear rationale. And then only when it is not possible to
>redo the measurement process. (Maybe there really is a six year old who
>weighs 400 lbs. Go back and look if possible.)
>If suspected outliers are recoded or trimmed, the analysis should be
>done as is and as modified to see what the effect of the modification
>is. Changing the values of variables suspected to be outliers frequently
>leads to misleading results. These procedures should be used very
>sparingly. Math criteria can identify suspects, but there should be a
>trial before there is a verdict, and the presumption should be against
>outlier status for a value.
>I don't recommend undesirable practices such as cavalierly trimming to 3
>SDs. Having a value beyond 3 SDs can be a reason to examine a case more
>closely.
>It is advisable to consult with a statistician before changing the
>values of suspected outliers.
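In that spirit, a screening rule should nominate cases for review rather than change them. A minimal Python sketch (the 3-SD threshold is exactly the criterion criticized above, used here only to flag suspects, never to trim):

```python
from statistics import mean, stdev

def flag_suspects(values, k=3.0):
    """Return indices of values more than k standard deviations from the
    mean -- candidates for review, not for automatic trimming."""
    m, s = mean(values), stdev(values)
    return [i for i, v in enumerate(values) if abs(v - m) > k * s]
```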
>If you have re-entered the data, or re-run the experiment, and done very
>thorough exploration of the data, you are stuck as a last resort with
>doing multiple analyses: including vs excluding the case(s); changing
>the values for the case(s) to hotdeck values, to some central tendency
>value, or to max or min on the response scale (e.g., for achievement,
>personality, or attitude measures), modeling the specialness of the
>particular value, etc.
>In the small minority of occasions where the data cannot be cleaned up,
>the analysis should be done in three or more ways (include the
>outliers as is, trim the values, treat the values as missing, transform
>to ranks, include in the model variables that flag those cases, or
>...). The reporting becomes much more complex. Consider yourself very
>lucky if the conclusions do not vary substantially.
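As a toy numeric version of that "three or more ways" advice (Python, invented data, with the mean standing in for a full analysis):

```python
from statistics import mean

data = [12, 15, 14, 13, 16, 140]   # 140: mis-keyed 14, or a real extreme?
suspect = 140

as_is   = mean(data)                             # include the outlier as is
trimmed = mean(min(v, 20) for v in data)         # trim to a stated cap
dropped = mean(v for v in data if v != suspect)  # treat it as missing

# If these disagree as sharply as here (35.0 vs 15.0 vs 14.0), the
# reporting has to carry all of them.
```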
>Social Research Consultants
>Hector Maletta wrote:
>>"Outliers" have never been defined satisfactorily, and the concept is seldom
>>used in a consistent way. Outliers are not "impossible" values, such as a
>>widower who is 4 years old, or a mother who is younger than her daughter.
>>Those are most likely data-entry or data-taking errors.
>>Outliers are, most properly, extreme values. In a sample about heights they
>>are individuals measuring over seven feet, or dwarfs. In an income sample
>>they are people like Bill Gates. They are not impossible, they are simply
>>rare. Of course, an extreme value may also be a simple mistake: a person 112
>>years old may be just 12, and someone who is 7'11" tall may be a more common
>>5'11" just wrongly written or typed. But they just might exist: extremely
>>old, extremely tall, extremely wealthy.
>>Now, what is wrong with finding rare cases? If they exist, they should be
>>dutifully recorded in your data, not hidden under the carpet. The problem is
>>that they may distort your sample results if you are not careful in their
>>treatment. If you have a 1/10,000 sample of a certain area, in order to
>>estimate the distribution of heights, and stumble on the one and only dwarf
>>in the neighborhood, you may end up estimating that the area is populated by
>>10,000 small people, or (in another example) by 10,000 people with the
>>income of Bill Gates. They may alter the shape of your curve, or disfigure
>>your mean or standard deviation.
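A toy numeric version of Hector's point (Python, heights invented): with each sampled case standing in for 10,000 people, the one dwarf moves the estimated mean by several centimeters.

```python
from statistics import mean

# Heights in cm for a 1/10,000 sample; each case represents 10,000 people.
sample = [170, 175, 168, 172, 174, 120]   # 120 cm: the only dwarf in town

est_with    = mean(sample)        # the rare case drags the estimate down
est_without = mean(sample[:-1])   # a fresh sample would likely miss him
```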
>>From another point of view, if you start again and draw your sample anew,
>>chances are you won't stumble again on the only giant or the only dwarf in
>>town. Of all possible random samples of the same size, just very few will
>>include them, precisely because such subjects are rare, perhaps unique. If
>>you have some grounds to know that they are extremely rare in the general
>>population from which your sample comes, you may decide to exclude them from
>>the sample, though this is seldom advisable without careful statistical
>>justification.
>>One interesting exercise is considering the impact of their removal on the
>>mean and standard deviation of important variables, and on the slope of key
>>regression coefficients in your research. Suppose you are investigating the
>>relationship between capital and technology, and discover a very strong
>>relationship: more money, more high tech, but then you discover that the
>>whole thing crumbles down when you withdraw Bill Gates from the sample: he
>>was that solitary point to the Northeast of your scatterplot, while all
>>other capitalists in your sample made their money in old fashioned low-tech
>>businesses. Out goes Bill, and your money-tech beta falls into
>>non-significance (just a fictional example, of course). SPSS Regression, for
>>instance, lets you see the impact of removing each case on the overall fit
>>of a regression model. A high-impact case is probably an outlier worth
>>considering for closer inspection (and possible removal if suspect).
>>Even if SPSS may identify high-impact cases, all this requires human
>>intelligence. No surefire statistical device can do it for you.
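The leave-one-out check Hector describes (which he notes SPSS Regression offers as a built-in diagnostic) can be sketched in a few lines of Python with invented data; the slope barely moves until the "Bill Gates" case is deleted:

```python
def slope(xs, ys):
    """Least-squares slope of y on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

capital = [1, 2, 3, 4, 100]   # the last case is the "Bill Gates" point
hitech  = [2, 1, 3, 2, 90]

full = slope(capital, hitech)
# Refit with each case deleted in turn and compare against the full fit.
loo = [slope(capital[:i] + capital[i+1:], hitech[:i] + hitech[i+1:])
       for i in range(len(capital))]
```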
>>Hope this helps.
>>From: SPSSX(r) Discussion [mailto:SPSSX-L@LISTSERV.UGA.EDU] On behalf of
>>Sent: Sunday, August 20, 2006 5:31 PM
>>Subject: Re: outliers??
>>At 04:19 AM 8/14/2006, Samuel Solomon wrote:
>>>I am clueless on how to handle outliers (especially when it comes to
>>>prices) through the help of SPSS. Is there any surefire way?
>>OK, there are better statisticians than I on this list, but to start with:
>>There is not, and never will be, a surefire way, in SPSS or anywhere
>>else. 'Outlier' values may be the most important data, and you may
>>distort your analysis very badly by dropping them. I take the liberty
>>of re-posting an essay on outliers I wrote some time ago (*); I hope it
>>is at least partly germane to your needs.
>>*Whether*, and *by what standard*, to identify outliers, is at least as
>>important as 'how'.
>>. Outliers that fail consistency checks: For example, an event date
>>prior to the beginning of the study, or later than the present, can be
>>rejected as wrong. (I've got 'event dates' on the brain from a project
>>I'm working on. And, of course, I'm assuming that 'event dates' in the
>>past or future aren't valid; in some study designs, they may be.) Those
>>should be made missing; or they should be checked against the primary
>>data source and corrected, if that is feasible.
>>. Outliers that can't be rejected *a priori*: First, you shouldn't even
>>try to look at those until you reject any demonstrable errors.
>>Second, I would say a good way to look for them is to look at the
>>high-percentile cutpoints in the distribution. Depending on the size of
>>your dataset, 'high-percentile' could be 99%, 99.9%, and 99.99%. (These
>>are not alternatives. If you use, say, 99.9%, you should look at 99% as
>>well. Consider also looking at the 90% or 95% cutpoint, for a sense of
>>the 'normal' range of the distribution. 5% outliers are NOT outliers.
>>And, of course, look at both ends: 1%, 0.1%, 0.01% percentile
>>cutpoints, as well.)
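Richard's percentile-cutpoint inspection maps directly onto `statistics.quantiles` in Python; a deterministic sketch with invented prices (two planted bad entries):

```python
from statistics import quantiles

# 10,000 well-behaved prices (100..199) plus two suspicious entries.
prices = [100 + (i % 100) for i in range(10_000)] + [9_500, 10_200]

# n=1000 gives 999 cut points: the 0.1%, 0.2%, ..., 99.9% cutpoints.
cuts = quantiles(prices, n=1000)
low, high = cuts[0], cuts[-1]   # the 0.1% and 99.9% cutpoints

# Both planted values sit far beyond the 99.9% cutpoint -- exactly the
# kind of gap worth a second look before calling anything an outlier.
```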
>>Third, I think I'm seeing a trend in the statistics community against
>>removing 'outliers' by internal criteria (n standard deviations, 1st
>>and 99th percentiles). The rationale, and it's a strong one, is that
>>those are observed values of whatever it is that you're measuring. If
>>you eliminate them, you'll get a model based on their rarity; and that
>>model, itself, can become an argument for eliminating them (because
>>they don't fit it), and you can talk yourself into a model that's quite
>>unrepresentative of reality.
>>Fourth, however, the largest values will have a disproportionate,
>>possibly dominant, effect on most linear models -- regression, ANOVA,
>>even taking the arithmetic mean. Depending on your study, you can:
>>- Go ahead. In this case, the model's fit will be weighted toward
>>predicting the largest values, and may show little discrimination
>>within the 'cloud' of more-typical values. That, however, may be the
>>right insight to be gained from the data.
>>- If available, use a non-parametric method. That's often favored,
>>because it neither rejects the large values nor gives them
>>disproportionate weight. By the same token, however, if much of the
>>useful information is in the largest values, non-parametric methods can
>>unduly DE-emphasize these values.
>>- There are reasons to reject this as heresy, but if you're doing
>>linear modelling, I'd probably try it both with the largest values
>>retained and with them eliminated. (I'd only do this if the 'largest
>>values' look very far from the 'cloud' of regular values. A scatter
>>plot can be an invaluable tool for this.) If the two models are closely
>>similar, you have an argument that there's a single process going on,
>>with the largest values being part of the same process. If they're very
>>different, you may have two processes, one of which operates
>>occasionally to produce the largest values, the other of which operates
>>'normally' but is swamped when the larger process happens. And if the
>>run without the large values produces a poor R^2, you may have an
>>argument that the observable process is represented by the largest
>>values, and the variation in the 'normal cloud' is mostly noise.
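A compact Python sketch of that both-ways comparison (invented data): a flat "cloud" plus one far point. With the point in, the fit looks superb; without it, there is essentially no relationship.

```python
def fit(xs, ys):
    """Least-squares slope and R^2 of y on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sxx, (sxy * sxy) / (sxx * syy)

xs = [1, 2, 3, 4, 5, 50]                     # 50: the lone far-out value
ys = [2.1, 1.9, 2.0, 2.2, 1.8, 100.0]

full_slope, full_r2 = fit(xs, ys)            # dominated by the far point
sub_slope,  sub_r2  = fit(xs[:-1], ys[:-1])  # the 'cloud' alone: no trend
```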
>>- [ADDED] Investigate carefully using a 'bootstrap' method - sampling
>>from your sample. Very large data values occurring in very small
>>proportion can give you a huge variance in your estimates, that won't
>>be detected by standard analytic methods. With a very small proportion
>>of large values, the expected number in a sample may be very small,
>>with a large variance. (Let's see - Poisson distributed for all
>>practical purposes, I think.) Particularly, because of the 'leverage'
>>of the very large values, the estimates in a sample that includes one
>>or more may be drastically different from those in a sample that
>>happens not to include any. You'll have to see whether that's a problem
>>in your data. If it is, it may help to do a stratified sample in which
>>the large values are over-represented, and then assigned lower weights
>>in proportion. If, that is, you can identify a subgroup of 'large
>>values,' and have access to enough of them to get a significant sample.
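Richard's [ADDED] bootstrap suggestion, sketched in Python with invented data: with one huge value in a hundred cases, the count of huge values drawn per resample is roughly Poisson(1), and the bootstrapped means swing wildly depending on whether it appears.

```python
import random

def bootstrap_means(data, reps=2000, seed=42):
    """Means of `reps` resamples drawn with replacement from `data`."""
    rng = random.Random(seed)
    k = len(data)
    return [sum(rng.choices(data, k=k)) / k for _ in range(reps)]

data = [10] * 99 + [5000]         # one huge value per hundred cases
means = bootstrap_means(data)
spread = max(means) - min(means)  # huge: the estimate hinges on one case
```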
>>Onward, and good luck,
>>(*) Ristow, Richard, "Re: Outliers", on list "SAS(r) Discussion"
>><SAS-L@LISTSERV.UGA.EDU>, Mon, 15 Nov 2004 14:26:51; reposted as "Re:
>>Question on excluding extremes", SPSSX-L Fri, 17 Feb 2006 11:21:29