**Date:** Wed, 28 Sep 2005 20:17:13 -0400
**Reply-To:** Richard Ristow <wrristow@mindspring.com>
**Sender:** "SPSSX(r) Discussion" <SPSSX-L@LISTSERV.UGA.EDU>
**From:** Richard Ristow <wrristow@mindspring.com>
**Subject:** Re: data screening help
**In-Reply-To:** <BAY101-F41114E78C965382388CAB2A68D0@phx.gbl>
**Content-Type:** text/plain; charset="us-ascii"; format=flowed
Dear Kathryn,

At 04:58 PM 9/28/2005, Kathryn Gardner wrote:

>Thanks once again for taking the time to reply to my e-mail Richard,
>your help is much appreciated.

You're most welcome. No miracles, I'm afraid.

>I actually have skew, kurtosis, outliers etc on about 8 DVs and 3 IVs,
>but I was actually under the impression that distributional
>requirements applied to IVs as well.

I have always understood not. However, outlier cases on the DVs, like
those on the IVs, can have greatly disproportionate 'leverage' on the
results.

>By multivariate outliers I mean a case with a combination of extreme
>values on two or more variables.

If you have 8 DVs, there are a few things you can look for. One is, do
extreme values cluster among the DVs; that is, is there evidence for an
underlying mechanism that produced outliers in several DVs?

By the way, how far out are they lying? Do they look like a
low-frequency extension of the variables' general distribution? If so,
that's a case for retaining them. Or, are they many SDs from the mean,
looking completely isolated from what you'd consider the main body of
the data? That could be a case for postulating a separate,
low-frequency mechanism by which they arise.
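One common way to ask "how far out are they lying" across several variables at once is the Mahalanobis distance, which measures each case's distance from the multivariate centroid in standard-deviation-like units. A quick sketch, in Python rather than SPSS syntax, with simulated data (the 0.999 chi-square cutoff is just one common convention, not a rule from this thread):

```python
# Sketch: flagging multivariate outliers with Mahalanobis distance.
# Data, seed, and cutoff are all illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))     # 200 cases on 3 variables
X[0] = [4.0, -4.0, 4.0]           # one jointly extreme case, planted

mean = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mean
d2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)   # squared distances

# Under multivariate normality, d2 is roughly chi-square with df = 3
cutoff = stats.chi2.ppf(0.999, df=X.shape[1])
outliers = np.flatnonzero(d2 > cutoff)
print(outliers)
```

Cases above the cutoff are candidates for the "separate mechanism" question above, not automatic deletions.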

In other words: what can you say about the structure of the outliers?
And, what can you say might possibly account for them? I can't solve
this, but you may be able to, knowing your study. It's certainly a
question. The one thing you can't do is ignore them. Think of your
report this way:

"This model explains xx% of the variance in the data, except on the y%
of the cases where very large values are observed, which have been
excluded from this analysis. No information is available on the
mechanism causing these large values. If they are included, the model
described explains zz% [probably much smaller] of the variance.

"Accordingly, we have fitted an alternative model, as described above
but with all data included. It explains ww% of the overall variance;
with the very large values excluded, it explains tt% of the variance in
the remaining values." (And you'd discuss the differences in the
models.)

This isn't a template. It's an illustration of what you get if you
simply trim 'outliers'.
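To see concretely how a few very large values can swing the "variance explained" figure in the mock report above, here is a simulated illustration (Python rather than SPSS syntax; every number is made up):

```python
# Sketch of the "xx% vs zz%" contrast: variance explained by a simple
# linear fit with and without a few very large values.
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(scale=1.0, size=100)
y[:3] += 25.0                      # three arbitrarily large values

def r_squared(x, y):
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (intercept + slope * x)
    return 1.0 - resid.var() / y.var()

r2_all = r_squared(x, y)           # large values included
r2_trim = r_squared(x[3:], y[3:])  # large values excluded
print(round(r2_all, 2), round(r2_trim, 2))
```

The trimmed fit looks far better, which is exactly why reporting only the trimmed figure, without accounting for the excluded cases, is misleading.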

These are, by the way, mainly questions to guide you and your
colleagues and advisors. For us on the list to go much farther with
them would be to go beyond general advice and intrude on your study.

>When I said non-linearity I was referring to the idea that variables
>that are part of one scale should usually (but not always) have
>linear relationships (or at least this is what I can gather from my
>reading). I thought that linearity was an assumption of parametric
>analyses.

It is, for scale-level independent variables. (For categorical
independents, as in ANOVA, the question doesn't arise.) A non-linear
relationship, if known, needs to be dealt with. (You were very
non-specific in stating 'non-linearity'; it read as if you thought it
was a property of single variables.)

Here, again, are questions for you: What is the reason for thinking the
relationship is non-linear? If it's based on theory, the theory may
well suggest an appropriate transformation. If the evidence for
non-linearity is from observation, it's standard to add non-linear
terms, the square and perhaps the cube of your independent variable, to
your model. Be careful! In many cases, a variable, its square, and its
cube are very highly correlated. Get advice on the ways to work around
this.
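One standard workaround (not named in the thread, but widely used) is centering: subtracting the mean from a variable before squaring it usually cuts the correlation between the linear and quadratic terms dramatically. A simulated sketch, in Python rather than SPSS syntax:

```python
# Sketch: centering a predictor before forming its square.
# The data are simulated; the effect holds for any predictor whose
# values sit well away from zero.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(10.0, 20.0, size=500)        # a positive-valued predictor

r_raw = np.corrcoef(x, x**2)[0, 1]           # raw x vs x^2: near 1
xc = x - x.mean()                            # centered copy
r_centered = np.corrcoef(xc, xc**2)[0, 1]    # far smaller after centering

print(round(r_raw, 3), round(r_centered, 3))
```

The fitted curve is unchanged by centering; only the correlation among the terms, and hence the numerical stability of the estimates, improves.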

If tests on your model indicate that the non-linear terms should be
included, you need to consider the implications in your discussion.

>You said that "the assumption is normality of the residuals, not of
>the body of the data." I'm not too familiar with residuals (not yet
>anyway), but aren't residuals usually inspected post-analysis via
>multiple regression? If so, this suggests that I can start my main
>analyses now and then screen for normality later. But if I did this,
>what would the implications be of finding residuals that suggested
>non-normality?

I'm getting out of my depth here. Try, say, Hector Maletta directly.
Briefly, as I wrote, the methods are mostly pretty robust against
modest deviations from normality. I certainly wouldn't worry simply
because the skewness or kurtosis statistics can be shown to be
non-zero. Do worry about long 'tails' away from the center of the
distribution - "outliers."
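To make the residuals point concrete: after fitting a regression, it is the residuals, not the raw DV, that you inspect for skewness, kurtosis, and long tails. A simulated sketch in Python rather than SPSS syntax:

```python
# Sketch: checking normality on the residuals, not on the raw DV.
# The data here are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(size=300)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=300)  # simulated DV

slope, intercept = np.polyfit(x, y, 1)               # ordinary least squares
residuals = y - (intercept + slope * x)

# Modest skewness/kurtosis values are no cause for alarm by themselves;
# long tails (outliers) in the residuals are what to watch for.
print(stats.skew(residuals), stats.kurtosis(residuals))
stat, p = stats.shapiro(residuals)   # formal, often over-sensitive, test
print(p)
```

Note that the raw DV here is far from normal (it inherits the spread of x), yet the residuals are well-behaved, which is the point of the distinction.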

>I am happy not to transform or remove outliers, providing this still
>means that I run analyses such as ANOVA and MR.

In brief: you can. Special characteristics of the data, notably
outliers, will give you difficulties in interpretation, and you'll have
to address those. But if you eliminate them by transforms that make your
data look 'normal', or by arbitrarily removing large values before
analysis, you'll give yourself interpretation problems that are just as
bad, or worse.

There's no magic. If your data isn't simple, it isn't. I've given you
questions for investigation, not answers that will solve it.