Date: Wed, 4 Apr 2007 20:43:06 -0400
Reply-To: Richard Ristow <wrristow@mindspring.com>
Sender: "SPSSX(r) Discussion" <SPSSX-L@LISTSERV.UGA.EDU>
From: Richard Ristow <wrristow@mindspring.com>
Subject: Re: Chi-squared and Chi-squared test for trend comparison
In-Reply-To: <F7558203F6DCE54D87688CE56F289A780DC3DD@itexcn01.uchc.net>
Content-Type: text/plain; charset="us-ascii"; format=flowed
At 03:18 PM 4/4/2007, Burleson,Joseph A. wrote:
>The last 6 large (n = 400 to 6,000) clinical trials I analyzed all had
>age perfectly normally distributed (i.e., skewness between -.20 and
>+.20).
Well... The skewness measure may not be conclusive. The skewness is
zero for any symmetric distribution (given a finite third moment). That
includes uniform distributions, choosing each of two values with
probability 0.5, and any number of easy-to-construct long-tailed
distributions.
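A minimal sketch of that point, in Python with only the standard
library. The four distributions, the sample size, and the seed are
illustrative choices, not anything from the studies under discussion;
the Laplace stands in for "long-tailed".

```python
import random

def skewness(xs):
    """Sample skewness: third central moment over the 1.5 power of the second."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

rng = random.Random(2007)  # arbitrary seed, for repeatability only
n = 100_000

samples = {
    "uniform on [20, 60]": [rng.uniform(20, 60) for _ in range(n)],
    "two values, p = 0.5": [rng.choice((0.0, 1.0)) for _ in range(n)],
    # Laplace (double exponential): symmetric, heavier tails than the normal.
    "Laplace": [rng.expovariate(1.0) * rng.choice((-1.0, 1.0)) for _ in range(n)],
    "normal": [rng.gauss(0.0, 1.0) for _ in range(n)],
}

skews = {name: skewness(xs) for name, xs in samples.items()}
for name, value in skews.items():
    print(f"{name:22s} skewness = {value:+.3f}")
```

All four should land well inside a -.20 to +.20 band, though only one
of them is normal.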
I've no idea of the design of the studies you were on, but most
clinical trials select an age range, explicitly or by implication. The
population pyramid being fairly flat over much of its range, that tends
toward a uniform, or nearly uniform, age distribution. If it's a very
wide age range in an adult population, you'll probably see some upward
skewing.
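To make the selection argument concrete, here is a rough simulation.
The pyramid shape (flat to age 60, then falling linearly to zero at
100) is a made-up stand-in, not real census data, and the two age
ranges are likewise hypothetical.

```python
import random

def skewness(xs):
    """Sample skewness: third central moment over the 1.5 power of the second."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

def pyramid_weight(age):
    """Hypothetical relative density: flat to 60, then linear decline to 0 at 100."""
    return 1.0 if age <= 60 else (100.0 - age) / 40.0

def draw_population(n, rng):
    """Draw n ages from the hypothetical pyramid by rejection sampling."""
    ages = []
    while len(ages) < n:
        age = rng.uniform(0.0, 100.0)
        if rng.random() < pyramid_weight(age):
            ages.append(age)
    return ages

rng = random.Random(42)  # arbitrary seed
population = draw_population(200_000, rng)

# Two hypothetical eligibility rules applied to the same population.
narrow = [a for a in population if 30 <= a <= 50]
wide = [a for a in population if 20 <= a <= 90]

skew_narrow = skewness(narrow)
skew_wide = skewness(wide)
print(f"ages 30-50: n = {len(narrow)}, skewness = {skew_narrow:+.3f}")
print(f"ages 20-90: n = {len(wide)}, skewness = {skew_wide:+.3f}")
```

Under these assumptions the narrow range falls on the flat part of the
pyramid and comes out essentially uniform, while the wide adult range
picks up the thinning upper ages and skews upward.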
So, I might collect after all. Did you run a Kolmogorov-Smirnov, or
other specific, test for normality?
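For reference, a sketch of what such a check measures, written with
the Python standard library rather than SPSS. The age range, sample
size, and seed are assumptions for illustration. The cutoff
1.36/sqrt(n) is the usual large-sample 5% value for a fully specified
null; because the mean and SD are estimated from the data here, the
proper reference is the Lilliefors version of the test, whose critical
values are smaller, so this cutoff is conservative.

```python
import math
import random
from statistics import NormalDist, fmean, pstdev

def ks_distance_from_normal(xs):
    """Largest gap between the empirical CDF and a normal CDF fitted to xs."""
    fitted = NormalDist(fmean(xs), pstdev(xs))
    ordered = sorted(xs)
    n = len(ordered)
    d = 0.0
    for i, x in enumerate(ordered):
        cdf = fitted.cdf(x)
        # Compare against the empirical CDF just below and at x.
        d = max(d, cdf - i / n, (i + 1) / n - cdf)
    return d

rng = random.Random(7)  # arbitrary seed
n = 5000

# Two symmetric samples with about the same mean and SD.
uniform_ages = [rng.uniform(30.0, 50.0) for _ in range(n)]
normal_ages = [rng.gauss(40.0, 5.8) for _ in range(n)]

threshold = 1.36 / math.sqrt(n)
d_uniform = ks_distance_from_normal(uniform_ages)
d_normal = ks_distance_from_normal(normal_ages)
print(f"cutoff  D = {threshold:.4f}")
print(f"uniform D = {d_uniform:.4f}")
print(f"normal  D = {d_normal:.4f}")
```

Both samples would pass a skewness check; only the distance from the
fitted normal separates them.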
(I might add that an approximately uniform age distribution will be
just fine for analysis, and there was no need to go beyond the skewness
check, for your purposes. The worst problem would be age outliers;
people near the end of the observed age range have very different
medical problems, of course. But the selection procedures surely
excluded those.)
>I, too, have seen age not be normal (e.g., Poisson distributions,
>U-shaped distributions, etc.). One cannot assume that it is one way or
>the other for no specific reason.
No. On the other hand, the age distribution, whatever it is, is usually
that way because of a selection criterion applied to an overall
population pyramid, and a clear grasp of the explicit or implicit
selection rule is crucial.
(Stories: like a study of at-risk - premature - neonates that showed a
strong negative correlation between birth weight and gestational age
at birth.)
>Sorry to be so nit-picky, but the Central Limit Theorem has nothing at
>all to do with whether a population OR a single sample is normal or
>non-normal.
Actually, I've often seen it argued that it does. Admittedly the
argument is a little hand-wavy, as it deals with effects that can only
be hypothesized to exist.
>The CLT has to do with "sampling" distributions.
First, no; the CLT has to do with distributions of the sums (or means)
of random variables; sampling distributions are one instance.
Now, bear with me. I'm taking a point of view that is standard among
probability theorists but often seems strange to statisticians: the
observations are not selected from a 'population', considered as a
finite, potentially identifiable set of subjects, but are drawn, or
generated, according to distribution and dependency rules.
Consider residuals, then - 'random variation' added to an underlying
value that we actually want. (This is the standard premise of linear
models.) Why would we remotely expect these to be normally distributed?
Here's the hand-wavy part: if there are actually many unobserved
factors whose effects add to form the residuals, if those factors are
statistically independent, and if their variances are comparable
("uniformly bounded" is the correct notion), then the hypotheses of the
CLT apply, and we may with some justice expect approximately normal
residuals.
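A small simulation of that argument, as a sketch only. The 50 factors,
their uniform distributions, and the range of their scales are all
invented for illustration; the point is just that no single factor is
normal, yet their sum is close to it. Excess kurtosis is used as the
yardstick: it is 0 for a normal and -1.2 for a uniform.

```python
import random
from statistics import fmean

def excess_kurtosis(xs):
    """Fourth central moment over squared variance, minus 3 (0 for a normal)."""
    mean = fmean(xs)
    m2 = fmean((x - mean) ** 2 for x in xs)
    m4 = fmean((x - mean) ** 4 for x in xs)
    return m4 / m2 ** 2 - 3.0

# 50 hypothetical unobserved factors with comparable, bounded scales (0.5 to 1.5).
scales = [0.5 + k / 49 for k in range(50)]

def residual(rng):
    """One residual: the sum of independent uniform effects, one per factor."""
    return sum(rng.uniform(-s, s) for s in scales)

rng = random.Random(1)  # arbitrary seed
n = 20_000

single_factor = [rng.uniform(-1.0, 1.0) for _ in range(n)]
summed = [residual(rng) for _ in range(n)]

kurt_single = excess_kurtosis(single_factor)
kurt_summed = excess_kurtosis(summed)
print(f"one factor alone:   excess kurtosis = {kurt_single:+.3f}")
print(f"sum of 50 factors:  excess kurtosis = {kurt_summed:+.3f}")
```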
This model, of residuals that are the sum of many small random effects,
suggests a likely problem: what if they aren't all of comparable size?
Indeed, one of the more commonly observed deviations from normal
residuals is long 'tails': a probability of very large residuals much
greater than that given by the normal distribution. That is what you
get if you have one, or a few, influences that occur rarely but have
high variance when they do occur.
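The same kind of sketch shows the long-tail effect. Here a standard
normal stands in for the combined small effects, and one extra
influence is added that occurs 2% of the time with five times the
standard deviation; both numbers are arbitrary assumptions.

```python
import random
from statistics import NormalDist, fmean, pstdev

def excess_kurtosis(xs):
    """Fourth central moment over squared variance, minus 3 (0 for a normal)."""
    mean = fmean(xs)
    m2 = fmean((x - mean) ** 2 for x in xs)
    m4 = fmean((x - mean) ** 4 for x in xs)
    return m4 / m2 ** 2 - 3.0

def tail_fraction(xs, k=4.0):
    """Share of observations more than k standard deviations from the mean."""
    mean, sd = fmean(xs), pstdev(xs)
    return sum(abs(x - mean) > k * sd for x in xs) / len(xs)

def residual(rng, p_rare=0.02, rare_sd=5.0):
    """Many small effects (approximated by N(0, 1)) plus one rare, large effect."""
    value = rng.gauss(0.0, 1.0)
    if rng.random() < p_rare:
        value += rng.gauss(0.0, rare_sd)
    return value

rng = random.Random(3)  # arbitrary seed
residuals = [residual(rng) for _ in range(100_000)]

kurt = excess_kurtosis(residuals)
observed_tail = tail_fraction(residuals)
normal_tail = 2.0 * (1.0 - NormalDist().cdf(4.0))
print(f"excess kurtosis         = {kurt:.2f}")
print(f"share beyond 4 SD       = {observed_tail:.5f}")
print(f"normal share beyond 4 SD = {normal_tail:.5f}")
```

The contaminated residuals stay symmetric, so a skewness check would
not flag them; the kurtosis and the 4-SD tail share do.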
This model also suggests circumstances where it's unwise to expect
normal residuals. For example, you've good hope that a scale made by
summing Likert-scale responses will be something like normally
distributed around its mean; but there's little chance that's true for
a single Likert item.
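A sketch of the Likert contrast, with invented numbers: one 5-point
response with made-up category probabilities, against the sum of 20
such responses. Treating the 20 responses as independent is a
simplification (real items on one scale are correlated), so read this
as the best case for the summed score.

```python
import random
from statistics import NormalDist, fmean, pstdev

def ks_distance_from_normal(xs):
    """Largest gap between the empirical CDF and a normal CDF fitted to xs."""
    fitted = NormalDist(fmean(xs), pstdev(xs))
    ordered = sorted(xs)
    n = len(ordered)
    d = 0.0
    for i, x in enumerate(ordered):
        cdf = fitted.cdf(x)
        d = max(d, cdf - i / n, (i + 1) / n - cdf)
    return d

values = [1, 2, 3, 4, 5]
weights = [0.10, 0.20, 0.30, 0.25, 0.15]  # hypothetical response probabilities

rng = random.Random(11)  # arbitrary seed
n = 20_000

single = [rng.choices(values, weights=weights)[0] for _ in range(n)]
summed = [sum(rng.choices(values, weights=weights, k=20)) for _ in range(n)]

d_single = ks_distance_from_normal(single)
d_summed = ks_distance_from_normal(summed)
print(f"single response:  {len(set(single))} distinct values, D = {d_single:.3f}")
print(f"sum of 20:        {len(set(summed))} distinct values, D = {d_summed:.3f}")
```

The summed score is still discrete, so its distance from the fitted
normal does not go to zero, but it should be several times smaller
than for the single response.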
Which brings us back to age. Subject ages aren't 'generated'; subjects
really are selected from a population with a known, usually
nowhere-near-normal, distribution of ages. Further, the selection is
almost always for a sub-range of the distribution.
It's hard to argue that the resulting distribution should be normal.
Hard enough, that if I saw a normal distribution of ages in a study,
I'd look skeptically at the selection criterion.
Now, an unskewed distribution, that I can readily believe. But I think
it'll usually look much more like uniform than like normal.
I'm interested in your comments, and anybody's, on what age
distributions are common in real studies.