| Date: | Wed, 1 Mar 2000 14:15:07 -0500 |
| Reply-To: | "Powhatan J. Wooldridge, Ph.D." <pjw@ACSU.BUFFALO.EDU> |
| Sender: | "SPSSX(r) Discussion" <SPSSX-L@LISTSERV.UGA.EDU> |
| From: | "Powhatan J. Wooldridge, Ph.D." <pjw@ACSU.BUFFALO.EDU> |
| Subject: | Re: multiple comparisons - take II |
| In-Reply-To: | <NDBBJJHHMLAPGLKICCPFMEECCBAA.geomar@sunshinecable.com> |
| Content-Type: | TEXT/PLAIN; charset=US-ASCII |
|---|
On Sat, 26 Feb 2000, Jack Wierzchowski wrote:
> Hello,
> Some time ago I submitted a question regarding adjusting the significance
> level to account for multiple comparisons. Dr. Hector Maletta was kind to
> respond to the question (thank you again) and suggested re-submitting it to
> solicit more answers to it. Here it is again (any help will be most
> appreciated):
>
>
> "I have a generic question regarding what constitutes a multiple comparison
> (that is, when does one need to adjust the significance level to account for
> multiple comparisons):
> I have a set of bear radiolocations split by age into three groups -
> juvenile, mature, old. Each radiolocation (case) has a number of attributes
> attached to it (distance to roads, elevation, habitat quality). These
> attributes are our set of independent ratio variables. For the 3-class
> grouping (age) I clearly see a need to adjust significance level for
> multiple comparisons being made on a given variable. What is not clear to me
> is whether the fact that I ran three separate ANOVAs (one on each variable)
> may influence the significance level for detecting the differences among the
> means . To simplify the issue and make it more generic - for a binary model
> (say, males versus females) - will testing for the differences between the
> means on three INDEPENDENT variables constitute a case of multiple
> comparison in which adjustments to significance levels are required? "
>
> I believe that the answer should be a "no" because an increase in the
> probability of finding a statistically significant difference occurs when
> THE SAME means are tested several times (again, if one has, say, three age
> groups and runs a test on the distance to roads, the fact that the same
> means get tested several times introduces an increased risk of detecting
> significant difference when in fact such difference is not present). In
> this case the tests are performed on a number of INDEPENDENT variables
> (unrelated means). However, some editors of the wildlife management
> journals insist that such adjustments to alphas are necessary, because THE
> SAME DATASET (radiotelemetry locations) is tested on many variables, which,
> in their minds, constitutes a multiple testing situation. Who is right?
> Jack"
>
Dear Jack--
The answer is that there is no clearcut answer. Who is "right" depends on
which kind(s) of "instability" wildlife journals need to protect against
in publishing research reports and/or interpreting the findings of such
reports. This is the sort of issue about which different editors and/or
journals might legitimately reach different decisions. (Caution: Unless
you are truly interested in this, stop here.)
My own point of view, for what that is worth, is that it would be more
dysfunctional than functional for wildlife journals to require that a
multivariate test of statistical significance be performed AND FOUND TO BE
SIGNIFICANT before proceeding to ANOVAs for each "attribute" in research
of the kind that you describe. This conclusion is based on the fact that
such a policy would tend to select articles which examine very few
variables (perhaps through a priori selection of those variables for which
a relationship is likely to exist and perhaps by ex post facto selection)
and to reject articles which include results on all the variables for
which data are readily available. This might even encourage researchers to
do separate studies and/or present separate articles on each of the
variables in question, in order to increase their power enough to publish.
In addition, there is a considerable selection bias in what gets
published, since journals prefer articles with statistically significant
findings. A requirement that the multivariate test must also be
significant when several relationships with the same variable are examined
in order for some of these to be treated as significant would exacerbate
this problem.
Speaking more generally, I believe that statistical inference should be
based much more on placing confidence intervals around parameter estimates
and less on tests of statistical significance per se. Statistical
significance tests the stability of the direction of the relationship, but
leaves its magnitude in question. Failure to reject the null hypothesis is
almost always due to lack of sample size, not to an absence of any
relationship in the population. For all these reasons and more, I consider
the emphasis currently placed on statistical significance to be
dysfunctional in many respects. I see scientific research as the
accumulation of generalizable evidence about the degree to which variables
are related, not as a series of dichotomous choices as to whether
variables are or are not related. I also see it as theory driven, not as
simply descriptive and/or operational. (I think of what Cook and Campbell
refer to as the "external validity" problem of "conceptual validity" as a
kind of internal validity problem, and organize my tests of significance
accordingly when I have several measures of the same theoretical variable,
much as in structural equation analysis (as in LISREL, for example).) This
influences my views on the extent to which any given statistical inference
practice is likely to be functional (right) or dysfunctional (wrong).
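
As a rough illustration of the confidence-interval point made at the start of the paragraph above, here is a minimal Python sketch. The data are invented for illustration, the function name `diff_in_means_ci` is hypothetical, and the interval uses a large-sample normal approximation; with samples this small a t-based interval would be wider.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def diff_in_means_ci(a, b, level=0.95):
    """Normal-approximation CI for mean(a) - mean(b).

    Returns (difference, (lower, upper)). Assumes two independent samples;
    the variances are estimated separately for each group.
    """
    diff = mean(a) - mean(b)
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return diff, (diff - z * se, diff + z * se)


if __name__ == "__main__":
    # Invented distances to the nearest road (km) for two age groups.
    juvenile = [0.8, 1.1, 1.4, 0.9, 1.6, 1.2, 1.0, 1.3]
    mature = [1.5, 1.9, 1.2, 2.1, 1.7, 1.8, 1.4, 2.0]
    diff, (lower, upper) = diff_in_means_ci(juvenile, mature)
    print(f"difference in means: {diff:.2f} km")
    print(f"95% interval: ({lower:.2f}, {upper:.2f}) km")
```

The interval reports both the direction and the plausible magnitude of the difference, whereas a significance test alone speaks only to whether zero can be excluded.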
As a consequence of all the foregoing, I tend to agree with your position
(at least I think I do), but not necessarily with the logic by which you
arrived at it. On the one hand, I would have no strong objection to
having a policy that multivariate tests must always be run and published
when a single variable is related to several different variables, so long
as lack of significance on the multivariate test were not considered to
preclude further analyses for each variable taken separately. On the other
hand, I don't think that such tests are very helpful, unless the
relationships involve a common issue about which a single conclusion must
be reached. (This is similar to what I understand your point of view to
be.) This does not mean that the editors who think otherwise are "wrong"
in any way that could be proven mathematically, however.
You may be looking for more than the above, since Hector's answer failed
to satisfy you and you have presented a rather detailed argument for your
own point of view. I will go on, therefore, to a more technical
consideration of the points that you raise. (Warning: Listserv readers
may find the rest of this memo to be even more confusing and/or
nitpicking than the above. If you're not really, really interested in this
topic, but somehow read this far anyway, STOP.)
Some preliminary matters need to be clarified before proceeding to your
main questions. The variables/"attributes" that you refer to (distance to
roads, habitat quality, and elevation) are presumably the results of
differences in bear age, rather than the causes of those differences, so one
could claim that you are "wrong" to call them "independent" variables. The
parenthetical context of your statement ("unrelated means") suggests,
however, that you mean "independent" to refer to their relations to one
another, rather than to age. Even here, it seems unlikely that they are
truly independent in the statistical sense. Can it really be true that
these three "attributes" have no correlation whatsoever to one another in
the bear population? I suppose it may be remotely possible that their
intercorrelations are rather small, but I would certainly have expected
habitat quality to be rather substantially correlated with distance to
roads and elevation. By "independent", do you mean simply that these are
distinct variables, rather than variables which are closely linked to one
another by theory or operational overlap? To digress somewhat, one could
also argue that these variables are not truly bear "attributes" (as age
and gender would be, for example) so that you were "wrong" to refer to
them that way.
Your choice of terminology may be standard practice among wildlife
researchers, and thus unlikely to mislead in that context. To the extent
that your question is statistical, however, the way you are using these
terms is potentially confusing (at least to me). This is particularly true
with respect to the term "independent". I don't mean to be unduly
argumentative; I just want to point out that nearly everything we do or
say in research, including the terms that we use, varies in "correctness"
according to functional context. You might want to keep that in mind,
because it applies also to the issue of which omnibus tests of
significance are "right" for the type of research that you do.
Your memo doesn't make it clear whether or not you are starting with some
a priori hypotheses about the relationships you are likely to find between
your three categories of bear age and the various "independent" variables
you cite. Most statisticians believe that there are major differences in
the right way(s) to use tests of statistical significance according to
whether one is using them to test a priori hypotheses or using them in the
absence of any a priori hypotheses in "fishing expeditions" to test which
of a number of variables of interest are related, and the form of those
relationships. [Perhaps with bears the term "hunting expeditions" would be
preferable. :-) ]
It is generally considered appropriate to use a priori theory to narrow
the variable pairings and patterns of relationships to be investigated, as
well as the "null" hypotheses to be considered as potential alternatives.
Your discussion makes it clear, for example, that you wish to focus on any
pattern of differences in attribute means between the three categories
into which you collapsed the continuous variable "age". This was
presumably done because a priori knowledge suggested that the
relationships between age and "attribute" variables of the kind under
consideration are likely to be discontinuous and nonmonotonic, and was not
influenced by sample characteristics. (If this focus was determined AFTER
examining the actual data patterns in your sample, then the meaning of
your tests of statistical significance would change dramatically. Indeed,
many statisticians would say that none of the tests you discuss would be
"right" in that circumstance.)
Your choice of statistical procedures suggests that you are more concerned
with the effects of age on "averages" than with its effects on
"variability". Once again, I assume that either effects on averages are
the ones that have the most clearcut implications for the applications
that concern wildlife researchers, or that the relationship of age to the
variability of these attributes is known to be small. Whether or not that
is so, however, you should be aware that you are at least implicitly using
a priori considerations to focus your inquiry.
If you had gone a step or two further, and proposed a priori hypotheses
about how each attribute was expected to be related to juvenile/mature/old
age, then you could have used planned contrasts specific to the a priori
patterns of differences in means that you had predicted. This would have
increased power if you got the predicted pattern, but decreased it to near
zero if a different pattern emerged. Since you did not do so, however, I
think that the common practice of requiring a statistically significant
omnibus test (the F test from a one way ANOVA) and/or using the Scheffe
test or its equivalent to compare between group means is usually
functional. I don't see even that as necessarily a given, however.
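
The trade-off between an omnibus F test and a planned contrast described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the data are invented, the function names are hypothetical, and only the test statistics are computed. Turning them into p-values would need an F or t distribution from a statistics library, which is left out here.

```python
from math import sqrt
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


def planned_contrast_t(groups, weights):
    """t statistic for the contrast sum(w_i * mean_i), weights summing to zero.

    Uses the pooled within-group mean square; df = N - number of groups.
    """
    n_total = sum(len(g) for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_within = ss_within / (n_total - len(groups))
    estimate = sum(w * mean(g) for w, g in zip(weights, groups))
    se = sqrt(ms_within * sum(w * w / len(g) for w, g in zip(weights, groups)))
    return estimate / se


if __name__ == "__main__":
    # Invented scores for juvenile, mature, and old bears.
    groups = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
    f_stat, df_b, df_w = one_way_anova_f(groups)
    print(f"omnibus F({df_b}, {df_w}) = {f_stat:.3f}")
    # Predicted pattern matches the data: a steady increase with age.
    print(f"linear contrast t = {planned_contrast_t(groups, (-1, 0, 1)):.3f}")
    # Predicted pattern does not match: mature differs from the other two.
    print(f"quadratic contrast t = {planned_contrast_t(groups, (1, -2, 1)):.3f}")
```

In this made-up data the means rise steadily with age, so the contrast that predicts that pattern concentrates all of the between-group variation into a single degree of freedom, while the contrast that predicts a different pattern finds nothing.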
Suppose you found that you unexpectedly had only one or two bears in one
of the three categories, and ample sample sizes in the other two, for
example. (I don't suppose that this is likely in your research, but it
happens all the time in mine.) I would not think it "wrong" just to test
for a difference in means between the two categories for which the sample
sizes were large enough to give you enough power to stand a good chance of
getting significance. I would NOT insist on your using all three groups,
with a consequent loss of overall power, just because that was what you
had originally planned to do when you thought that the N's would be
approximately equal.
In the above circumstance, I would (in my journal editor incarnation)
suggest that you report the means for the third group, but comment that
small sample sizes had precluded their inclusion in significance testing,
and warn that their means were too unstable due to small sample size to
warrant meaningful comparisons. (Note: This should be done BEFORE
determining that excluding the third group would actually increase the
level of statistical significance. In other words, this strategy should be
declared, and hypotheses adjusted accordingly, just as soon as it becomes
clear that there aren't going to be enough cases in one of the groups. It
would be inappropriate to change back if a one way ANOVA with all three
categories would have been statistically significant, whereas testing just
the two large N categories against one another was not.) Would a wildlife
editor agree with the above? My guess is that some would, and some would
not. Who would be "right"? Well I could justify the above strategy as more
functional than not, but I am reasonably sure that some statisticians
would say that I am flat out wrong.
I know that I have wandered away from the question you asked, but I have
deliberately broadened my answer to include other similar situations in
which the form and/or number of the test(s) to be used can in some
manner raise the issue of lumping multiple tests and/or comparisons under
a single test, in order to control the overall error rate, rather than
just considering each kind of error separately. The point that I am
trying to make is that the kind of question you are raising comes up in
multiple contexts. Even in contexts where the standard practice is
to start with an omnibus test, or otherwise take the overall error rate
into account, treating each issue/comparison separately would not be
demonstrably WRONG.
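
For readers who want the arithmetic behind the "overall error rate" mentioned above, here is a minimal Python sketch. It assumes the tests are statistically independent and that every null hypothesis is true, which may well not hold for the three variables in question. The function names are hypothetical.

```python
def familywise_error_rate(alpha: float, k: int) -> float:
    """Chance of at least one false positive across k independent tests,
    each run at level alpha, when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** k


def bonferroni_alpha(alpha: float, k: int) -> float:
    """Per-test level that keeps the familywise rate at or below alpha."""
    return alpha / k


if __name__ == "__main__":
    k = 3  # e.g. distance to roads, elevation, habitat quality
    alpha = 0.05
    adjusted = bonferroni_alpha(alpha, k)
    print(f"unadjusted familywise rate: {familywise_error_rate(alpha, k):.4f}")
    print(f"Bonferroni per-test alpha:  {adjusted:.4f}")
    print(f"adjusted familywise rate:   {familywise_error_rate(adjusted, k):.4f}")
```

The Bonferroni bound also holds when the tests are correlated, though it then tends to be conservative. Whether the familywise rate is the quantity one needs to control at all is exactly the judgment call discussed in this memo.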
To get back to your specific question, if your research had a priori
hypotheses, and a theoretical basis, the issue of whether or not the same
theoretical/general proposition underlies all three hypotheses might then
be raised. If all three analyses related to a single underlying more
general hypothesis, and if you had no a priori reason to believe any of
the analyses in question to be more accurate in testing that hypothesis
than any other, then it would (in my opinion) be correct to use a
multivariate test for the purpose of testing the theoretical/general
proposition in question. In testing several operational hypotheses which
all relate to a single underlying general (theoretical) hypothesis,
multivariate analyses seem a logical way of testing the underlying general
hypothesis, about which a single conclusion needs to be reached.
Your message makes it clear, however, that you do not see any underlying
general proposition to which all three variables relate. In my opinion, it
would therefore be correct to run three separate tests for the three
separate hypotheses without bothering to run a multivariate test. In
testing operational hypotheses which are NOT interrelated,
multivariate tests do not serve any useful theoretical function, since the
truth or falsity of each hypothesis involves a separate and distinct
issue. I would not, therefore, insist on a multivariate test if I were a
wildlife journal editor. That is not because it would be demonstrably
WRONG to do so in any mathematical sense, but because I consider the
"multivariate tests first, regardless" approach to be dysfunctional to the
purpose of fostering cumulative scientific inquiry. There are those who
would disagree with that, however. Some might even think that much of what
I have said above is "wrong", because it underestimates their concept of
THE error rate against which researchers MUST protect themselves. In my
opinion, the best approach is to CHOOSE which kind(s) of chance errors you
need to protect against for a given PURPOSE, then run the test that does
exactly what you want, no more and no less.
To end where I began, there is no clearcut right or wrong answer to the
question you raise, or to issues of functionality in general. They all
depend on context and criteria.
***************************************************************************
Powhatan J. Wooldridge, Assoc. Professor, Nursing, State Univ. NY at
Buffalo