Date: Tue, 4 May 2010 12:43:40 -0400
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Peter Flom <peterflomconsulting@MINDSPRING.COM>
Subject: Re: interaction & main effect question
Content-Type: text/plain; charset="UTF-8"
Robert Feyerharm wrote
I have a question concerning the inclusion of main effects & interaction
terms in a model. The consensus seems to be that main effects should be
included in a model if the interaction term is statistically significant.
Rather than this, I'd say main effects should be included whenever the interaction is included. Some people point out rare exceptions (there was a paper by David Rindskopf on this, perhaps a decade ago), but this is the usual rule.
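One common argument for keeping main effects alongside an interaction, offered here only as an illustration and not as something stated in this thread, is that an interaction-only model changes form when a predictor is recentered. The sketch below uses a hypothetical coefficient B3 and a hypothetical shift SHIFT; substituting x1 = x1_new + SHIFT into y = B3*x1*x2 produces an extra term (B3*SHIFT)*x2, which is a main effect of x2.

```python
# Hypothetical illustration: an interaction-only model, y = B3 * x1 * x2.
B3 = 2.0     # assumed coefficient, chosen only for this example
SHIFT = 5.0  # assumed recentering constant: x1_new = x1 - SHIFT


def y_original(x1, x2):
    """Interaction-only model on the original scale of x1."""
    return B3 * x1 * x2


def y_recentered(x1_new, x2):
    """Same model written in terms of the recentered x1_new = x1 - SHIFT.

    Expanding B3 * (x1_new + SHIFT) * x2 gives an interaction term
    plus a main effect of x2 with coefficient B3 * SHIFT.
    """
    interaction = B3 * x1_new * x2
    main_effect_x2 = (B3 * SHIFT) * x2
    return interaction + main_effect_x2


# The two forms agree at every point, so the "no main effect" property
# depends on where the origin of x1 happens to be.
points = [(1.0, 3.0), (4.0, -2.0), (7.5, 0.5)]
for x1, x2 in points:
    assert abs(y_original(x1, x2) - y_recentered(x1 - SHIFT, x2)) < 1e-9

implied_main_effect = B3 * SHIFT
print(implied_main_effect)
```

If the main effects are in the model, recentering a predictor only changes their coefficients rather than the set of terms the model needs.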
My understanding (from reading Hosmer & Lemeshow's text on logistic
regression) is that first a main effects model should be tested, and then
interaction effects can be tested from any main effects that were found to
be significant. As opposed to adding main effect variables post-hoc after
an interaction term is found to be significant.
Is this a valid approach to model construction that will
consistently identify all possible interactions?
This is my understanding of Hosmer and Lemeshow as well.
But it is not a process I can endorse. It relies much too heavily on the notion of statistical significance; in my view, this should play virtually no role in model building. Further, it is entirely possible to have important interactions when there are no main effects. It is also possible to have important effects that are not significant. Further, it is possible to have a small effect be important. For instance, if the literature shows that a certain effect is large, and you show it is small, then including that may be very interesting to the progress of science.
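The point about interactions without main effects can be seen in a small made-up example. The sketch below assumes two effect-coded factors (levels -1 and +1), a balanced 2x2 layout, and a hypothetical coefficient B3; none of these values come from real data. The marginal (main) effects come out as zero, while the effect of x1 within each level of x2 is large and opposite in sign.

```python
from statistics import mean

B3 = 3.0          # assumed interaction coefficient (illustrative only)
levels = [-1, 1]  # effect-coded levels for two hypothetical factors

# Cell means of a balanced 2x2 design generated by y = B3 * x1 * x2.
cells = {(x1, x2): B3 * x1 * x2 for x1 in levels for x2 in levels}


def marginal_mean(factor_index, level):
    """Average the cell means over the other factor."""
    return mean(y for key, y in cells.items() if key[factor_index] == level)


# Main effects: difference between marginal means at the two levels.
main_effect_x1 = marginal_mean(0, 1) - marginal_mean(0, -1)
main_effect_x2 = marginal_mean(1, 1) - marginal_mean(1, -1)

# Simple effects of x1 within each level of x2.
effect_x1_when_x2_high = cells[(1, 1)] - cells[(-1, 1)]
effect_x1_when_x2_low = cells[(1, -1)] - cells[(-1, -1)]

print(main_effect_x1, main_effect_x2)
print(effect_x1_when_x2_high, effect_x1_when_x2_low)
```

A procedure that screens on main effects first would drop both factors here before the interaction was ever examined.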
It makes sense from a practical viewpoint. Testing an initial model with
main effects *and* all possible interactions thrown in would seem to risk
over specification. For example, with only 10 main effect variables in a
proposed model, there are C(10,2) = 10!/(2! 8!) = 45 possible interaction terms
which could be added to the initial model in addition to the main effects.
That's way too many terms IMO.
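The count above can be checked with a few lines of Python. This is a minimal sketch using placeholder predictor names x1 through x10 (hypothetical, not variables from any real model); it enumerates the pairs directly and also counts the three-way interactions for comparison.

```python
from itertools import combinations
from math import comb

# Placeholder names for 10 main-effect variables.
predictors = [f"x{i}" for i in range(1, 11)]

# Enumerate every two-way interaction as an unordered pair.
two_way = list(combinations(predictors, 2))
n_two_way = len(two_way)

# The enumeration matches the closed form 10! / (2! * 8!).
assert n_two_way == comb(10, 2)

# Three-way interactions grow faster still.
n_three_way = comb(10, 3)

# Main effects plus all two-way interactions.
total_terms = len(predictors) + n_two_way

print(n_two_way, n_three_way, total_terms)
```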
Nevertheless, I can certainly visualize, from a geometric standpoint, a
regression model which includes an interaction term but *no* main effect
terms (that is, y = beta3*x1*x2). See the third graph from the top on the
following page from UCLA's Academic Technology Services:
David Cox said
"There are no routine statistical questions, only questionable statistical routines".
Certainly looking at ALL the two way interactions (to say nothing of three-way and higher interactions) leads to a very complex model, and one in which there are almost certainly too many terms.
But we must let research guide statistics, and not the other way around. What are the questions of interest? What interactions make sense? Which might be important if they were found?
It is true that some statistical analysis is exploratory. But I would maintain that NO analysis is completely exploratory. Why were these data and not others collected? Except for the very worst type of data mining, we collect and look at data for SOME reason. We do not throw the statistical abstract of the United States into a blender and press FRAPPE. (I am not quite sure what FRAPPE means, but I've seen it on blenders).
Robert Abelson titled his book "Statistics as Principled Argument". That is what statistics should be - part of a principled argument about what the data mean.