If you are interested, I will be happy to send you my contrast generator.
Using this tool, you specify the contrast in one of several ways. The macro
fills out the relevant components of the contrast. Using my macro, all
contrasts are always estimable. They may not be sensible, but they are
always estimable.
Paul A. Thompson, Ph.D.
Division of Biostatistics, Washington University School of Medicine
660 S. Euclid, St. Louis, MO 63110-1093
314-747-3793
paul@wubios.wustl.edu
-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of Dale
McLerran
Sent: Monday, May 19, 2008 12:39 PM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: Contrasts, interactions, data step and where
--- Peter Flom <peterflomconsulting@MINDSPRING.COM> wrote:
> I am working on a paper for NESUG on PROC LOGISTIC.
>
> Mostly for PROC LOGISTIC, but also other PROCs, I am investigating
> writing contrast statements for interactions. It's tricky, and there
> is little guidance around, and it seems very easy to write the
> contrast incorrectly.
>
> So I'm wondering if there is any reason not to do this differently,
> using WHERE and possibly DATA step processing.
>
> That is, instead of trying to do something like
>
> PROC LOGISTIC data = dataset;
> model depvar = indvar1 indvar2 indvar1*indvar2;
> contrast XXXXXX;
> run;
>
> use something like
>
> PROC LOGISTIC data = dataset;
> model depvar = indvar1 indvar2 indvar1*indvar2;
> where indvar1 = XXX and indvar2 = YYY;
> run;
>
> of course, this is more limited, it might be hard to write it up for
> complex contrasts like "is the average of these two different from
> the average of those three" but, in my experience, people are not
> usually interested in that sort of thing.
>
> Any thoughts?
>
> Peter
>
> Peter L. Flom, PhD
> Statistical Consultant
> www DOT peterflom DOT com
>
Peter,
When all predictor variables are categorical and have complete
interaction specification to the order of the number of predictor
variables, the approach which you suggest will produce the same
numerator for any specified contrast as you would obtain from
complete data analysis. When this condition is met, a cell means
model is being fit. Because you are fitting a cell means model, you
cannot misspecify the mean by restricting the data to only the
cell(s) necessary for the contrast of interest. However, the
standard errors are likely to be poorly estimated.
Let's parse this a little bit. What does it mean to say that all
predictor variables are categorical and have complete interaction
specification to the order of the number of predictor variables?
This might be best illustrated through an example. The following
code fragment with CLASS and MODEL statements exhibits the property
specified in the condition:
class A B C;
model y = A | B | C;
Note that in this example, we have three predictor variables which
are all categorical. The model specification can be expanded as
model y = A B C A*B A*C B*C A*B*C;
Every categorical variable main effect is specified in the model,
every two-variable combination is specified, and the (only) three-
variable combination is specified in the model. The CLASS and
MODEL statements result in a model which predicts the cell mean
for every cell of the three way combination of predictors.
On the other hand, the code fragment below does not meet the criteria
outlined in the first paragraph:
class A B C;
model y = A | B | C @2;
The above model statement will include all of the main effects and
all of the two-way interaction effects, but excludes the three-way
interaction. Because the three-way interaction is excluded, we
do not have a cell means model, which is a requirement for obtaining
the correct point estimates of effects for your restricted data
approach.
Now, how about the standard errors? For simplicity, let's assume
that the response Y is normally distributed and that you have two
two-level categorical predictor variables. Thus, the data can be
summarized as:
                       A
   B          A(1)         A(2)
         |------------|------------|
         |            |            |
  B(1)   | Ybar(1,1)  | Ybar(1,2)  |  Ybar(1,.)
         |            |            |
         |------------|------------|
         |            |            |
  B(2)   | Ybar(2,1)  | Ybar(2,2)  |  Ybar(2,.)
         |            |            |
         |------------|------------|
           Ybar(.,1)    Ybar(.,2)     Ybar(.,.)
You might be interested in the contrast Ybar(1,1)-Ybar(1,2) in which
case you could fit a model in which you would restrict the data
to B=B(1) and then just fit a model in which you specify the A
main effect. You will get the correct point estimates Ybar(1,1)
and Ybar(1,2) that are required for this contrast. But what happens
to the standard error when you do this? You will compute the
residual error variance using only data where B=B(1). Because
you fit a cell means model, the residual variance is simply the
sum across all cells of the within-cell sums of squares, divided
by the appropriate degrees of freedom. When the complete data are
employed, the numerator of that variance estimate uses deviations
from all four cell means. When you restrict the data for the above
contrast, you are using only the cells with Ybar(1,1) and Ybar(1,2)
to compute your variance estimate.
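The two analyses described above can be sketched in SAS roughly as
follows. This is only an illustration under stated assumptions: the
data set name "cells" is hypothetical, A and B are assumed to be
coded 1 and 2, and PROC GLM is used because the response is assumed
to be normally distributed. The ESTIMATE coefficients assume the
CLASS order A B with levels sorting as 1, 2, so that the A*B
parameters are ordered (A1,B1) (A1,B2) (A2,B1) (A2,B2).

```sas
/* Complete data: cell means model, contrast Ybar(1,1) - Ybar(1,2), */
/* i.e. A(1) versus A(2) within B(1).                               */
/* "cells" is a hypothetical data set with variables A, B and y.    */
proc glm data=cells;
  class A B;
  model y = A B A*B;
  estimate 'A1 - A2 at B=1 (all data)' A 1 -1  A*B 1 0 -1 0;
run;
quit;

/* Restricted data: same point estimate, but the residual variance  */
/* is computed from the two B(1) cells only.                        */
proc glm data=cells;
  where B = 1;
  class A;
  model y = A;
  estimate 'A1 - A2 (B=1 only)' A 1 -1;
run;
quit;
```

The two ESTIMATE results should agree in the Estimate column and
differ in the standard error, t value and error degrees of freedom.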
Now, for the contrast which we specified above, we used half of
the cells from our complete data table. Because we have a
continuous response, there is essentially a zero percent chance
that the sample variance of the data which are included in the
contrast of interest (B=B(1)) is the same as the sample variance of
the data which are excluded when we focus in on the specified
contrast. Relative to the complete-data estimate, there is roughly
a 50% chance that we will overstate the residual variance and
roughly a 50% chance that we will understate it.
I should concede that you will have a residual variance which has
correct expectation. Thus, you will not bias your contrast toward
acceptance or rejection just because you have restricted your data.
But you will not obtain the same standard error, and hence the same
test statistic, for your contrast as the person who has employed
the complete data and constructed a contrast statement that
reflects the complete data design. I would argue that the person
who used the complete data to estimate their residual variance
would have the better test simply because their estimate of the
variance is more stable.
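The stability argument can be written out for the 2x2 example. The
notation below is introduced here for illustration and is not from
the original post: n_ij and SS_ij are the sample size and
within-cell sum of squares for the cell with mean Ybar(i,j), N is
the total sample size, and a common error variance sigma^2 across
cells with normal errors is assumed.

```latex
\[
  s^2_{\text{full}} = \frac{SS_{11} + SS_{12} + SS_{21} + SS_{22}}{N - 4},
  \qquad
  s^2_{\text{sub}} = \frac{SS_{11} + SS_{12}}{n_{11} + n_{12} - 2}
\]
\[
  \widehat{\mathrm{SE}}\bigl(\bar{Y}_{11} - \bar{Y}_{12}\bigr)
  = \sqrt{ s^2 \left( \frac{1}{n_{11}} + \frac{1}{n_{12}} \right) },
  \qquad s^2 \in \{\, s^2_{\text{full}},\; s^2_{\text{sub}} \,\}
\]
\[
  \mathrm{E}\bigl[s^2_{\text{full}}\bigr]
  = \mathrm{E}\bigl[s^2_{\text{sub}}\bigr] = \sigma^2,
  \qquad
  \mathrm{Var}\bigl[s^2_{\text{full}}\bigr] = \frac{2\sigma^4}{N - 4}
  \;<\;
  \mathrm{Var}\bigl[s^2_{\text{sub}}\bigr] = \frac{2\sigma^4}{n_{11} + n_{12} - 2}
\]
```

Under these assumptions both estimators have the same expectation,
but the complete-data estimator has more degrees of freedom (N - 4
versus n_11 + n_12 - 2) and therefore a smaller sampling variance.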
To this point, I have not even discussed designs where there is
anything other than a cell means model which is fit. But when the
complete data model is not a cell means model, then restricting
the data to only some subset will produce point estimates which
do not match the point estimates obtained when the complete data
are employed. I won't elaborate on this right now, but just raise
the issue for you to consider.
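One simple way to see the mismatch, as a sketch only: take the 2x2
layout above but fit a main-effects model (no A*B term), which is
not a cell means model. The data set name "cells" and the 1/2 coding
of A and B are again assumptions made for illustration.

```sas
/* Complete data, main effects only: the A(1) - A(2) difference is  */
/* estimated from all four cells and is the same at every B level.  */
proc glm data=cells;
  class A B;
  model y = A B;
  estimate 'A1 - A2 (all data, additive)' A 1 -1;
run;
quit;

/* Restricted data: the estimate is the raw difference              */
/* Ybar(1,1) - Ybar(1,2) from the two B(1) cells.                   */
proc glm data=cells;
  where B = 1;
  class A;
  model y = A;
  estimate 'A1 - A2 (B=1 only)' A 1 -1;
run;
quit;
```

Unless the observed cell means happen to be exactly additive, the
two Estimate values will generally differ, so the restricted fit no
longer reproduces the complete-data point estimate.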
Just to summarize, I would not want to be presented with contrast
results which are based on restricting the data to some subset of
the entire data set. In the best of circumstances (where your model
is a cell means model and your contrast is a simple difference of
cell means), you will have the correct point estimate for the
numerator of your test but you will employ a poor estimate of the
variance in the test statistic denominator. In other circumstances
(where your model is not a simple cell means model), you will not
even get the same point estimate for the numerator as would be
obtained from complete data analysis.
Dale
---------------------------------------
Dale McLerran
Fred Hutchinson Cancer Research Center
mailto: dmclerra@NO_SPAMfhcrc.org
Ph: (206) 667-2926
Fax: (206) 667-5977
---------------------------------------