LISTSERV at the University of Georgia
Date:   Mon, 19 May 2008 12:45:32 -0500
Reply-To:   "Paul A. Thompson" <paul@WUBIOS.WUSTL.EDU>
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   "Paul A. Thompson" <paul@WUBIOS.WUSTL.EDU>
Subject:   Re: Contrasts, interactions, data step and where
Comments:   To: Dale McLerran <stringplayer_2@YAHOO.COM>
In-Reply-To:   <755575.28860.qm@web32205.mail.mud.yahoo.com>
Content-Type:   text/plain; charset="us-ascii"

If you are interested, I will be happy to send you my contrast generator.

Using this tool, you specify the contrast in one of several ways. The macro fills out the relevant components of the contrast. Using my macro, all contrasts are always estimable. They may not be sensible, but they are always estimable.

Paul A. Thompson, Ph.D.
Division of Biostatistics, Washington University School of Medicine
660 S. Euclid, St. Louis, MO 63110-1093
314-747-3793
paul@wubios.wustl.edu

-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of Dale McLerran
Sent: Monday, May 19, 2008 12:39 PM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: Contrasts, interactions, data step and where

--- Peter Flom <peterflomconsulting@MINDSPRING.COM> wrote:

> I am working on a paper for NESUG on PROC LOGISTIC.
>
> Mostly for PROC LOGISTIC, but also other PROCs, I am investigating
> writing contrast statements for interactions. It's tricky, and there
> is little guidance around, and it seems very easy to write the
> contrast incorrectly.
>
> So I'm wondering if there is any reason not to do this differently,
> using WHERE and possibly DATA step processing.
>
> That is, instead of trying to do something like
>
> PROC LOGISTIC data = dataset;
>   model depvar = indvar1 indvar2 indvar1*indvar2;
>   contrast XXXXXX;
> run;
>
> use something like
>
> PROC LOGISTIC data = dataset;
>   model depvar = indvar1 indvar2 indvar1*indvar2;
>   where indvar1 = XXX and indvar2 = YYY;
> run;
>
> of course, this is more limited, it might be hard to write it up for
> complex contrasts like "is the average of these two different from
> the average of those three" but, in my experience, people are not
> usually interested in that sort of thing.
>
> Any thoughts?
>
> Peter
>
> Peter L. Flom, PhD
> Statistical Consultant
> www DOT peterflom DOT com
>

Peter,

For the case where all predictor variables are categorical and have complete interaction specification to the order of the number of predictor variables, then the approach which you suggest will produce the same numerator for any specified contrast as you would obtain from complete data analysis. When this condition is met, a cell means model is being fit. Because you are fitting a cell means model, you cannot misspecify the mean by restricting the data to only the cell(s) necessary for the contrast of interest. However, the standard errors are likely to be poorly estimated.

Let's parse this a little bit. What does it mean to say that all predictor variables are categorical and have complete interaction specification to the order of the number of predictor variables? This might be best illustrated by example. The following code fragment, with CLASS and MODEL statements, exhibits the quality specified in the condition:

   class A B C;
   model y = A | B | C;

Note that in this example, we have three predictor variables which are all categorical. The model specification can be expanded as

   model y = A B C A*B A*C B*C A*B*C;

Every categorical variable main effect is specified in the model, every two-variable combination is specified, and the (only) three-variable combination is specified in the model. The CLASS and MODEL statements result in a model which predicts the cell mean for every cell of the three-way combination of predictors.

On the other hand, the code fragment below does not meet the criteria outlined in the first paragraph:

   class A B C;
   model y = A | B | C @2;

The above model statement will include all of the main effects and all of the two-way interaction effects, but excludes the three-way interaction. Because the three-way interaction is excluded, we do not have a cell means model, which is a requirement for obtaining the correct point estimates of effects for your restricted data approach.
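The difference between the two MODEL statements can be checked mechanically. What follows is only a small Python sketch (not SAS) of how the bar operator and the @ limit expand; the function names `expand_bar` and `is_cell_means_model` are made up for this illustration. Effects are grouped by order, as in the expanded MODEL statement above, and the order in which SAS itself lists the effects may differ.

```python
from itertools import combinations


def expand_bar(factors, max_order=None):
    """Expand a bar specification such as A | B | C into its effects.

    max_order plays the role of the @n limit; None means no limit.
    (Hypothetical helper for illustration only, not a SAS facility.)
    """
    if max_order is None:
        max_order = len(factors)
    effects = []
    for order in range(1, max_order + 1):
        for combo in combinations(factors, order):
            effects.append("*".join(combo))
    return effects


def is_cell_means_model(factors, max_order=None):
    """True only if the highest-order interaction is in the model."""
    return "*".join(factors) in expand_bar(factors, max_order)


factors = ["A", "B", "C"]
print(" ".join(expand_bar(factors)))     # full factorial: A | B | C
print(" ".join(expand_bar(factors, 2)))  # limited: A | B | C @2
print(is_cell_means_model(factors), is_cell_means_model(factors, 2))
```

The first specification contains A*B*C and so has one parameter combination per cell; the @2 version does not, which is why it fails the condition.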

Now, how about the standard errors? For simplicity, let's assume that the response Y is normally distributed and that you have two two-level categorical predictor variables. Thus, the data can be summarized as:

                         A
   B           A(1)         A(2)
           |------------|------------|
           |            |            |
   B(1)    | Ybar(1,1)  | Ybar(1,2)  |  Ybar(1,.)
           |            |            |
           |------------|------------|
           |            |            |
   B(2)    | Ybar(2,1)  | Ybar(2,2)  |  Ybar(2,.)
           |            |            |
           |------------|------------|
             Ybar(.,1)    Ybar(.,2)     Ybar(.,.)

You might be interested in the contrast Ybar(1,1)-Ybar(1,2), in which case you could restrict the data to B=B(1) and then fit a model in which you specify only the A main effect. You will get the correct point estimates Ybar(1,1) and Ybar(1,2) that are required for this contrast. But what happens to the standard error when you do this? You will compute the residual error variance using only data where B=B(1). Because you fit a cell means model, the residual variance is simply the sum across all cells of the within-cell sums of squares, divided by the appropriate degrees of freedom. When the complete data are employed, the numerator of that variance estimate uses deviations from all four cell means. When you restrict the data for the above contrast, you are using only the cells with Ybar(1,1) and Ybar(1,2) to compute your variance estimate.
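To make this concrete, here is a small Python sketch (not SAS) with invented numbers. The data values, the `cells` layout keyed as (B level, A level), and the helper names are all assumptions for illustration. It computes the contrast Ybar(1,1)-Ybar(1,2) once with the residual variance pooled over all four cells and once with the variance pooled over the B(1) cells only.

```python
from statistics import mean

# Invented data: cells[(b, a)] holds the responses in row B(b), column A(a).
cells = {
    (1, 1): [10.0, 12.0, 14.0],
    (1, 2): [15.0, 17.0, 19.0],
    (2, 1): [20.0, 21.0, 28.0],
    (2, 2): [30.0, 38.0, 40.0],
}


def pooled_variance(groups):
    """Within-cell sum of squares over within-cell degrees of freedom."""
    ss = sum(sum((y - mean(g)) ** 2 for y in g) for g in groups)
    df = sum(len(g) - 1 for g in groups)
    return ss / df, df


def contrast_se(var, g1, g2):
    """Standard error of a difference between two cell means."""
    return (var * (1 / len(g1) + 1 / len(g2))) ** 0.5


g11, g12 = cells[(1, 1)], cells[(1, 2)]
estimate = mean(g11) - mean(g12)  # identical under both approaches

var_full, df_full = pooled_variance(list(cells.values()))
var_sub, df_sub = pooled_variance([g11, g12])

print("estimate:", estimate)
print("complete data: var", var_full, "df", df_full, "se", contrast_se(var_full, g11, g12))
print("B=B(1) only:   var", var_sub, "df", df_sub, "se", contrast_se(var_sub, g11, g12))
```

With these made-up numbers the point estimate is -5 either way, but the residual variance is 13.75 on 8 degrees of freedom from the complete data and 4.0 on 4 degrees of freedom from the restricted data, so the two test statistics differ.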

Now, for the contrast which we specified above, we used half of the cells from our complete data table. Because we have a continuous response, there is essentially a zero percent chance that the variance of the data which are included in the contrast of interest (B=B(1)) is the same as the variance of the data which are excluded when we focus in on the specified contrast. Relative to the complete-data estimate, there is roughly a 50% chance that we will overstate the residual variance and roughly a 50% chance that we will understate it.

I should concede that you will have a residual variance which has correct expectation. Thus, you will not bias your contrast toward acceptance or rejection just because you have restricted your data. But you will not obtain the same test statistic for your contrast as the person who has employed the complete data and constructed a contrast statement that reflects the complete data design. I would argue that the person who used the complete data to estimate their residual variance would have the better test simply because their estimate of the variance is more stable.
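The stability argument can be illustrated by simulation. The Python sketch below (not SAS) assumes a 2x2 layout with normally distributed errors and a common variance of 4; the cell size, the number of replicates, and the seed are arbitrary choices made for this illustration. For each simulated data set it computes the pooled residual variance from all four cells and from the first two cells only.

```python
import random
from statistics import mean, variance


def pooled_var(groups):
    """Within-cell sum of squares over within-cell degrees of freedom."""
    ss = sum(sum((y - mean(g)) ** 2 for y in g) for g in groups)
    df = sum(len(g) - 1 for g in groups)
    return ss / df


def simulate(reps=2000, n=10, sigma=2.0, seed=12345):
    """Return residual variance estimates from complete and restricted data."""
    rng = random.Random(seed)
    full, restricted = [], []
    for _ in range(reps):
        # Four cells of n observations each; the cell means do not affect
        # the within-cell variance, so they are all set to zero here.
        cells = [[rng.gauss(0.0, sigma) for _ in range(n)] for _ in range(4)]
        full.append(pooled_var(cells))            # all four cells
        restricted.append(pooled_var(cells[:2]))  # only the B=B(1) cells
    return full, restricted


full, restricted = simulate()
print("mean of estimates:    ", mean(full), mean(restricted))
print("variance of estimates:", variance(full), variance(restricted))
```

Both sets of estimates should average close to the true variance of 4, but the restricted-data estimates should be about twice as variable, since they rest on half the degrees of freedom.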

To this point, I have not even discussed designs where there is anything other than a cell means model which is fit. But when the complete data model is not a cell means model, then restricting the data to only some subset will produce point estimates which do not match the point estimates obtained when the complete data are employed. I won't elaborate on this right now, but just raise the issue for you to consider.
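As one hedged illustration of that issue, consider a main-effects-only model (A and B, no A*B) fit to a balanced 2x2 table. In the balanced case the least squares fitted value for a cell is the row mean plus the column mean minus the grand mean, which is what the Python sketch below (not SAS) uses. The cell means are invented and the helper name `additive_fit` is hypothetical.

```python
from statistics import mean

# Invented cell means for a balanced design, keyed as (B level, A level).
cell_means = {(1, 1): 12.0, (1, 2): 17.0, (2, 1): 23.0, (2, 2): 36.0}


def additive_fit(cell_means):
    """Fitted cell values under y = A + B (no interaction).

    Uses row mean + column mean - grand mean, which is the least squares
    fit only when every cell has the same number of observations.
    """
    grand = mean(cell_means.values())
    rows = sorted({b for b, _ in cell_means})
    cols = sorted({a for _, a in cell_means})
    row_mean = {b: mean(cell_means[(b, a)] for a in cols) for b in rows}
    col_mean = {a: mean(cell_means[(b, a)] for b in rows) for a in cols}
    return {(b, a): row_mean[b] + col_mean[a] - grand for (b, a) in cell_means}


fitted = additive_fit(cell_means)

# A(1) versus A(2) at B(1), from the main-effects model on the complete data
full_diff = fitted[(1, 1)] - fitted[(1, 2)]

# The same comparison after restricting the data to B=B(1)
restricted_diff = cell_means[(1, 1)] - cell_means[(1, 2)]

print(full_diff, restricted_diff)
```

Here the complete-data main-effects model gives -9 while the restricted data give -5, so the two approaches no longer agree even on the point estimate.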

Just to summarize, I would not want to be presented with contrast results which are based on restricting the data to some subset of the entire data set. In the best of circumstances (where your model is a cell means model and your contrast is a simple difference of cell means), you will have the correct point estimate for the numerator of your test but you will employ a poor estimate of the variance in the test statistic denominator. In other circumstances (where your model is not a simple cell means model), you will not even get the same point estimate for the numerator as would be obtained from complete data analysis.

Dale

---------------------------------------
Dale McLerran
Fred Hutchinson Cancer Research Center
mailto: dmclerra@NO_SPAMfhcrc.org
Ph: (206) 667-2926
Fax: (206) 667-5977
---------------------------------------

