**Date:** Thu, 20 Jun 2002 09:42:41 +0100
**Reply-To:** david.mcnulty@QUESTINTL.COM
**Sender:** "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
**From:** david.mcnulty@QUESTINTL.COM
**Subject:** Re: Untenable results from PROC GENMOD
**Content-type:** text/plain; charset=us-ascii
Good Morning Dale,

Thankyou for a very comprehensive reply. I can also use the template
bootstrapping data step for another problem that has been sitting on the
shelf.

Dave.

Dale McLerran
<stringplayer_2@ To: David Mcnulty/GB/QUES/ICI@ICI,
yahoo.com> SAS-L@LISTSERV.UGA.EDU
cc:
19/06/2002 18:44 Subject: Re: Untenable results from PROC GENMOD

David,

That does change the picture substantially! However, I do believe
that there is still be a problem with including subject as a
covariate. My guess is that there subjects who rate all 7 products
employed in the analysis as a success/failure under both conditions.
A logit cannot be computed for those subjects. I dummied up some
data which demonstrates the situation that you report.

data binary;
do phase=1 to 2;
beta1 = .75*(phase=1);
do product=1 to 7;
beta2 = .25*rannor(1234579);
do subject=1 to 29;
gamma = .5*rannor(1234579);
eta = gamma + beta1 + beta2;
p = exp(eta)/(1+exp(eta));
y = ranbin(1234579,1,p);
output;
end;
/* Subject 30 will have all failures */
subject=30;
gamma = -10;
eta = gamma + beta1 + beta2;
p = exp(eta)/(1+exp(eta));
y = ranbin(1234579,1,p);
output;
subject=31;
gamma = .5*rannor(1234579);
eta = gamma + beta1 + beta2;
p = exp(eta)/(1+exp(eta));
y = ranbin(1234579,1,p);
output;
end;
end;
run;

proc freq data=binary(where=(subject>=30));
tables y*subject;
run;

proc genmod data=binary;
class subject phase product;
model y = subject phase|product / dist=bin;
lsmeans phase|product;
run;

Because subject 30 has all failures, inclusion of the subject effect
in the model causes the subject 30 logit to be inestimable. When
the point estimate for subject 30 went above 25, the change in model
likelihood was sufficiently small for GENMOD to state that
convergence was achieved. At this point, the variance for subject
30 was very large, which adversely affects the variance of the
least squares means. In fact, the point estimate and variance for
subject 30 should both be set to infinity, but that really cannot
be done. Note that if you employed PROC LOGISTIC to fit the model,
you would have gotten a statement that quasi-complete separation
of the data points occurred and that model fit was questionable.

So, the question is what can you do at this point? Some might
suggest a GEE analysis. GEE's take into account the nonindependenc
of the responses among subjects who have multiple observations.
This is what you are attempting to do when you include subject as
a covariate. But GEE's do not require that you obtain a subject
effect estimate. The syntax below fits a GEE to the data generated
above.

proc genmod data=binary;
class subject phase product;
model y = phase|product / dist=bin;
repeated subject=subject;
lsmeans phase|product;
run;

GEE requires large samples for inference to be valid. Sample size
refers to the number of independent observations which is the number
of subjects employed in the analysis. For a very simple model,
N=31 would be sufficient to support a GEE. For a model in which
you estimate 14 parameters, N=31 is not sufficient.

Another approach which one might consider is a random effects logit
model. Random effects logits can be fit employing the procedure
NLMIXED or employing the GLIMMIX macro which ships with SAS.
However, the random effects logit conditions on subject, just as
you did in the analysis which you presented. Conditioning on
subject is what got you into trouble in the first place. So we
will have to forget the random effects (any conditional) model and
confine our attention to what are referred to as marginal models.

There is one marginal model which I could heartily recommend for
your present situation. The model does not rely on large sample
theory for its validity. Rather, it relies on randomization theory.
What I would urge you to consider is the use of a bootstrap
regression approach. In the bootstrap analysis, we randomly pick
with replacement Ni=31 subjects from the set of N=31 subjects. Some
subjects are selected several times and some subjects are not
selected for the bootstrap sample. Fit the logistic regression
to the bootstrap sample and output the statistics of interest. In
your case, you would output the LSMEANS and convert them to
probability estimates. Draw a new sample of 31 subjects and fit
the model again. Do this K=1000 times. Now, you can compute mean
and variance for each probability value arising from your model.
You can also compute a 95% confidence interval directly from the
1000*alpha/2 and 1001-1000*alpha/2 (25th and 976th) ordered
statistics.

The bootstrap does require approximately equal sample sizes for each
subject. In your case, this holds perfectly. The bootstrap does
not require large sample theory for its validity, so it is
preferable over the GEE. You might want to look at our paper
"A Comparison of Statistical Methods for Clustered Data Analysis
with Gaussian Error". Feng, Z., McLerran, D, and Grizzle, J.
Statistics in Medicine, 15, 1793-1806 (1996). While that paper
discusses statistical models for gaussian data, the results hold
for binary response as well. We performed the simulation to
examine the various statistical models in the binary case, but we
were letting a visiting faculty member take the lead on writing up
results for the binary case. The paper never got written.

You could fit a bootstrap model employing code something like the
following (note that I have not tested the datastep below):

data bootstrap;
set mydata end=lastrec;
by subject;
array response {31,16} _temporary_;
position = 8*(phase-1) + product; /* For phase coded 1,2 */
/* Use 8*phase for phase=0,1 */
response{subject,position} = correct;
if lastrec then do;
do bootstrap=1 to 1000;
do i=1 to 31;
subject = ceil(31*ranuni(seed)); /* Specify SEED */
do j=1 to 16;
correct = response{subject,j};
phase = 1 + (j>8);
product = j - 8*(phase-1);
output;
end;
end;
end;
end;
run;

ods listing close;
ods output lsmeans=lsmeans;
proc genmod data=bootstrap;
by bootstrap;
model correct = phase|product / dist=bin;
lsmeans phase|product;
run;

At this point, I am too lazy to figure out just what the LSMEANS
dataset will look like. But you will want to convert the LSMEANS
to probability estimates in a datastep and then run PROC UNIVARIATE
against the probability estimates, requesting the 2.5th and 97.5th
percentiles of the distributions along with mean and variance.

Dale

--- david.mcnulty@QUESTINTL.COM wrote:
> Hi Dale,
>
> My apologies I have inadvertently mislead you.
>
> Further Info
>
> All 31 subjects test 8 products under condition 1 (scoring 1=success
> 0=fail), has a break and re-tests all 8 products under condition 2.
> Conditions 1 and 2 cannot be randomised. Total observations = 31*8*2
> = 496
> (434 ignoring Prod_8).
>
> Dave.
>

=====
---------------------------------------
Dale McLerran
Fred Hutchinson Cancer Research Center
mailto: dmclerra@fhcrc.org
Ph: (206) 667-2926
Fax: (206) 667-5977
---------------------------------------

__________________________________________________
Do You Yahoo!?
Yahoo! - Official partner of 2002 FIFA World Cup
http://fifaworldcup.yahoo.com

IMPORTANT NOTICE:
This email is confidential, may be legally privileged, and is for the
intended recipient only. Access, disclosure, copying, distribution, or
reliance on any of it by anyone else is prohibited and may be a criminal
offence. Please delete if obtained in error and email confirmation to the
sender.