Date: Wed, 18 Oct 2006 23:33:12 -0700
Reply-To: David L Cassell <davidlcassell@MSN.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: David L Cassell <davidlcassell@MSN.COM>
Subject: Re: Survey data analysis
In-Reply-To: <20061012235310.76348.qmail@web32205.mail.mud.yahoo.com>
Content-Type: text/plain; format=flowed
stringplayer_2@YAHOO.COM wrote:
>
>Hi all (and David in particular),
>
>Just when I think I have survived the summer from h***, I find
>myself stuck with an analysis problem where I simply don't know
>what I need to do. Following is a description of the data that
>I have.
Dale, it's good to hear from you again! I was getting kind of
worried, since you weren't answering private emails to your
stringplayer address.
Well, at least you weren't answering mine. :-) :-)
>Students in essentially all colleges and universities in a
>particular region of the country were surveyed about smoking
>behaviors before and after an intervention that took place in a
>randomly selected set of half of the schools. Within each school,
>surveys were administered to freshmen, sophomores, juniors, and
>seniors with different probabilities of selection. The probability
>of selection differed between baseline and follow-up.
>
>School size determined how many students received the survey.
>At most, 750 freshmen were selected at random to receive the
>survey in the baseline period. If there were fewer than 750
>freshmen in a particular school, then all freshmen received the
>survey. If there were more than 750 freshmen, then a random
>sample from the registrar's list was selected to receive the survey.
>For sophomores, juniors, and seniors, the maximum number to receive
>the survey in each class was 200. Just as with the freshmen, if
>there were fewer than 200 in a class, all in that class received
>the survey.
>
>Baseline freshmen who responded to the survey were sent the
>follow-up survey which was administered two years later following
>an intervention in some schools. In addition, a random sample
>of up to 200 from each class received the final survey. Again,
>if the class size was less than 200, then all in the class
>received the survey.
>
>The survey administration rates are very high, especially in smaller
>schools. Of course, the number of surveys returned is another matter.
>The number of surveys returned ranges from about 25% to 50% of the
>surveys administered.
>
>This study then combines elements of a complex sampling design with
>an experimental design. I believe that the sample design takes
>precedence over the experimental design. However, when it comes
>to analysis of data from a survey sample, I am a pure novice. With
>that in mind, I have a few questions.
>
> 1) What exactly is the design here? We have clusters (schools)
> and within each school at each survey time point we select
> students in a particular class without replacement.
Okay, I agree here. I doubt the survey instrument went to every
possible school, so there may be cluster issues at your first stage,
and there may be non-response issues at stage 1 also. (Are the
schools not surveyed different in meaningful ways from those that
participated? That may affect your definition of your target
population.)
Let's just start with the first time point only. We have to do a couple
of things:
[1] find out if all schools were surveyed;
[2] decide based on the subject-matter experts' opinions whether
to treat this as a random sampling process, or as a division into a
sampled/target population and an unsampled population;
[3] based on #2, decide how to adjust weights.
Now we move on to stage 2 and we sample students within schools.
You have what sounds like a stratified sample in each school, even if
the sample turns out to be a census some of the time.
> 2) How do I construct the weights when there is survey nonresponse?
> If there were no nonresponse, then weights would be calculated
> as the number of students in the school/class/time point
> combination divided by the number of surveys administered
> in the same combination, right? But when we have nonresponse,
Right. (Assuming equal selection probabilities.)
> my understanding is that the survey weights take a more
> complex form. We don't just use the number of returned surveys
> for a given school/class/time combination as the denominator
> when computing the survey weights. Is that correct?
This depends on how your Principal Investigator wants to treat the
non-response. My personal view is that in cases like this we have to
assume that the sampled population may be substantively different
from the non-response group. In that case, I typically try to get
samples from a small but random subset of the non-responders,
using whatever means are available (although sending them to Abu
Ghraib is usually considered a last resort. :-) so that we can make
some non-response bias adjustments. I'm usually way too far down
the line to get that. ("Are you nuts? We did that survey four years
ago, so it's way too late for that. What? So it took a while to get
the data put together and cleaned...") At that point, I usually
advocate treating the non-response group as a separate unsampled
part of the overall population. We estimate the size of that
sub-population, we caveat the reports as not having access to that
portion of the population, and we do estimates on the sampled
sub-population only.
So, if you go this route, you leave the weights alone because you
are shrinking the 'target' population. The sum of the responders'
weights is now your estimate of the size of the target population,
and the sum of the weights of the non-responders gives you the size
of that unsampled sub-population.
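To make that bookkeeping concrete, here is a minimal sketch in Python
rather than SAS. The counts are invented, the function names are
hypothetical, and it assumes equal selection probabilities within a
single school/class/time cell.

```python
def base_weight(population_size, n_administered):
    """Base weight for one school/class/time cell: N / n."""
    if n_administered <= 0:
        raise ValueError("n_administered must be positive")
    return population_size / n_administered


def split_population(population_size, n_administered, n_returned):
    """Leave the base weight alone and split the cell into the part
    represented by responders and the part treated as unsampled."""
    w = base_weight(population_size, n_administered)
    responders_total = w * n_returned
    nonresponders_total = w * (n_administered - n_returned)
    return w, responders_total, nonresponders_total


# Hypothetical cell: 1200 sophomores, 200 surveys sent, 60 returned.
w, resp, nonresp = split_population(1200, 200, 60)
print(w, resp, nonresp)  # 6.0 360.0 840.0
```

With these made-up numbers, the responders stand in for 360 students,
and the 840 students represented by the non-responders are reported as
the unsampled sub-population.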
The alternative is to pretend that the non-responders are exactly
like the responders, but just had a bad hair day or something and
couldn't come out of the bathroom to fill out the survey. In
that case, you end up having to adjust the survey weights upward.
Typically, it's done in a group-by-group fashion, where 'group' is
deliberately super-vague here because I usually try to aggregate to
the highest level that is reasonable (as decided by the subject-matter
experts and/or the survey design).
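One common way to do that kind of group-by-group adjustment is a
weighting-class adjustment: within each group, scale the responders'
weights so that they sum to the total weight of everyone sampled in
that group. A minimal Python sketch follows; the record layout
('group', 'weight', 'responded') and the numbers are assumptions for
illustration only, not anything from the actual study.

```python
from collections import defaultdict


def adjust_for_nonresponse(records):
    """Return responder records with weights inflated group by group.

    Each record is a dict with keys 'group', 'weight' (base weight)
    and 'responded' (bool). Non-responders are dropped from the result.
    """
    sampled_total = defaultdict(float)
    responder_total = defaultdict(float)
    for r in records:
        sampled_total[r["group"]] += r["weight"]
        if r["responded"]:
            responder_total[r["group"]] += r["weight"]

    adjusted = []
    for r in records:
        if r["responded"]:
            # Inflation factor: all sampled weight / responding weight.
            factor = sampled_total[r["group"]] / responder_total[r["group"]]
            adjusted.append({**r, "weight": r["weight"] * factor})
    return adjusted


# Hypothetical sample: group A has 4 students (2 respond),
# group B has 2 students (1 responds).
sample = (
    [{"group": "A", "weight": 6.0, "responded": i < 2} for i in range(4)]
    + [{"group": "B", "weight": 3.0, "responded": i < 1} for i in range(2)]
)
adjusted = adjust_for_nonresponse(sample)
print([r["weight"] for r in adjusted])  # [12.0, 12.0, 6.0]
```

The adjusted weights sum to 30.0, the same as the base weights of the
full sample, which is the point of the adjustment. A group with no
responders at all contributes nothing here, which is one reason to
aggregate groups to a reasonably high level.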
> 3) Based on the design that is specified for 1), what statements
> (and options?) are required for the SURVEYLOGISTIC procedure?
Well, the key point you need to have is that population totals,
CLUSTER variables, STRATA variables, etc. need to be based on stage 1
of the sample. The remaining variability is handled under the hood. But
the weights have to be computed for each stage of the sample,
adjusted for each stage, and then multiplied across stages to get a
weight that scales from student up to the region.
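As a rough illustration of multiplying across stages (in Python, with
hypothetical numbers: 32 of 40 schools participating, 200 of 1200
students sampled in one class, and half of those responding):

```python
from math import prod


def final_weight(stage_factors):
    """Multiply per-stage weights and adjustment factors into one weight."""
    return prod(stage_factors)


school_weight = 40 / 32         # stage 1: schools in region / schools participating
student_weight = 1200 / 200     # stage 2: students in class / surveys administered
nonresponse_factor = 200 / 100  # optional: administered / returned, if adjusting

weight = final_weight([school_weight, student_weight, nonresponse_factor])
print(weight)  # 15.0
```

Each responding student in this made-up cell would then represent 15
students in the region. Whether the stage 1 and non-response factors
belong in the product depends on the decisions discussed above.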
If you're going to be at PNWSUG in 12 days, we can talk about this
in more detail.
HTH,
David
--
David L. Cassell
mathematical statistician
Design Pathways
3115 NW Norwood Pl.
Corvallis OR 97330