Date: Tue, 23 Mar 1999 11:01:52 -0800
Reply-To: "David L. Cassell" <cassell@MERCURY.COR.EPA.GOV>
Sender: "SAS(r) Discussion" <SAS-L@UGA.CC.UGA.EDU>
From: "David L. Cassell" <cassell@MERCURY.COR.EPA.GOV>
Organization: OAO Corp.
Subject: Re: Assistance with stratified sample
Content-Type: text/plain; charset=us-ascii
douglas_19@hotmail.com wrote:
> Hi,
> I'm hoping for some help in a sampling issue.
>
> I am trying to establish a stratified sample of 5000 males and 5000 females
> from a population database.
>
> I have written some code that will allocate a random ID to each SSN that is
> used, however I would like some assistance with stratification.
Let me make a suggestion first: reconsider this stratification structure.
Three values crossed with ten environment codes is 30 strata for each sex.
That is a lot of stratification, and you haven't indicated any advantage
to using strata at all. If you read Cochran, you'll see that the reason
people usually give for stratifying is not listed there. If you just want
to report on these variables, don't stratify. You can always post-stratify
the data afterward.
Olsen and Urquhart presented a paper at ASA a few years ago showing
that if your misclassification error on your strata is even 20%, you
are doing worse than simple random sampling (in terms of error
variance). Can you guarantee that your strata are this accurate?
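If you go the post-stratification route, the weighting step is short. Here
is a rough sketch, assuming you have already drawn a simple random sample
into a dataset I'll call SRS, and that your two classification variables
are named VALUE and ENVCODE. All three names are made up here, since you
didn't give yours.

```sas
/* Hypothetical names: SRS (the sample), VALUE, ENVCODE. */

/* Cell counts in the full population file */
proc freq data=famqtr noprint;
  tables value*envcode / out=popcells(drop=percent rename=(count=popn));
run;

/* Cell counts in the sample */
proc freq data=srs noprint;
  tables value*envcode / out=smpcells(drop=percent rename=(count=sampn));
run;

/* Post-stratification weight = population count / sample count */
data pswgt;
  merge popcells smpcells;
  by value envcode;
  if sampn > 0 then wgt = popn / sampn;
run;
```

Any cell that turns up empty in the sample gets a missing weight here and
would have to be collapsed with a neighboring cell before use.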
> the code that I use for allocation of ID is
> DATA SAMPLE;
> SET FAMQTR;
> ID= RANUNI(1);
> PROC SORT DATA=SAMPLE OUT=SORTED01;
> BY ID;
>
> I would like to stratify by two variables, values (x y z) and Environment
> codes (A through to J).
>
> The ratio in percentages for the values are x=38% y=42% z= 20% and the
> environments are A=12 B=14 C=6 D=8 E=7 F=16 G=9 H=11 I = 9 J=8%.
Hmm, I suspect that with categories broken out this finely, your
misclassification rate may be worse than you expect, especially once you
start classifying by 'environment' codes.
> I've been told that this can be done using LAG but I have no experience in
> this.
>
> The outcome that I'm after is 2 SAS datasets that contain the records for
> each sex.
Do you need to do your selection before or after separating by sex?
If before, then you have *three* classification variables, although you
probably have considerably less error on the 'sex' classification than on
the other two.
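If the answer is 'after', your random-ID approach already gets you most of
the way to a simple random sample within each sex: sort by sex and then by
the random number, and keep the first 5000 records in each group. A sketch,
assuming the sex variable is called SEX and is coded 'M'/'F' (both are
guesses on my part):

```sas
/* Hypothetical: variable SEX with values 'M' and 'F'. */
data sample;
  set famqtr;
  id = ranuni(1);          /* random sort key, fixed seed */
run;

proc sort data=sample out=sorted01;
  by sex id;
run;

data males females;
  set sorted01;
  by sex;
  if first.sex then n = 0; /* restart the counter for each sex */
  n + 1;                   /* sum statement, retained across rows */
  if n <= 5000 then do;
    if sex = 'M' then output males;
    else if sex = 'F' then output females;
  end;
  drop n;
run;
```

If either sex has fewer than 5000 records in FAMQTR, that dataset simply
ends up with all of them.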
Bear in mind that value and environment may not be statistically independent,
so you can't assume that the product of proportions is the proportion for
the cross of the two variables.
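To put a number on it: if value and environment were independent, the
x-by-A cell would hold 0.38 * 0.12 = 4.56% of the records, or about 228
out of each sample of 5000. Whether the real cell comes anywhere near that
is easy to check. A sketch, with VALUE and ENVCODE as stand-in variable
names:

```sas
/* Hypothetical variable names: VALUE, ENVCODE. */
proc freq data=famqtr;
  tables value*envcode / chisq expected;
run;
```

EXPECTED prints the cell counts implied by independence next to the
observed counts, and CHISQ tests whether the two differ by more than
chance. With a large file even small departures will test as significant,
so look at the size of the differences and not just the p-value.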
David
--
David L. Cassell, OAO cassell@mail.cor.epa.gov
Senior computing specialist
mathematical statistician