LISTSERV at the University of Georgia
Date:         Tue, 23 Mar 1999 11:01:52 -0800
Reply-To:     "David L. Cassell" <cassell@MERCURY.COR.EPA.GOV>
Sender:       "SAS(r) Discussion" <SAS-L@UGA.CC.UGA.EDU>
From:         "David L. Cassell" <cassell@MERCURY.COR.EPA.GOV>
Organization: OAO Corp.
Subject:      Re: Assistance with stratified sample
Content-Type: text/plain; charset=us-ascii

douglas_19@hotmail.com wrote:
> Hi,
> I'm hoping for some help in a sampling issue.
>
> I am trying to establish a stratified sample of 5000 males and 5000 females
> from a population database.
>
> I have written some code that will allocate a random ID to each SSN that is
> used, however I would like some assistance with stratification.

Let me make a suggestion first. Reconsider using this stratification structure. This seems to be a lot of stratification, and you haven't indicated any advantage to using any strata. If you read Cochran, you'll see that the usual reason chosen for stratification is not listed there. If you just want to report on these variables, don't stratify. You can always post-stratify the data afterward.

Olsen and Urquhart presented a paper at ASA a few years ago showing that if your misclassification error on your strata is even 20%, you are doing worse than simple random sampling (in terms of error variance). Can you guarantee that your strata are this accurate?

> the code that I use for allocation of ID is
> DATA SAMPLE;
> SET FAMQTR;
> ID= RANUNI(1);
> PROC SORT DATA=SAMPLE OUT=SORTED01;
> BY ID;
>
> I would like to stratify by two variables,values (x y z) and Environment codes
> (A through to J).
>
> The ratio in percentages for the values are x=38% y=42% z= 20% and the
> environments are A=12 B=14 C=6 D=8 E=7 F=16 G=9 H=11 I = 9 J=8%.

Hmm, I suspect that with categories broken out this fine, your misclassification rate may be worse than you expect. Especially when you start classifying by 'environment' codes.

> I've been told that this can be done using LAG but I have no experience in
> this.
>
> The outcome that I'm after is 2 SAS datasets that contain the records for each
> sex.

Are you needing to do your selection before or after separating by sex? If before, then you have *three* classification variables, although you probably have considerably less error on the 'sex' classification than on the other two.
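If sex is the only classification you select on, a simple random sample of 5000 within each sex needs no LAG at all. The following is only a minimal sketch built on the RANUNI/PROC SORT code quoted above. It assumes FAMQTR carries a variable named SEX coded 'M' and 'F' (the variable name and its coding are my assumptions, not something stated in the original post), and that each sex has at least 5000 records.

```sas
/* Sketch only: SEX and its 'M'/'F' coding are assumed, not from the post. */
data sample;
    set famqtr;
    id = ranuni(1);            /* same fixed seed as in the quoted code */
run;

/* Sort by sex first, then by the random ID within each sex. */
proc sort data=sample out=sorted01;
    by sex id;
run;

/* Keep the 5000 records with the smallest random IDs in each sex group */
/* and route them to two separate datasets.                             */
data males females;
    set sorted01;
    by sex;
    if first.sex then n = 0;   /* restart the counter for each sex */
    n + 1;                     /* sum statement: N is retained across rows */
    if n <= 5000;              /* subsetting IF: drop everything past 5000 */
    if sex = 'M' then output males;
    else if sex = 'F' then output females;
    drop n;
run;
```

Taking the first 5000 records in random-ID order within a group is a simple random sample without replacement from that group. Stratifying further by value and environment would mean adding those variables to the BY statement and replacing the fixed 5000 with a per-stratum sample size.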

Bear in mind that value and environment may not be statistically independent, so you can't assume that the product of proportions is the proportion for the cross of the two variables.
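As an illustration using the percentages quoted above: under independence the x-by-A cell would hold 0.38 * 0.12 = 0.0456, or about 4.6%, of the population, but the actual share could be quite different. One way to see the real cell proportions is to cross-tabulate the population file directly. This is a rough sketch only; VALUE and ENV are hypothetical variable names standing in for whatever the two classification variables are called in FAMQTR.

```sas
/* Sketch only: VALUE and ENV are assumed variable names. */
proc freq data=famqtr;
    /* The cell percents give the actual joint proportions;      */
    /* CHISQ adds a chi-square test of independence of the two.  */
    tables value*env / chisq;
run;
```

The overall percent printed in each cell is the figure to use for a cross-classified allocation, rather than the product of the two marginal percentages.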

David
--
David L. Cassell, OAO                     cassell@mail.cor.epa.gov
Senior computing specialist
mathematical statistician

