LISTSERV at the University of Georgia
Date:         Wed, 9 Apr 2003 11:19:19 -0700
Reply-To:     cassell.david@EPAMAIL.EPA.GOV
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         "David L. Cassell" <cassell.david@EPAMAIL.EPA.GOV>
Subject:      Re: Sampling Question
Content-type: text/plain; charset=us-ascii

Action Man <wollo_desse@HOTMAIL.COM> wrote:
> I have 7,000 records in my SAS file. Out of these records I want to pick 500
> of them randomly, How do I do that using SAS.

Hamani Elmaache and John Ladds both gave the traditional, inefficient answer of "create a random variable, sort on it, take the first n". For small data sets, this doesn't use up much CPU or wallclock time, but it simply is not necessary.

Paul Dorfman pointed out:
> You will no doubt get (or have already gotten) plenty of advice how to do it
> using the "standard" K/N method, where K is the sample size and N is the
> population size. It is based on reading all N records from the population
> file. It is plenty sufficient and fast in your case, where N=7000 only and
> K=500 is not a tiny fraction of N.
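For readers who have not seen the K/N method Paul mentions, here is a minimal sketch of it in Python rather than SAS, purely as an illustration. The function name select_k_of_n and its arguments are my own, not anything from Paul's post or from SAS. The idea is to read the records once and keep each one with probability (still needed) / (still remaining), which yields exactly K distinct records.

```python
import random


def select_k_of_n(records, k, seed=None):
    """Selection sampling ("K/N method"): one pass, without replacement.

    Hypothetical helper for illustration only.
    """
    records = list(records)
    remaining = len(records)      # N: records not yet examined
    needed = k                    # K: records still to be selected
    if not 0 <= k <= remaining:
        raise ValueError("k must be between 0 and the number of records")

    rng = random.Random(seed)     # a fixed seed makes the sample repeatable
    sample = []
    for rec in records:
        if needed == 0:
            break                 # sample is complete, stop reading
        # Keep this record with probability needed / remaining.
        if rng.random() * remaining < needed:
            sample.append(rec)
            needed -= 1
        remaining -= 1
    return sample


picked = select_k_of_n(range(1, 7001), 500, seed=12345)
print(len(picked))
```

Note that every record may have to be read, which is why this is fine for N=7000 but less attractive when K is a tiny fraction of a very large N.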

Dale McLerran has written a macro to address this for a small n and large N, using POINT= to select out only the needed records. He has published this on SAS-L, so it is in the archives.

Paul then presented a very nice bit of code to pull out a sample, but ended up with simple random sampling *WITH* replacement, which the PROC SURVEYSELECT docs call unrestricted random sampling (METHOD=URS).

Dale McLerran replied:
> The algorithm which you are proposing performs sampling with
> replacement: a record may be picked more than one time. Now,
> there are occasions when we do want to do sampling with
> replacement, but my guess is that this is not one of them. In
> order to guarantee sampling without replacement (records can
> only be selected once), one could use a hash, right?

As a matter of fact, this is what PROC SURVEYSELECT does under the hood. For simple random sampling problems, it can use Floyd's ordered hash-table algorithm, which is considered a very good choice for large data sets. The standard references on this algorithm are:

Bentley, J.L. and Floyd, R. (1987), "A Sample of Brilliance," Communications of the Association for Computing Machinery, 30, 754-757.

Bentley, J.L. and Knuth, D. (1986), "Literate Programming," Communications of the Association for Computing Machinery, 29, 364-369.
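To show why a hash gives sampling without replacement, here is a rough Python sketch of the sampling idea from the Bentley and Floyd paper. It is an illustration under my own assumptions, not the PROC SURVEYSELECT implementation: the name floyd_sample is made up, and a plain Python set stands in for the ordered hash table. It picks K distinct record numbers out of 1..N with only K random draws and never reads the unselected records.

```python
import random


def floyd_sample(n, k, seed=None):
    """Pick k distinct record numbers from 1..n (Floyd's method).

    Hypothetical helper for illustration only.
    """
    if not 0 <= k <= n:
        raise ValueError("k must be between 0 and n")

    rng = random.Random(seed)
    chosen = set()                    # stands in for the hash table
    for j in range(n - k + 1, n + 1):
        t = rng.randint(1, j)         # uniform on 1..j, inclusive
        if t in chosen:
            chosen.add(j)             # collision: j cannot be in the set yet
        else:
            chosen.add(t)
    return sorted(chosen)             # ordered list of record numbers


rows = floyd_sample(7000, 500, seed=12345)
print(len(rows))
```

Each pass through the loop adds exactly one new record number, so the result always has K distinct entries. The record numbers could then be used for direct access, much as the POINT= approach does.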

So use PROC SURVEYSELECT. It is:
[1] fast
[2] efficient
[3] safe from programming errors (sorry, Paul! :-)
[4] simple
[5] easier to validate or unit-test

Just say:

proc surveyselect data=MyInData out=MySample sampsize=500;
   id <list of vars to drag along>;
   run;

Note that I did not even specify METHOD=SRS, since that is the default when no SIZE variable is given. I also did not specify a SEED= option. PROC SURVEYSELECT will generate a random seed and print it out in the output, so you can always re-create the sample if need be. Actually I recommend you select your own random seed, but I did it this way solely for instructional purposes.

Now what could be simpler than this? You don't need to compute or indicate the size of the population. You don't need to worry about generating your own random seed. You don't need to worry about the accuracy of the algorithm or the chance of errors in your code.

PROC SURVEYSELECT: "We do more samples before 9 am than most people do all day."

HTH,
David
--
David Cassell, CSC
Senior computing specialist
mathematical statistician
