**Date:** Wed, 9 Apr 2003 11:19:19 -0700
**Reply-To:** cassell.david@EPAMAIL.EPA.GOV
**Sender:** "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
**From:** "David L. Cassell" <cassell.david@EPAMAIL.EPA.GOV>
**Subject:** Re: Sampling Question
**Content-type:** text/plain; charset=us-ascii
Action Man <wollo_desse@HOTMAIL.COM> wrote:
> I have 7,000 records in my SAS file. Out of these records I want to pick 500
> of them randomly, How do I do that using SAS.

Hamani Elmaache and John Ladds both gave the traditional, inefficient
answer of "create a random variable, sort on it, take the first n".
For small data sets, this doesn't use up much CPU or wallclock time,
but it simply is not necessary.
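
For reference, the sort-based approach looks roughly like the sketch below. It is my reconstruction, not the code actually posted; the data set names follow the example later in this message, and the work data set _keyed, the variable _u, and the seed 12345 are made up for illustration.

```sas
/* Sketch of "random key, sort, keep first n" (illustrative only). */
data _keyed;
  set MyInData;
  _u = ranuni(12345);        /* hypothetical seed; uniform random sort key */
run;

proc sort data=_keyed;       /* this sort is the cost the method pays */
  by _u;
run;

data MySample;
  set _keyed(obs=500);       /* first 500 records in random order */
  drop _u;
run;
```

It gives a valid simple random sample without replacement, but it needs an extra pass and a full sort of the population to do it.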

Paul Dorfman pointed out:
> You will no doubt get (or have already gotten) plenty of advice how to do it
> using the "standard" K/N method, where K is the sample size and N is the
> population size. It is based on reading all N records from the population
> file. It is plenty sufficient and fast in your case, where N=7000 only and
> K=500 is not a tiny fraction of N.
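
A minimal sketch of the K/N method Paul describes, assuming MyInData is an ordinary SAS data set so that NOBS= is available. The variable names k and n_left and the seed are my own choices for illustration, not Paul's code.

```sas
/* K/N sketch: one sequential pass, selects exactly 500 records. */
data MySample;
  retain k 500;                    /* records still needed */
  set MyInData nobs=n;             /* n = population size */
  n_left = n - _n_ + 1;            /* records not yet passed, this one included */
  if ranuni(12345) < k / n_left then do;   /* hypothetical seed */
    output;
    k = k - 1;
  end;
  drop k n_left;
run;
```

Each record is kept with probability (records still needed) / (records still available), so the step outputs exactly 500 records: once k reaches 0 nothing more is selected, and when k equals n_left every remaining record is taken.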

Dale McLerran has written a macro to address this for a small n and
large N, using POINT= to select out only the needed records. He has
published this on SAS-L, so it is in the archives.

Paul then presented a very nice bit of code to pull out a sample,
but ended up with simple random sampling *WITH* replacement, which
the PROC SURVEYSELECT docs call unrestricted random sampling
(METHOD=URS).

Dale McLerran replied:
> The algorithm which you are proposing performs sampling with
> replacement: a record may be picked more than one time. Now,
> there are occasions when we do want to do sampling with
> replacement, but my guess is that this is not one of them. In
> order to guarantee sampling without replacement (records can
> only be selected once), one could use a hash, right?

As a matter of fact, this is what PROC SURVEYSELECT does under the
hood. For simple random sampling problems, it can use Floyd's
ordered hash-table algorithm, which is considered a very good choice
for large data sets. The standard references on this algorithm are:

Bentley, J.L. and Floyd, R. (1987), "A Sample of Brilliance,"
Communications of the Association for Computing Machinery, 30, 754-757.

Bentley, J.L. and Knuth, D. (1986), "Literate Programming,"
Communications of the Association for Computing Machinery, 29, 364-369.
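
To make the idea concrete, here is a rough DATA step sketch of Floyd's algorithm. It is NOT what PROC SURVEYSELECT actually runs internally. It assumes the SAS 9 hash object is available, that MyInData has at least 500 records, and that MyInData has no variables named _t, _j, _rc, or n; those names and the seed are invented for the example. It fetches the chosen records with POINT=.

```sas
/* Floyd's algorithm sketch: 500 distinct record numbers, then direct reads. */
data MySample;
  length _t 8;
  declare hash picked(ordered: 'a');   /* keep chosen record numbers sorted */
  picked.defineKey('_t');
  picked.defineData('_t');
  picked.defineDone();
  declare hiter it('picked');

  do _j = n - 500 + 1 to n;            /* n comes from NOBS= on the SET below */
    _t = ceil(ranuni(12345) * _j);     /* random integer in 1.._j */
    if picked.check() = 0 then _t = _j;  /* already chosen: take _j instead */
    picked.add();
  end;

  _rc = it.first();
  do while (_rc = 0);
    _p = _t;                           /* copy the key for use as pointer */
    set MyInData point=_p nobs=n;      /* read only the selected record */
    output;
    _rc = it.next();
  end;
  stop;                                /* required with POINT= */
  drop _t _j _rc;
run;
```

The loop makes exactly 500 draws and never has to redraw: when a draw collides with an earlier pick, _j itself cannot already be in the table, because every earlier pick was at most _j - 1.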

So use PROC SURVEYSELECT. It is:
[1] fast
[2] efficient
[3] safe from programming errors (sorry, Paul! :-)
[4] simple
[5] easier to validate or unit-test

Just say:

proc surveyselect data=MyInData out=MySample sampsize=500;
id <list of vars to drag along>;
run;

Note that I did not even specify METHOD=SRS, since that is the default
when no SIZE statement is given. I also did not specify a SEED= option.
PROC SURVEYSELECT will generate a random seed and print it out in the
output, so you can always re-create the sample if need be.
Actually I recommend you select your own random seed, but I did it
this way solely for instructional purposes.
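
With an explicit seed, the call would look something like this. The seed value is arbitrary and chosen only for illustration; METHOD=SRS is spelled out here just for clarity.

```sas
/* Same sample every run, because the seed is fixed. */
proc surveyselect data=MyInData out=MySample
                  method=srs sampsize=500
                  seed=20030409;     /* arbitrary illustrative seed */
run;
```

Without an ID statement, all variables from the input data set are carried into the output sample.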

Now what could be simpler than this? You don't need to compute or
indicate the size of the population. You don't need to worry about
generating your own random seed. You don't need to worry about the
accuracy of the algorithm or the chance of errors in your code.

PROC SURVEYSELECT:
"We do more samples before 9 am than most people do all day."

HTH,
David
--
David Cassell, CSC
Cassell.David@epa.gov
Senior computing specialist
mathematical statistician