Date: Fri, 10 Dec 1999 08:27:33 -0800
Reply-To: "Lund, Pete" <Peter.Lund@CFC.WA.GOV>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "Lund, Pete" <Peter.Lund@CFC.WA.GOV>
Subject: Re: a random sample. I published 2 macro program ...
Content-Type: text/plain; charset="windows-1252"
Ian brings up a good point that's been mentioned a few times on SAS-L over
the years: the pitfalls of creating variables in macro code that may
collide with variables in the incoming dataset(s) [Ian's point #3]. In this
case, the variable X is not the only variable that causes problems: V, J, I
and NLOBS cannot be on the incoming dataset if you expect predictable
results. The convention of using variable names with leading underscores
(e.g., _X) that are dropped in the macro code can solve many of these
problems [Ian's point #2].
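A minimal sketch of that convention, for illustration only. The macro name
%FLAG_HALF, its parameters, and the output variable PICKED are hypothetical
and are not taken from the published macros:

```sas
/* Hypothetical macro: flags roughly half of the observations.       */
/* The work variable _X is dropped, so it never reaches &out, and a  */
/* variable named X on the incoming data set is left untouched.      */
%macro flag_half ( data = , out = , seed = 0 ) ;
   data &out ( drop = _x ) ;
      set &data ;
      _x = ranuni ( &seed ) ;      /* work variable, underscore prefix */
      picked = ( _x < 0.5 ) ;      /* the one documented new variable  */
   run ;
%mend flag_half ;

/* The input deliberately carries a variable named X. */
data w ; do x = 1 to 100 ; output ; end ; run ;
%flag_half ( data = w , out = w2 )
```

This only reduces the risk: an incoming data set that already has a
variable named _X would still collide.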
Also, this is a good example of the effects of a non-zero seed to RANUNI():
a fixed non-zero seed reproduces the same stream of random numbers on every
call [Ian's point #1].
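A quick way to see the seed effect, as a sketch. The seed value 12345 is
an arbitrary assumption; I don't know what seed the published macro uses:

```sas
/* Same positive seed in two separate steps: identical streams. */
data fixed1 ; do i = 1 to 5 ; u = ranuni ( 12345 ) ; output ; end ; run ;
data fixed2 ; do i = 1 to 5 ; u = ranuni ( 12345 ) ; output ; end ; run ;
proc compare data = fixed1 compare = fixed2 ; run ;

/* Seed 0 takes the starting value from the system clock, so these */
/* two steps would normally produce different streams.             */
data clock1 ; do i = 1 to 5 ; u = ranuni ( 0 ) ; output ; end ; run ;
data clock2 ; do i = 1 to 5 ; u = ranuni ( 0 ) ; output ; end ; run ;
proc compare data = clock1 compare = clock2 ; run ;
```

The first PROC COMPARE should report no differences; the second is
expected to report unequal values of U.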
Thought this was timely as we'd just had some discussion of coding
standards.
Pete Lund
WA State Caseload Forecast Council
(360) 902-0086 voice
(360) 902-0084 fax
peter.lund@cfc.wa.gov
-----Original Message-----
From: WHITLOI1 [mailto:WHITLOI1@WESTAT.COM]
Sent: Friday, December 10, 1999 6:21 AM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: a random sample. I published 2 macro program ...
Subject: Re: a random sample. I published 2 macro program ...
Summary: Problems with the code.
Respondent: Ian Whitlock <whitloi1@westat.com>
Renaud Harduin <r.harduin@ABS-TECHNOLOGIES.COM> offered two programs on a
popular subject - drawing random samples. He wrote
> Go to the www.SAShelp.com web site, I published 2 macro program :
>
> %ECH_SPLE : simple random sample (optimized in I/O, MEM and CPU)
> with distinct observation ==> Efficency
> %ECH_ALEA : Make a stratified random sample but requires more I/O
> and CPU
I looked at the first program and found the following problems:
1) For any two "random" samples from a given data set generated
by this program, the larger sample will contain the smaller
sample. For example the code,
data w ; do s = 1 to 100 ; output ; end ; run ;
%ech_sple ( data = w , out = s10 , size = 10 )
%ech_sple ( data = w , out = s23 , size = 23 )
proc compare data = s10 compare = s23 ( obs = 10 ) ; run ;
produced a report with no differences found.
2) The variables I, J, and DSID are on the output sample.
3) The variable X cannot be on the input data set.
4) The last record can never be in the sample.
5) The probability of choosing the 0th obs (there isn't any)
is 1/sample_size.
6) The number of logical obs is referenced but the program can
produce an incorrect result for every logically missing
observation.
7) Duplicate choices must be eliminated in a subsequent step.
8) On efficiency - a nonworking linear search was used.
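For contrast, here is a minimal sketch of one standard alternative,
sequential selection sampling (each record is kept with probability
"still needed" / "still left"). It is not the %ECH_SPLE code. The macro
name %SRS_SKETCH and its parameters are hypothetical, and the sketch
assumes SIZE does not exceed the number of observations and that the
input has no logically deleted observations (NOBS= would count them):

```sas
%macro srs_sketch ( data = , out = , size = , seed = 0 ) ;
   data &out ( drop = _k _n ) ;
      retain _k &size ;                 /* records still needed         */
      retain _n ;                       /* records still left to read   */
      if _n_ = 1 then _n = _total ;     /* NOBS= is set at compile time */
      set &data nobs = _total ;
      if ranuni ( &seed ) <= _k / _n then do ;
         output ;                       /* each record read only once,  */
         _k = _k - 1 ;                  /* so no duplicates to remove   */
      end ;
      _n = _n - 1 ;
      if _k = 0 then stop ;             /* sample complete              */
   run ;
%mend srs_sketch ;

data w ; do s = 1 to 100 ; output ; end ; run ;
%srs_sketch ( data = w , out = s10 , size = 10 )
%srs_sketch ( data = w , out = s23 , size = 23 )
```

With the default seed of 0 the two samples are drawn independently, the
last record can be chosen, and the work variables _K and _N are dropped
from the output.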
I didn't look at the second macro.
The site itself is impressive although I did get a glimmer of why the
SAS Institute objects to sites using the SAS name. It is unfortunate
that the quality of the programs is not monitored. This does not mean
the other 93 tips/programs have the same quality; I didn't look at them.
I can go along with the SAS-L rationale that discussion must be free
and open, hence code posted need not work. In this context the
reader has a clear warning. But I find it frightening to see a
professional looking web site without any monitoring of the quality
of posted programs.
Ian Whitlock