```Date: Sat, 4 Mar 2000 02:23:38 +0000 Reply-To: John Whittington Sender: "SAS(r) Discussion" From: John Whittington Subject: Re: Ordered Random Extract Comments: To: sashole@mediaone.net Comments: cc: dnordlund@AOL.COM In-Reply-To: <200003032047.UAA14949@vicar.netnames.net> Content-Type: text/plain; charset="us-ascii" At 20:46 03/03/00 GMT, Paul Dorfman wrote: >Try this: > >19 %let n = 1e10; >20 data _null_; >21 array r (500000) _temporary_; >22 s = &n / dim(r); >23 do i=1 to dim(r); >24 r(i) = (i-1)*s + ceil(ranuni(1)*s); >25 end; >26 run; >NOTE: The DATA statement used 0.02 CPU seconds and 7062K. > >The idea here is to split the entire [1:1e10] range into 5e5 equal pieces >and make the next selection from the next consecutive interval containing >&n/dim(r) numbers. By the nature of the algorithm, the selected items will >be unique and ascending. Intuitively, it should provide for a strictly equal >chance for any [1:1e10] number to be selected if &n is a multiple of dim(r). >If not, there should be some skewness pertaining to the choice from the last >interval, but I would not expect it to be sizable. Someone please correct me >if my hunch is out of whack (Dr. John? David?). Paul, Dan seems to have beaten me to it, and said most of what I would have said. As he points out, your code will produce a stratified random sample. This might well be OK for the John's purpose, but it will, 'on average', be a much more 'evenly spread' sample than would (again 'on average') be the case with a simple random sample. One of the illusions that many Lottery players are under is that a random sample ought to be fairly 'uniform' in its spread! Dan goes on to write: >My initial thought was to use a without-replacement sampling scheme. Something >like: > >%let total=1e10; >%let wanted=500000; > >n=0; >needed=&wanted; >remain=total; >do i=1 to &total until(needed eq 0); > if ranuni(-1) le needed/remain > then do; > n+1; > r[n]=i; > needed+(-1); > end; > remain+(-1); >end; > >This will do the job of a simple random sample, but it not very efficient. It >would have taken over 10 hours on my 133Mhz pc. Maybe someone else can produce >a more efficient simple random sample. On the other hand if time isn't a >problem, this works. Apart from a missing amersand (remain=&total) I agree - but, just like Dan, I am using a 133 Mhz PC, and so would find this more than a little tedious! As always in these situations, it's the 'without replacement' which is the pain. For 'with replacement' one could simply use something like: data _null_; array r (500000) _temporary_; do i=1 to dim(r); r(i) = ceil(ranuni(whatever)*1e10) ; end; run; Ironically, since we are sampling 500,000 from 1e10, the chance of any value being repeated is pretty tiny - so, in this particular case, the 'with replacement' solution stands a very high chance of effectively being a 'without replacement' one. However, that obviously is not a certainty. If "Paul's favourite array sorting routine" can be easily adapted to remove duplicates, then one might consider something like (untested!): data _null_ ; array s (600000) _temporary_ ; array r (500000) _temporary_ ; do i=1 to dim(s); s(i) = ceil(ranuni(whatever)*1e10) ; end; do i = 1 to dim(r) ; do until r(i) ne . ; r(i) = s(ranuni(whatever)*dim(s)) ; end ; run; Of course, if one had adequate resources to set up a 1e10 temporary array, the one might consider something like (again untested): data _null_; array r (500000) _temporary_; array s (1e10) _temporary_ ; do i=1 to dim(r); OK = 0 ; do until OK = 1 ; r(i) = ceil(ranuni(whatever)*1e10) ; if s(r(i)) = . then do ; OK = 1 ; s(r(i)) = 1 ; end ; end ; end; run; I think that exhausts my immediate thoughts! Kind Regards, John ---------------------------------------------------------------- Dr John Whittington, Voice: +44 (0) 1296 730225 Mediscience Services Fax: +44 (0) 1296 738893 Twyford Manor, Twyford, E-mail: John.W@mediscience.co.uk Buckingham MK18 4EL, UK mediscience@compuserve.com ---------------------------------------------------------------- ```

Back to: Top of message | Previous page | Main SAS-L page