```Date: Mon, 6 Mar 2000 15:58:09 +0000 Reply-To: tra Sender: "SAS(r) Discussion" Comments: To: John Whittington Comments: cc: "Dorfman, Paul" From: tra Organization: Proteus Molecular Design Ltd Subject: Re: Ordered Random Extract Comments: To: SAS-L@LISTSERV.VT.EDU Content-Type: text/plain; charset=us-ascii Paul, John I only saw this thread today, so forgive me if I am recovering old ground. It seems to me that you want a way to generate the selection in order, item by item. You can do this if you can calculate the probability distribution of (how many should I miss out before I take the next sample). This is quite tractable. Suppose you want to select r items out of n without replacement. In particular, we want to select r integers from 1:n There are C(n,r) = n!/(r!(n-r)!) distinct ways to do this. If we now stipulate that none of the first k items are selected, there are C(n-k, r) ways to do it. Define G(k; n, r) = C(n-k,r)/C(n, r) Clearly G is monotone decreasing with G(0) = 1 and G(n-r+1) = 0 This is how we can choose the next item to select. Pick a random number from U(0,1), call it z. Find the smallest largest k s.t. G(k; n,r) > z This means you should skip the next k possible items and choose the next one. Here is some SAS code which implements this algorithm. It uses a binary search to find k.. This algorithm is very fast. To choose r out of n takes O(r log n) steps. I have tested it a bit, but not as much as if I were going to use it for something really important. On my less than state of the art PC I estimate it wopuld take 100 s to pick 5e5 from 1e10. %let n = 10000000000; /* select from 1:&n */ %let r = 100000; /* select &r items */ %let seed = 12345; /* random number seed */ %macro lg(k, n, r); (lgamma((&n)-(&k)+1) - lgamma((&n)+1) - lgamma ((&n)-(&r)-(&k)+1) + lgamma ((&n)-(&r)+1)); %mend; data select; n = &n; r = &r; ilast = 0; nleft = n; rleft = r; do isel = 1 to r; z = ranuni(&seed); lz = log(z); lb = 0; ub = nleft - rleft + 1; /* going to pick k from interval [lb, ub) */ do while (ub -lb > 1); middle = int((lb+ub)/2); lg = %lg(middle, nleft, rleft); if lg >= lz then lb = middle; else ub = middle; end; select = ilast + lb + 1; ilast = select; nleft = n - ilast; rleft = r - isel; output; end; run; Tim Auton John Whittington wrote: > At 20:46 03/03/00 GMT, Paul Dorfman wrote: > > >Try this: > > > >19 %let n = 1e10; > >20 data _null_; > >21 array r (500000) _temporary_; > >22 s = &n / dim(r); > >23 do i=1 to dim(r); > >24 r(i) = (i-1)*s + ceil(ranuni(1)*s); > >25 end; > >26 run; > >NOTE: The DATA statement used 0.02 CPU seconds and 7062K. > > > >The idea here is to split the entire [1:1e10] range into 5e5 equal pieces > >and make the next selection from the next consecutive interval containing > >&n/dim(r) numbers. By the nature of the algorithm, the selected items will > >be unique and ascending. Intuitively, it should provide for a strictly equal > >chance for any [1:1e10] number to be selected if &n is a multiple of dim(r). > >If not, there should be some skewness pertaining to the choice from the last > >interval, but I would not expect it to be sizable. Someone please correct me > >if my hunch is out of whack (Dr. John? David?). > > Paul, Dan seems to have beaten me to it, and said most of what I would have > said. As he points out, your code will produce a stratified random sample. > This might well be OK for the John's purpose, but it will, 'on average', > be a much more 'evenly spread' sample than would (again 'on average') be > the case with a simple random sample. One of the illusions that many > Lottery players are under is that a random sample ought to be fairly > 'uniform' in its spread! > > Dan goes on to write: > > >My initial thought was to use a without-replacement sampling scheme. > Something > >like: > > > >%let total=1e10; > >%let wanted=500000; > > > >n=0; > >needed=&wanted; > >remain=total; > >do i=1 to &total until(needed eq 0); > > if ranuni(-1) le needed/remain > > then do; > > n+1; > > r[n]=i; > > needed+(-1); > > end; > > remain+(-1); > >end; > > > >This will do the job of a simple random sample, but it not very efficient. > It > >would have taken over 10 hours on my 133Mhz pc. Maybe someone else can > produce > >a more efficient simple random sample. On the other hand if time isn't a > >problem, this works. > > Apart from a missing amersand (remain=&total) I agree - but, just like Dan, > I am using a 133 Mhz PC, and so would find this more than a little tedious! > > As always in these situations, it's the 'without replacement' which is the > pain. For 'with replacement' one could simply use something like: > > data _null_; > array r (500000) _temporary_; > do i=1 to dim(r); > r(i) = ceil(ranuni(whatever)*1e10) ; > end; > > run; > > Ironically, since we are sampling 500,000 from 1e10, the chance of any > value being repeated is pretty tiny - so, in this particular case, the > 'with replacement' solution stands a very high chance of effectively being > a 'without replacement' one. However, that obviously is not a certainty. > > If "Paul's favourite array sorting routine" can be easily adapted to remove > duplicates, then one might consider something like (untested!): > > data _null_ ; > array s (600000) _temporary_ ; > array r (500000) _temporary_ ; > do i=1 to dim(s); > s(i) = ceil(ranuni(whatever)*1e10) ; > end; > > do i = 1 to dim(r) ; > do until r(i) ne . ; > r(i) = s(ranuni(whatever)*dim(s)) ; > end ; > run; > > Of course, if one had adequate resources to set up a 1e10 temporary array, > the one might consider something like (again untested): > > data _null_; > array r (500000) _temporary_; > array s (1e10) _temporary_ ; > do i=1 to dim(r); > OK = 0 ; > do until OK = 1 ; > r(i) = ceil(ranuni(whatever)*1e10) ; > if s(r(i)) = . then do ; > OK = 1 ; s(r(i)) = 1 ; > end ; > end ; > end; > > run; > > I think that exhausts my immediate thoughts! > > Kind Regards, > > John > > ---------------------------------------------------------------- > Dr John Whittington, Voice: +44 (0) 1296 730225 > Mediscience Services Fax: +44 (0) 1296 738893 > Twyford Manor, Twyford, E-mail: John.W@mediscience.co.uk > Buckingham MK18 4EL, UK mediscience@compuserve.com > ---------------------------------------------------------------- -- T R Auton PhD MSc C.Math Head of Biomedical Statistics Proteus Molecular Design Ltd Beechfield House Lyme Green Business Park Macclesfield Cheshire SK11 0JL UK email: tim.auton@protherics.com ```

Back to: Top of message | Previous page | Main SAS-L page