LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (March 2000, week 1)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Mon, 6 Mar 2000 15:58:09 +0000
Reply-To:     tra <Tim.Auton@PROTHERICS.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
Comments:     To: John Whittington <John.W@MEDISCIENCE.CO.UK>
Comments:     cc: "Dorfman, Paul" <pdorfma@CITICORP.COM>
From:         tra <Tim.Auton@PROTHERICS.COM>
Organization: Proteus Molecular Design Ltd
Subject:      Re: Ordered Random Extract
Comments: To: SAS-L@LISTSERV.VT.EDU
Content-Type: text/plain; charset=us-ascii

Paul, John

I only saw this thread today, so forgive me if I am recovering old ground.

It seems to me that you want a way to generate the selection in order, item by item.

You can do this if you can calculate the probability distribution of (how many should I miss out before I take the next sample). This is quite tractable. Suppose you want to select r items out of n without replacement. In particular, we want to select r integers from 1:n There are C(n,r) = n!/(r!(n-r)!) distinct ways to do this. If we now stipulate that none of the first k items are selected, there are C(n-k, r) ways to do it. Define G(k; n, r) = C(n-k,r)/C(n, r) Clearly G is monotone decreasing with G(0) = 1 and G(n-r+1) = 0

This is how we can choose the next item to select. Pick a random number from U(0,1), call it z. Find the smallest largest k s.t. G(k; n,r) > z This means you should skip the next k possible items and choose the next one.

Here is some SAS code which implements this algorithm. It uses a binary search to find k.. This algorithm is very fast. To choose r out of n takes O(r log n) steps. I have tested it a bit, but not as much as if I were going to use it for something really important. On my less than state of the art PC I estimate it wopuld take 100 s to pick 5e5 from 1e10.

%let n = 10000000000; /* select from 1:&n */ %let r = 100000; /* select &r items */ %let seed = 12345; /* random number seed */ %macro lg(k, n, r); (lgamma((&n)-(&k)+1) - lgamma((&n)+1) - lgamma ((&n)-(&r)-(&k)+1) + lgamma ((&n)-(&r)+1)); %mend; data select; n = &n; r = &r; ilast = 0; nleft = n; rleft = r; do isel = 1 to r; z = ranuni(&seed); lz = log(z); lb = 0; ub = nleft - rleft + 1; /* going to pick k from interval [lb, ub) */ do while (ub -lb > 1); middle = int((lb+ub)/2); lg = %lg(middle, nleft, rleft); if lg >= lz then lb = middle; else ub = middle; end; select = ilast + lb + 1; ilast = select; nleft = n - ilast; rleft = r - isel; output; end; run;

Tim Auton

John Whittington wrote:

> At 20:46 03/03/00 GMT, Paul Dorfman wrote: > > >Try this: > > > >19 %let n = 1e10; > >20 data _null_; > >21 array r (500000) _temporary_; > >22 s = &n / dim(r); > >23 do i=1 to dim(r); > >24 r(i) = (i-1)*s + ceil(ranuni(1)*s); > >25 end; > >26 run; > >NOTE: The DATA statement used 0.02 CPU seconds and 7062K. > > > >The idea here is to split the entire [1:1e10] range into 5e5 equal pieces > >and make the next selection from the next consecutive interval containing > >&n/dim(r) numbers. By the nature of the algorithm, the selected items will > >be unique and ascending. Intuitively, it should provide for a strictly equal > >chance for any [1:1e10] number to be selected if &n is a multiple of dim(r). > >If not, there should be some skewness pertaining to the choice from the last > >interval, but I would not expect it to be sizable. Someone please correct me > >if my hunch is out of whack (Dr. John? David?). > > Paul, Dan seems to have beaten me to it, and said most of what I would have > said. As he points out, your code will produce a stratified random sample. > This might well be OK for the John's purpose, but it will, 'on average', > be a much more 'evenly spread' sample than would (again 'on average') be > the case with a simple random sample. One of the illusions that many > Lottery players are under is that a random sample ought to be fairly > 'uniform' in its spread! > > Dan goes on to write: > > >My initial thought was to use a without-replacement sampling scheme. > Something > >like: > > > >%let total=1e10; > >%let wanted=500000; > > > >n=0; > >needed=&wanted; > >remain=total; > >do i=1 to &total until(needed eq 0); > > if ranuni(-1) le needed/remain > > then do; > > n+1; > > r[n]=i; > > needed+(-1); > > end; > > remain+(-1); > >end; > > > >This will do the job of a simple random sample, but it not very efficient. > It > >would have taken over 10 hours on my 133Mhz pc. Maybe someone else can > produce > >a more efficient simple random sample. On the other hand if time isn't a > >problem, this works. > > Apart from a missing amersand (remain=&total) I agree - but, just like Dan, > I am using a 133 Mhz PC, and so would find this more than a little tedious! > > As always in these situations, it's the 'without replacement' which is the > pain. For 'with replacement' one could simply use something like: > > data _null_; > array r (500000) _temporary_; > do i=1 to dim(r); > r(i) = ceil(ranuni(whatever)*1e10) ; > end; > <Paul's favourite array sorting routine here!> > run; > > Ironically, since we are sampling 500,000 from 1e10, the chance of any > value being repeated is pretty tiny - so, in this particular case, the > 'with replacement' solution stands a very high chance of effectively being > a 'without replacement' one. However, that obviously is not a certainty. > > If "Paul's favourite array sorting routine" can be easily adapted to remove > duplicates, then one might consider something like (untested!): > > data _null_ ; > array s (600000) _temporary_ ; > array r (500000) _temporary_ ; > do i=1 to dim(s); > s(i) = ceil(ranuni(whatever)*1e10) ; > end; > <Paul's favourite array sorting + duplicate removing routine here!> > do i = 1 to dim(r) ; > do until r(i) ne . ; > r(i) = s(ranuni(whatever)*dim(s)) ; > end ; > run; > > Of course, if one had adequate resources to set up a 1e10 temporary array, > the one might consider something like (again untested): > > data _null_; > array r (500000) _temporary_; > array s (1e10) _temporary_ ; > do i=1 to dim(r); > OK = 0 ; > do until OK = 1 ; > r(i) = ceil(ranuni(whatever)*1e10) ; > if s(r(i)) = . then do ; > OK = 1 ; s(r(i)) = 1 ; > end ; > end ; > end; > <Paul's favourite array sorting routine here!> > run; > > I think that exhausts my immediate thoughts! > > Kind Regards, > > John > > ---------------------------------------------------------------- > Dr John Whittington, Voice: +44 (0) 1296 730225 > Mediscience Services Fax: +44 (0) 1296 738893 > Twyford Manor, Twyford, E-mail: John.W@mediscience.co.uk > Buckingham MK18 4EL, UK mediscience@compuserve.com > ----------------------------------------------------------------

-- T R Auton PhD MSc C.Math Head of Biomedical Statistics Proteus Molecular Design Ltd Beechfield House Lyme Green Business Park Macclesfield Cheshire SK11 0JL UK email: tim.auton@protherics.com


Back to: Top of message | Previous page | Main SAS-L page