Date: Mon, 6 Mar 2000 15:58:09 +0000
Reply-To: tra <Tim.Auton@PROTHERICS.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: tra <Tim.Auton@PROTHERICS.COM>
Organization: Proteus Molecular Design Ltd
Subject: Re: Ordered Random Extract
Content-Type: text/plain; charset=us-ascii
Paul, John
I only saw this thread today, so forgive me if I am recovering old ground.
It seems to me that you want a way to generate the selection in order, item by
item.
You can do this if you can calculate the probability distribution of (how many
should I miss out before I take the next sample).
This is quite tractable.
Suppose you want to select r items out of n without replacement. In particular,
we want to select r integers from 1:n
There are C(n,r) = n!/(r!(n-r)!) distinct ways to do this.
If we now stipulate that none of the first k items are selected, there are C(n-k,
r) ways to do it.
Define G(k; n, r) = C(n-k,r)/C(n, r)
Clearly G is monotone decreasing with G(0) = 1 and G(n-r+1) = 0
This is how we can choose the next item to select.
Pick a random number from U(0,1), call it z.
Find the smallest largest k s.t. G(k; n,r) > z
This means you should skip the next k possible items and choose the next one.
Here is some SAS code which implements this algorithm. It uses a binary search
to find k..
This algorithm is very fast. To choose r out of n takes O(r log n) steps.
I have tested it a bit, but not as much as if I were going to use it for
something really important.
On my less than state of the art PC I estimate it wopuld take 100 s to pick 5e5
from 1e10.
%let n = 10000000000; /* select from 1:&n */
%let r = 100000; /* select &r items */
%let seed = 12345; /* random number seed */
%macro lg(k, n, r);
(lgamma((&n)-(&k)+1) - lgamma((&n)+1) - lgamma ((&n)-(&r)-(&k)+1) + lgamma
((&n)-(&r)+1));
%mend;
data select;
n = &n;
r = &r;
ilast = 0;
nleft = n;
rleft = r;
do isel = 1 to r;
z = ranuni(&seed);
lz = log(z);
lb = 0;
ub = nleft - rleft + 1;
/* going to pick k from interval [lb, ub) */
do while (ub -lb > 1);
middle = int((lb+ub)/2);
lg = %lg(middle, nleft, rleft);
if lg >= lz then lb = middle;
else ub = middle;
end;
select = ilast + lb + 1;
ilast = select;
nleft = n - ilast;
rleft = r - isel;
output;
end;
run;
Tim Auton
John Whittington wrote:
> At 20:46 03/03/00 GMT, Paul Dorfman wrote:
>
> >Try this:
> >
> >19 %let n = 1e10;
> >20 data _null_;
> >21 array r (500000) _temporary_;
> >22 s = &n / dim(r);
> >23 do i=1 to dim(r);
> >24 r(i) = (i-1)*s + ceil(ranuni(1)*s);
> >25 end;
> >26 run;
> >NOTE: The DATA statement used 0.02 CPU seconds and 7062K.
> >
> >The idea here is to split the entire [1:1e10] range into 5e5 equal pieces
> >and make the next selection from the next consecutive interval containing
> >&n/dim(r) numbers. By the nature of the algorithm, the selected items will
> >be unique and ascending. Intuitively, it should provide for a strictly equal
> >chance for any [1:1e10] number to be selected if &n is a multiple of dim(r).
> >If not, there should be some skewness pertaining to the choice from the last
> >interval, but I would not expect it to be sizable. Someone please correct me
> >if my hunch is out of whack (Dr. John? David?).
>
> Paul, Dan seems to have beaten me to it, and said most of what I would have
> said. As he points out, your code will produce a stratified random sample.
> This might well be OK for the John's purpose, but it will, 'on average',
> be a much more 'evenly spread' sample than would (again 'on average') be
> the case with a simple random sample. One of the illusions that many
> Lottery players are under is that a random sample ought to be fairly
> 'uniform' in its spread!
>
> Dan goes on to write:
>
> >My initial thought was to use a without-replacement sampling scheme.
> Something
> >like:
> >
> >%let total=1e10;
> >%let wanted=500000;
> >
> >n=0;
> >needed=&wanted;
> >remain=total;
> >do i=1 to &total until(needed eq 0);
> > if ranuni(-1) le needed/remain
> > then do;
> > n+1;
> > r[n]=i;
> > needed+(-1);
> > end;
> > remain+(-1);
> >end;
> >
> >This will do the job of a simple random sample, but it not very efficient.
> It
> >would have taken over 10 hours on my 133Mhz pc. Maybe someone else can
> produce
> >a more efficient simple random sample. On the other hand if time isn't a
> >problem, this works.
>
> Apart from a missing amersand (remain=&total) I agree - but, just like Dan,
> I am using a 133 Mhz PC, and so would find this more than a little tedious!
>
> As always in these situations, it's the 'without replacement' which is the
> pain. For 'with replacement' one could simply use something like:
>
> data _null_;
> array r (500000) _temporary_;
> do i=1 to dim(r);
> r(i) = ceil(ranuni(whatever)*1e10) ;
> end;
> <Paul's favourite array sorting routine here!>
> run;
>
> Ironically, since we are sampling 500,000 from 1e10, the chance of any
> value being repeated is pretty tiny - so, in this particular case, the
> 'with replacement' solution stands a very high chance of effectively being
> a 'without replacement' one. However, that obviously is not a certainty.
>
> If "Paul's favourite array sorting routine" can be easily adapted to remove
> duplicates, then one might consider something like (untested!):
>
> data _null_ ;
> array s (600000) _temporary_ ;
> array r (500000) _temporary_ ;
> do i=1 to dim(s);
> s(i) = ceil(ranuni(whatever)*1e10) ;
> end;
> <Paul's favourite array sorting + duplicate removing routine here!>
> do i = 1 to dim(r) ;
> do until r(i) ne . ;
> r(i) = s(ranuni(whatever)*dim(s)) ;
> end ;
> run;
>
> Of course, if one had adequate resources to set up a 1e10 temporary array,
> the one might consider something like (again untested):
>
> data _null_;
> array r (500000) _temporary_;
> array s (1e10) _temporary_ ;
> do i=1 to dim(r);
> OK = 0 ;
> do until OK = 1 ;
> r(i) = ceil(ranuni(whatever)*1e10) ;
> if s(r(i)) = . then do ;
> OK = 1 ; s(r(i)) = 1 ;
> end ;
> end ;
> end;
> <Paul's favourite array sorting routine here!>
> run;
>
> I think that exhausts my immediate thoughts!
>
> Kind Regards,
>
> John
>
> ----------------------------------------------------------------
> Dr John Whittington, Voice: +44 (0) 1296 730225
> Mediscience Services Fax: +44 (0) 1296 738893
> Twyford Manor, Twyford, E-mail: John.W@mediscience.co.uk
> Buckingham MK18 4EL, UK mediscience@compuserve.com
> ----------------------------------------------------------------
--
T R Auton PhD MSc C.Math
Head of Biomedical Statistics
Proteus Molecular Design Ltd
Beechfield House
Lyme Green Business Park
Macclesfield
Cheshire SK11 0JL
UK
email: tim.auton@protherics.com
|