Date: Fri, 3 Mar 2000 20:46:28 GMT
Reply-To: sashole@mediaone.net
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Paul Dorfman <paul_dorfman@HOTMAIL.COM>
Subject: Re: Ordered Random Extract
Content-Type: text/plain; format=flowed
John,
Try this:
19 %let n = 1e10;
20 data _null_;
21 array r (500000) _temporary_;
22 s = &n / dim(r);
23 do i=1 to dim(r);
24 r(i) = (i-1)*s + ceil(ranuni(1)*s);
25 end;
26 run;
NOTE: The DATA statement used 0.02 CPU seconds and 7062K.
The idea here is to split the entire [1:1e10] range into 5e5 equal pieces
and make the next selection from the next consecutive interval containing
&n/dim(r) numbers. By the nature of the algorithm, the selected items will
be unique and ascending. Intuitively, it should provide for a strictly equal
chance for any [1:1e10] number to be selected if &n is a multiple of dim(r).
If not, there should be some skewness pertaining to the choice from the last
interval, but I would not expect it to be sizable. Someone please correct me
if my hunch is out of whack (Dr. John? David?). To get a real feel of what
this process spits out, let us see how it will select 50 items out of
1000000 (same proportion):
85 %let n = 1e6;
86 data _null_;
87 array r (1:50) _temporary_;
88 s = &n / dim(r);
89 do i=1 to dim(r);
90 r(i) = (i-1)*s + ceil(ranuni(1)*s);
91 file log ls=40;
92 put r(i) z6. @ +1;
93 end;
94 run;
003700 039402 047997 065188 098433
119386 130860 150634 160996 181332
216387 230478 257068 261344 299141
305944 325453 353799 379536 384531
413765 428256 451172 465745 489516
516900 532691 551808 571652 587541
614568 630133 658625 678583 691794
705945 727821 749449 773591 783362
803331 837423 845976 878693 898010
911376 920991 942712 970227 988665
NOTE: The DATA statement used 0.01 CPU seconds and 3248K.
BTW, the same principle can be apparently applied to the selection from a
SAS dataset using POINT= to eliminate the necessity of reading the entire
darn thing and subsequent sorting. Say,
95 %let n = 1e8;
96 %let k = 5e3;
97 data big; do v=1 to &n; output; end; run;
NOTE: The data set WORK.BIG has 100000000 observations and 1 variables.
NOTE: The DATA statement used 51.23 CPU seconds and 3600K.
98 data sample;
99 s = n / &k;
100 do i=1 to &k;
101 r = (i-1)*s + ceil(ranuni(1)*s);
102 set big point=r nobs=n;
103 output;
104 end;
105 stop;
106 run;
NOTE: The data set WORK.SAMPLE has 5000 observations and 3 variables.
NOTE: The DATA statement used 0.39 CPU seconds and 3767K.
HTH.
Kind regards,
=======================
Paul M. Dorfman
Jacksonville, FL
=======================
>From: John Megargee <johnmegargee@NETSCAPE.NET>
>Somewhere in the middle of a long datastep I've inherited I need to
>populate a
>huge temp array like
>
>array r(500000) _temporary_;
>
>with unique positive integers chosen randomly from 1 to 1e10 without
>replacement. It wouldn't be a problem if I didn't need the populated array
>to
>be in ascedning order. Putting it another way, I want the elements of the
>array to retain their relative positions in the set [1:1e10] they've been
>selected from. Any ideas about doing it efficiently (hopefully without
>leaving
>the step)? I'll appreciate them all.
>
>TIA, John
>
>____________________________________________________________________
>Get your own FREE, personal Netscape WebMail account today at
>http://webmail.netscape.com.
______________________________________________________
Get Your Private, Free Email at http://www.hotmail.com