Date: Wed, 21 Jul 2004 14:12:43 -0400 sashole@bellsouth.net "SAS(r) Discussion" "Paul M. Dorfman" Sashole of Florida Re: Select k least elements of a variable in a dataset To: "Sharma, Diwakar (Corporate)" <7D2EA86E0DEB6F46994CDE7BA21B968101C64BF5@BANMLVEM02.e2k.ad.ge.com> text/plain; charset="us-ascii"

Diwakar,

The algorithm somewhat depends on the value of K, since at K=1 it is trivial :-), so I assume (1 << K << 100000) holds. Unfortunately, with the number of total rows you have, the task is inherently time-consuming. Regardless of the algorithm, it is impossible to select K smallest elements out of N elements without at least one full pass through N, which means that you are at least looking at reading 1.5 billion records.

At a single data set level, sifting through a variety of techniques fitting the task, I do not see another way of ranking the variable in question *while the data set is being generated* but to index the variable, then use the index to extract the needed K records:

%let K = 100 ;

Data single_x (index = (var)) ; array other [20] ; *<-satellite variables ; do _n_ = 1 to 100000 ; var = ranuni (1) ; output ; end ; Run ; Data K ; set single_x ; by var ; if _n_ > &k then stop ; Run ;

It ran about twice as fast overall (0.5 seconds vs 1.25 seconds) as the sort approach:

Data single ; *<-no index here ; array other [20] ; *<-satellite variables ; do _n_ = 1 to 100000 ; var = ranuni (1) ; output ; end ; Run ; Proc sort data = single ; by var ; Run ; Data K ; set single (obs = &k) ; Run ;

The difference may be negligible, though, if the satellite variables overburdening the sort are few. However, PROC RANK must be using much better optimized ranking algorithm (it should be no surprise, I guess), for it was four times faster, overall (0.21 seconds), than the SORT approach:

Proc rank data = sinlge out = K (where = (_n_ <= &k)) ; ranks _n_ ; var var ; Run ;

_N_ *will* be included in K data set, but the very first data step reading K will lose it.

Now if you have V9 available, it may come to rescue here, since it is possible to accumulate values in an ordered hash table on the fly - which means that the entire feat can be pulled in the Data step where SINGLE is being created (if it is being created in a Data step, that is):

Data K ( drop = _: ) ; array other [20] ;

dcl hash v (hashexp: 16, ordered: 'a') ; dcl hiter vi ('v') ; v.definekey ('var', '_n_') ; v.definedata ('var') ; do _j = 1 to dim (other) ; v.definedata ( vname (other[_j]) ) ; end ; v.definedone () ;

do _n_ = 1 to 100000 ; var = ranuni (1) ; v.add () ; end ;

_rc = vi.first() ; do _n_ = 1 by 1 while ( _rc = 0 ) ; if _n_ <= &k then output ; _rc = vi.next() ; end ; Run ;

This code runs faster than SORT, overall, but slower than either indexing or ranking. I would envisage, however, that if you had the data stacked together with an ID variable indicating the virtual data set number, this technique, applied to each ID by-group separately, could be fastest of all.

Kind regards, ---------------- Paul M. Dorfman Jacksonville, FL ----------------

> -----Original Message----- > From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On > Behalf Of Sharma, Diwakar (Corporate) > Sent: Wednesday, July 21, 2004 3:03 AM > To: SAS-L@LISTSERV.UGA.EDU > Subject: Select k least elements of a variable in a dataset > > Hi all, > > We do a lot of simulation activity, after which the dataset > generated are sorted and then the worst k values of a given > variable are selected. > Sort takes a lot of time, given the fact that every dataset > has 100,000 rows and there are some 15000 such datasets. > > Is there any way with which the same can be done without > sorting the dataset. > Any pointers / algos / code would be really helpful. > The dataset can be altered (row swapping ie) > > > Thanks in advance > > Diwakar Sharma >

Back to: Top of message | Previous page | Main SAS-L page