| Date: | Wed, 21 Jul 2004 14:12:43 -0400 |
| Reply-To: | sashole@bellsouth.net |
| Sender: | "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU> |
| From: | "Paul M. Dorfman" <sashole@BELLSOUTH.NET> |
| Organization: | Sashole of Florida |
| Subject: | Re: Select k least elements of a variable in a dataset |
|
| In-Reply-To: | <7D2EA86E0DEB6F46994CDE7BA21B968101C64BF5@BANMLVEM02.e2k.ad.ge.com> |
| Content-Type: | text/plain; charset="us-ascii" |
Diwakar,
The algorithm somewhat depends on the value of K, since at K=1 it is trivial
:-), so I assume (1 << K << 100000) holds. Unfortunately, with the number of
total rows you have, the task is inherently time-consuming. Regardless of
the algorithm, it is impossible to select K smallest elements out of N
elements without at least one full pass through N, which means that you are
at least looking at reading 1.5 billion records.
At a single data set level, sifting through a variety of techniques fitting
the task, I do not see another way of ranking the variable in question
*while the data set is being generated* but to index the variable, then use
the index to extract the needed K records:
%let K = 100 ;
Data single_x (index = (var)) ;
array other [20] ; *<-satellite variables ;
do _n_ = 1 to 100000 ;
var = ranuni (1) ;
output ;
end ;
Run ;
Data K ;
set single_x ;
by var ;
if _n_ > &k then stop ;
Run ;
It ran about twice as fast overall (0.5 seconds vs 1.25 seconds) as the sort
approach:
Data single ; *<-no index here ;
array other [20] ; *<-satellite variables ;
do _n_ = 1 to 100000 ;
var = ranuni (1) ;
output ;
end ;
Run ;
Proc sort data = single ;
by var ;
Run ;
Data K ;
set single (obs = &k) ;
Run ;
The difference may be negligible, though, if the satellite variables
overburdening the sort are few. However, PROC RANK must be using much better
optimized ranking algorithm (it should be no surprise, I guess), for it was
four times faster, overall (0.21 seconds), than the SORT approach:
Proc rank data = sinlge out = K (where = (_n_ <= &k)) ;
ranks _n_ ;
var var ;
Run ;
_N_ *will* be included in K data set, but the very first data step reading K
will lose it.
Now if you have V9 available, it may come to rescue here, since it is
possible to accumulate values in an ordered hash table on the fly - which
means that the entire feat can be pulled in the Data step where SINGLE is
being created (if it is being created in a Data step, that is):
Data K ( drop = _: ) ;
array other [20] ;
dcl hash v (hashexp: 16, ordered: 'a') ;
dcl hiter vi ('v') ;
v.definekey ('var', '_n_') ;
v.definedata ('var') ;
do _j = 1 to dim (other) ;
v.definedata ( vname (other[_j]) ) ;
end ;
v.definedone () ;
do _n_ = 1 to 100000 ;
var = ranuni (1) ;
v.add () ;
end ;
_rc = vi.first() ;
do _n_ = 1 by 1 while ( _rc = 0 ) ;
if _n_ <= &k then output ;
_rc = vi.next() ;
end ;
Run ;
This code runs faster than SORT, overall, but slower than either indexing or
ranking. I would envisage, however, that if you had the data stacked
together with an ID variable indicating the virtual data set number, this
technique, applied to each ID by-group separately, could be fastest of all.
Kind regards,
----------------
Paul M. Dorfman
Jacksonville, FL
----------------
> -----Original Message-----
> From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On
> Behalf Of Sharma, Diwakar (Corporate)
> Sent: Wednesday, July 21, 2004 3:03 AM
> To: SAS-L@LISTSERV.UGA.EDU
> Subject: Select k least elements of a variable in a dataset
>
> Hi all,
>
> We do a lot of simulation activity, after which the dataset
> generated are sorted and then the worst k values of a given
> variable are selected.
> Sort takes a lot of time, given the fact that every dataset
> has 100,000 rows and there are some 15000 such datasets.
>
> Is there any way with which the same can be done without
> sorting the dataset.
> Any pointers / algos / code would be really helpful.
> The dataset can be altered (row swapping ie)
>
>
> Thanks in advance
>
> Diwakar Sharma
>
|