Date: Tue, 6 Jun 2000 13:03:31 +0200
Reply-To: Jim Groeneveld <J.Groeneveld@ITGROUPS.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Jim Groeneveld <J.Groeneveld@ITGROUPS.COM>
Subject: Re: De-Duplication
Content-Type: text/plain; charset="iso-8859-1"
Paul,
Why do you make it that complicated? It can be done much more simple and 2.5
times faster:
data rnd (keep=rnd);
do _n_=1 to 1e6;
rnd = int(ranuni(1)*1e6) + 1e10;
rec = int(ranuni(1)*1e6);
output;
end;
run;
%let hs = 1000000;
data nodup;
array r(0:&hs) _temporary_;
set rnd;
_n_ = mod(rnd,&hs);
if r(_n_) = rnd then delete;
r(_n_) = rnd;
run;
Instead of filling with and checking for RND the array may also be filled
with and checked for some constant value, like 1.
Regards - Jim.
--
Y. (Jim) Groeneveld, MSc IMRO TRAMARKO tel. +31 412 407 070
senior statistician, P.O. Box 1 fax. +31 412 407 080
head IT department 5350 AA BERGHEM IMRO TRAMARKO: a CRO
J.Groeneveld@ITGroups.com the Netherlands in clinical research
My computer does not need me at all, but I can't do without it anymore.
> -----Original Message-----
> From: Paul Dorfman [SMTP:paul_dorfman@HOTMAIL.COM]
> Sent: Friday, June 02, 2000 11:06 PM
> To: SAS-L@LISTSERV.UGA.EDU
> Subject: Re: De-Duplication
>
> Bolvan,
>
> It is amazing how deply rooted SAS habits are. Once I perceived the stem
> 'dup', 'sort' and 'distinct' poped up in my head. Of course you can use
> SORT
> with NODUPKEY, having, prior to doing that, created a unique record ID for
> RND that will help resurrect the original order. The 'standard' step
> sequence usually looks like
>
> data v / view=v;
> set rnd;
> seq ++ 1;
> run;
> proc sort data=v out=nodup nodupkey equals;
> by rnd;
> run;
> proc sort data=nodup out=nodup(drop=seq);
> by seq;
> run;
>
> However, for a second, let us try to forget the banal methods we have
> learned using SAS for years and ask: Does this 'standard' SAS logic make
> any
> sense? And the answer is: Absolutely not! From the standpoint of the task
> at
> hand, we need not rearrange the records at all, for in the output, they
> must
> be in the same sequence as in the input. Yet we are performing the most
> expensive operation in data processing - sorting - twice, just to be able
> to
> use NODUPKEY as its by-product. In comparison, the common sense (aka
> programming) logic would by as plain as a brick:
>
> 1. Read the 'next' record.
> 2. Look up a table for the key RND associated with the record.
> 3. If the key is already there, the record is duplicate. Go to 1.
> 4. Write the record out and store the key in the table. Go to 1.
>
> In the DATA step, it does not get any simpler, either. Let us have some
> test
> data first:
>
> 2 data rnd (keep=rnd);
> 3 do _n_=1 to 1e6;
> 4 rnd = int(ranuni(1)*1e6) + 1e10;
> 5 rec = int(ranuni(1)*1e6);
> 6 output;
> 7 end;
> 8 run;
> NOTE: The data set WORK.RND has 1000000 observations and 1 variables.
> NOTE: The DATA statement used 3.02 CPU seconds and 6199K.
>
> And now:
>
> %let hs = 2000003;
> 9 data nodup;
> 10 array r (0:&hs) _temporary_;
> 11 set rnd;
> 12 do _n_=mod(rnd,&hs) by -1 until (r(_n_) = .);
> 13 if _n_ < 0 then _n_ = &hs;
> 14 if r(_n_) = rnd then delete;
> 15 end;
> 16 r(_n_) = rnd;
> 17 run;
> NOTE: The data set WORK.NODUP has 632092 observations and 1 variables.
> NOTE: The DATA statement used 3.12 CPU seconds and 21830K.
>
> But what is that weird 2000003 doing up above? It is just the first prime
> number greater than 1e6*2. And why the factor 2? Because I am using a hash
> table with linear probing, and it will not perform well if is less than
> about 50% sparse. However, by adding a single line of code, it can be
> transformed into a table with double hashing, thus cutting down on memory
> consumption, for the double hashing table performs acceptably well even if
> it is 80-90% full:
>
> %let hs = 1250003;
> data nodup;
> array r (0:&hs) _temporary_;
> set rnd;
> c = 1 + mod(rnd,%eval(&hs-2));
> do _n_=mod(rnd,&hs) by -c until (r(_n_) = .);
> if _n_ < 0 then _n_ ++ &hs;
> if r(_n_) = rnd then delete;
> end;
> r(_n_) = rnd;
> run;
> NOTE: The data set WORK.NODUP has 632092 observations and 2 variables.
> NOTE: The DATA statement used 3.74 CPU seconds and 16204K.
>
> How does the 'standard' method posted in the beginning fare in comparison?
> To make the long story short, the three steps run in 9.39 CPU seconds,
> overall. But I did not write this opus to brag about the 5 CPU seconds
> difference. Rather, I wanted to emphasize, and, in the spirit of Friday
> brought by Bernard, amuse fellow SAS-Lers with the funny fact that quite
> often, we resort to stream-of-the-consciosness solutions whilst a
> straightforward stream-of-the-sense solution lies on the surface.
>
> Kind regards,
> ===================
> Paul M. Dorfman
> Jacksonville, Fl
> ===================
>
>
>
>
>
>
>
> >From: bolvandubina@NETSCAPE.NET
> >Reply-To: bolvandubina@NETSCAPE.NET
> >To: SAS-L@LISTSERV.UGA.EDU
> >Subject: De-Duplication
> >Date: Fri, 2 Jun 2000 15:32:06 EDT
> >
> >Hello:
> >
> >Could I please ask for a help with a problem. I have a SAS table RND with
> 2
> >numeric columns RND and REC and 1 million observations. What I need to do
> >is remove duplicates by RND. I know I could use SORT with NODUPKEY. But
> the
> >problem is that the order of REC must remain the same as in the original
> >dataset, and SORT changes it. In other words, suppose I have in RND:
> >
> >RND REC
> >3 1
> >2 1
> >3 2
> >4 1
> >1 1
> >3 3
> >4 2
> >1 2
> >4 3
> >2 2
> >
> >The correct output will be:
> >
> >RND REC
> >3 1
> >2 1
> >4 1
> >1 1
> >
> >In actuality RND values are integer like above but may have upwards of 10
> >digits and REC column may have many repeating values itself. Thanks in
> >advance for any input.
> >
> >Bolvan
> >
> >----------
> >Get your own FREE, personal Netscape Webmail account today at
> >http://home.netscape.com/webmail/
>
> ________________________________________________________________________
> Get Your Private, Free E-mail from MSN Hotmail at http://www.hotmail.com
|