LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (June 2000, week 1)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Tue, 6 Jun 2000 13:03:31 +0200
Reply-To:     Jim Groeneveld <J.Groeneveld@ITGROUPS.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Jim Groeneveld <J.Groeneveld@ITGROUPS.COM>
Subject:      Re: De-Duplication
Comments: To: "sashole@mediaone.net" <sashole@mediaone.net>
Content-Type: text/plain; charset="iso-8859-1"

Paul,

Why do you make it that complicated? It can be done much more simple and 2.5 times faster:

data rnd (keep=rnd); do _n_=1 to 1e6; rnd = int(ranuni(1)*1e6) + 1e10; rec = int(ranuni(1)*1e6); output; end; run;

%let hs = 1000000; data nodup; array r(0:&hs) _temporary_; set rnd; _n_ = mod(rnd,&hs); if r(_n_) = rnd then delete; r(_n_) = rnd; run;

Instead of filling with and checking for RND the array may also be filled with and checked for some constant value, like 1.

Regards - Jim. -- Y. (Jim) Groeneveld, MSc IMRO TRAMARKO tel. +31 412 407 070 senior statistician, P.O. Box 1 fax. +31 412 407 080 head IT department 5350 AA BERGHEM IMRO TRAMARKO: a CRO J.Groeneveld@ITGroups.com the Netherlands in clinical research

My computer does not need me at all, but I can't do without it anymore.

> -----Original Message----- > From: Paul Dorfman [SMTP:paul_dorfman@HOTMAIL.COM] > Sent: Friday, June 02, 2000 11:06 PM > To: SAS-L@LISTSERV.UGA.EDU > Subject: Re: De-Duplication > > Bolvan, > > It is amazing how deply rooted SAS habits are. Once I perceived the stem > 'dup', 'sort' and 'distinct' poped up in my head. Of course you can use > SORT > with NODUPKEY, having, prior to doing that, created a unique record ID for > RND that will help resurrect the original order. The 'standard' step > sequence usually looks like > > data v / view=v; > set rnd; > seq ++ 1; > run; > proc sort data=v out=nodup nodupkey equals; > by rnd; > run; > proc sort data=nodup out=nodup(drop=seq); > by seq; > run; > > However, for a second, let us try to forget the banal methods we have > learned using SAS for years and ask: Does this 'standard' SAS logic make > any > sense? And the answer is: Absolutely not! From the standpoint of the task > at > hand, we need not rearrange the records at all, for in the output, they > must > be in the same sequence as in the input. Yet we are performing the most > expensive operation in data processing - sorting - twice, just to be able > to > use NODUPKEY as its by-product. In comparison, the common sense (aka > programming) logic would by as plain as a brick: > > 1. Read the 'next' record. > 2. Look up a table for the key RND associated with the record. > 3. If the key is already there, the record is duplicate. Go to 1. > 4. Write the record out and store the key in the table. Go to 1. > > In the DATA step, it does not get any simpler, either. Let us have some > test > data first: > > 2 data rnd (keep=rnd); > 3 do _n_=1 to 1e6; > 4 rnd = int(ranuni(1)*1e6) + 1e10; > 5 rec = int(ranuni(1)*1e6); > 6 output; > 7 end; > 8 run; > NOTE: The data set WORK.RND has 1000000 observations and 1 variables. > NOTE: The DATA statement used 3.02 CPU seconds and 6199K. > > And now: > > %let hs = 2000003; > 9 data nodup; > 10 array r (0:&hs) _temporary_; > 11 set rnd; > 12 do _n_=mod(rnd,&hs) by -1 until (r(_n_) = .); > 13 if _n_ < 0 then _n_ = &hs; > 14 if r(_n_) = rnd then delete; > 15 end; > 16 r(_n_) = rnd; > 17 run; > NOTE: The data set WORK.NODUP has 632092 observations and 1 variables. > NOTE: The DATA statement used 3.12 CPU seconds and 21830K. > > But what is that weird 2000003 doing up above? It is just the first prime > number greater than 1e6*2. And why the factor 2? Because I am using a hash > table with linear probing, and it will not perform well if is less than > about 50% sparse. However, by adding a single line of code, it can be > transformed into a table with double hashing, thus cutting down on memory > consumption, for the double hashing table performs acceptably well even if > it is 80-90% full: > > %let hs = 1250003; > data nodup; > array r (0:&hs) _temporary_; > set rnd; > c = 1 + mod(rnd,%eval(&hs-2)); > do _n_=mod(rnd,&hs) by -c until (r(_n_) = .); > if _n_ < 0 then _n_ ++ &hs; > if r(_n_) = rnd then delete; > end; > r(_n_) = rnd; > run; > NOTE: The data set WORK.NODUP has 632092 observations and 2 variables. > NOTE: The DATA statement used 3.74 CPU seconds and 16204K. > > How does the 'standard' method posted in the beginning fare in comparison? > To make the long story short, the three steps run in 9.39 CPU seconds, > overall. But I did not write this opus to brag about the 5 CPU seconds > difference. Rather, I wanted to emphasize, and, in the spirit of Friday > brought by Bernard, amuse fellow SAS-Lers with the funny fact that quite > often, we resort to stream-of-the-consciosness solutions whilst a > straightforward stream-of-the-sense solution lies on the surface. > > Kind regards, > =================== > Paul M. Dorfman > Jacksonville, Fl > =================== > > > > > > > > >From: bolvandubina@NETSCAPE.NET > >Reply-To: bolvandubina@NETSCAPE.NET > >To: SAS-L@LISTSERV.UGA.EDU > >Subject: De-Duplication > >Date: Fri, 2 Jun 2000 15:32:06 EDT > > > >Hello: > > > >Could I please ask for a help with a problem. I have a SAS table RND with > 2 > >numeric columns RND and REC and 1 million observations. What I need to do > >is remove duplicates by RND. I know I could use SORT with NODUPKEY. But > the > >problem is that the order of REC must remain the same as in the original > >dataset, and SORT changes it. In other words, suppose I have in RND: > > > >RND REC > >3 1 > >2 1 > >3 2 > >4 1 > >1 1 > >3 3 > >4 2 > >1 2 > >4 3 > >2 2 > > > >The correct output will be: > > > >RND REC > >3 1 > >2 1 > >4 1 > >1 1 > > > >In actuality RND values are integer like above but may have upwards of 10 > >digits and REC column may have many repeating values itself. Thanks in > >advance for any input. > > > >Bolvan > > > >---------- > >Get your own FREE, personal Netscape Webmail account today at > >http://home.netscape.com/webmail/ > > ________________________________________________________________________ > Get Your Private, Free E-mail from MSN Hotmail at http://www.hotmail.com


Back to: Top of message | Previous page | Main SAS-L page