LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (March 2007, week 5)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:   Fri, 30 Mar 2007 12:14:13 -0400
Reply-To:   Ed Heaton <EdHeaton@WESTAT.COM>
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   Ed Heaton <EdHeaton@WESTAT.COM>
Subject:   Re: dropping duplicates
Comments:   To: Jack Clark <JClark@chpdm.umbc.edu>, "Johnson, David" <David.Johnson@CBA.COM.AU>
In-Reply-To:   <456B52C41B724C41B96561D7AD283E7D0520F1@mail.chpdm.umbc.edu>
Content-Type:   text/plain; charset="us-ascii"

Some more methods of removing duplicates...

/*--------------------------------------------------------------------*/ data test; infile cards missover; input var1 $ var2 $ var3 $ var4 var5 $; cards; A A A 1 X A A A 2 R A A B 3 T A A B 4 M A A B 3 T ;

/* Let's make our dataset much larger so that we can compare performance. */ Data new ; Set test ; Do i=1 to 100000 ; Drop i ; Output ; End ; Run ;

/* Get a quoted, comma delimited list of the variables -- sans opening and closing quotes. */ Proc contents data=new out=vars noPrint ; Run ; Proc sql noPrint ; Select name into :allVars separated by '","' from vars ; Quit ;

Data _null_ ; /* Deduplicate when you don't care about the order of the output. */ Declare hash h( dataset:'new' ) ; h.defineKey( "&allVars" ) ; h.defineDone() ; h.output( dataset:'deduped1' ) ; Set new( obs=0 ) ; Run ;

Data deduped2 ; /* Deduplciate when you want to preserve the original order. */ If ( _N_ eq 1 ) then do ; Declare hash h() ; h.defineKey( "&allVars" ) ; h.defineDone() ; End ; Set new ; If not h.add() then output ; Run ;

Proc sort data=new out=deduped3 noDupKey ; /* Deduplicate when you want the output data sorted. */ By _all_ ; Run ; /*--------------------------------------------------------------------*/

The log: (I ran it twice; this is the second run.) ------------------------------------------------------------------------ 52 data test; 53 infile cards missover; 54 input var1 $ var2 $ var3 $ var4 var5 $; 55 cards;

NOTE: The data set WORK.TEST has 5 observations and 5 variables. NOTE: DATA statement used (Total process time): real time 0.01 seconds cpu time 0.02 seconds

61 ; 62 /* Let's make our dataset much larger so that we can compare 63 performance. */ 64 Data new ; 65 Set test ; 66 Do i=1 to 100000 ; Drop i ; 67 Output ; 68 End ; 69 Run ;

NOTE: There were 5 observations read from the data set WORK.TEST. NOTE: The data set WORK.NEW has 500000 observations and 5 variables. NOTE: DATA statement used (Total process time): real time 2.97 seconds cpu time 0.80 seconds

70 71 /* Get a quoted, comma delimited list of the variables sans 72 opening and closing quotes. */ 73 Proc contents data=new out=vars noPrint ; 74 Run ;

NOTE: The data set WORK.VARS has 5 observations and 40 variables. NOTE: PROCEDURE CONTENTS used (Total process time): real time 0.39 seconds cpu time 0.03 seconds

75 Proc sql noPrint ; 76 Select name into :allVars separated by '","' from vars ; 77 Quit ; NOTE: PROCEDURE SQL used (Total process time): real time 0.00 seconds cpu time 0.01 seconds

78 79 Data _null_ ; 80 /* Deduplicate when you don't care about the order of the output. */ 81 Declare hash h( dataset:'new' ) ; 82 h.defineKey( "&allVars" ) ; 83 h.defineDone() ; 84 h.output( dataset:'deduped1' ) ; 85 Set new( obs=0 ) ; 86 Run ;

NOTE: There were 500000 observations read from the data set WORK.NEW. NOTE: The data set WORK.DEDUPED1 has 4 observations and 5 variables. NOTE: There were 0 observations read from the data set WORK.NEW. NOTE: DATA statement used (Total process time): real time 0.77 seconds cpu time 0.64 seconds

87 88 Data deduped2 ; 89 /* Deduplciate when you want to preserve the original order. */ 90 If ( _N_ eq 1 ) then do ; 91 Declare hash h() ; 92 h.defineKey( "&allVars" ) ; 93 h.defineDone() ; 94 End ; 95 Set new ; 96 If not h.add() then output ; 97 Run ;

NOTE: There were 500000 observations read from the data set WORK.NEW. NOTE: The data set WORK.DEDUPED2 has 4 observations and 5 variables. NOTE: DATA statement used (Total process time): real time 0.90 seconds cpu time 0.86 seconds

98 99 Proc sort data=new out=deduped3 noDupKey ; 100 /* Deduplicate when you want the output data sorted. */ 101 By _all_ ; 102 Run ;

NOTE: There were 500000 observations read from the data set WORK.NEW. NOTE: SAS sort was used. NOTE: 499996 observations with duplicate key values were deleted. NOTE: The data set WORK.DEDUPED3 has 4 observations and 5 variables. NOTE: PROCEDURE SORT used (Total process time): real time 1.54 seconds cpu time 1.51 seconds ------------------------------------------------------------------------

Ed

Edward Heaton, Senior Systems Analyst, Westat (An Employee-Owned Research Corporation), 1650 Research Boulevard, RW-4541, Rockville, MD 20850-3195 Voice: (301) 610-4818 Fax: (301) 294-3879 mailto:EdHeaton@Westat.com http://www.Westat.com

-----Original Message----- From: owner-sas-l@listserv.uga.edu [mailto:owner-sas-l@listserv.uga.edu] On Behalf Of Jack Clark Sent: Wednesday, March 28, 2007 7:40 AM To: 'Johnson, David'; SAS-L@LISTSERV.UGA.EDU Subject: RE: Re: dropping duplicates

David,

I agree with your example related to the NODUP option, but not with the NODUPKEY option. If you use the 5 sample observations you provided, and sort on only the first 3 columns with the NODUPKEY option, you will end up with only 2 observations in the output data set.

data test; infile cards missover; input var1 $ var2 $ var3 $ var4 var5 $; cards; A A A 1 X A A A 2 R A A B 3 T A A B 4 M A A B 3 T ; run;

proc sort data = test nodupkey out = testsrt; by var1 var2 var3; run;

proc print data = testsrt; run;

Obs var1 var2 var3 var4 var5

1 A A A 1 X 2 A A B 3 T

Jack Clark Research Analyst Center for Health Program Development and Management University of Maryland, Baltimore County

-----Original Message----- From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of Johnson, David Sent: Tuesday, March 27, 2007 10:31 PM To: SAS-L@LISTSERV.UGA.EDU Subject: Re: dropping duplicates

Sorting by the entire data vector also overcomes an issue I was about to remark on from a previous post.

If you use NoDups or NoDupKey on the following data but select only the first three columns, you will still have a logical duplicate.

A A A 1 X A A A 2 R A A B 3 T A A B 4 M A A B 3 T

Since the order that is being asserted is on the first three columns, the fourth will not figure in the sequence. Therefore, the fifth record will still be the fifth, and the third will still be the third. Only by adding the fourth column will the fifth record be moved into the fourth table row, and be identified as a duplicate of the record in the third table row.

Sorting by _ALL_ columns will achieve the same effect, however, I should also caution that the hash table built to sequence the data will be very large and may have implications for the availability of work space, and the time to complete the sort. A better solution is to know and understand the data, and set up the By clause to use the first four columns for this data.

Kind regards

David

The Moving Finger writes; and, having writ, Moves on: nor all your Piety nor Wit Shall lure it back to cancel half a Line, Nor all your Tears wash out a Word of it

Omar Khayyam (trans: Edward Fitzgerald)

-----Original Message----- From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of Paula Sims Sent: Wednesday, 28 March 2007 11:16 AM To: SAS-L@LISTSERV.UGA.EDU Subject: Re: dropping duplicates

You have received many good suggestions regarding NODUPKEY but each required listing the variables. What I found (which many probably know) is that if you don't want to list all the variables 'cuz there are so many, the following code works:

Proc sort data = first nodupkey; by _all_; run;

Also works for _numeric_ and _character_. By doing this you're guaranteeing no dup records and order (for this purpose) doesn't matter.

Paula

ram.sasha@gmail.com wrote: > I have a dataset with duplicates across 4 variables. > have: > id var1 var2 var3 > 1 25 18 11 > 1 25 18 11 > 1 14 16 12 > 2 21 14 11 > 2 21 14 11 > 3 18 19 12 > 3 15 11 13 > 4 11 18 12 > 4 11 18 12 > > I want to drop the duplicates which are identical in all the 4 > variables. > > want: > id var1 var2 var3 > 1 25 18 11 > 1 14 16 12 > 2 21 14 11 > 3 18 19 12 > 3 15 11 13 > 4 11 18 12 > > Thanks >

************** IMPORTANT MESSAGE ***************************** This e-mail message is intended only for the addressee(s) and contains information which may be confidential. If you are not the intended recipient please advise the sender by return email, do not use or disclose the contents, and delete the message and any attachments from your system. Unless specifically indicated, this email does not constitute formal advice or commitment by the sender or the Commonwealth Bank of Australia (ABN 48 123 123 124) or its subsidiaries. We can be contacted through our web site: commbank.com.au. If you no longer wish to receive commercial electronic messages from us, please reply to this e-mail by typing Unsubscribe in the subject line. **************************************************************


Back to: Top of message | Previous page | Main SAS-L page