| Date: | Fri, 30 Mar 2007 12:14:13 -0400 |
| Reply-To: | Ed Heaton <EdHeaton@WESTAT.COM> |
| Sender: | "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU> |
| From: | Ed Heaton <EdHeaton@WESTAT.COM> |
| Subject: | Re: dropping duplicates |
|
| In-Reply-To: | <456B52C41B724C41B96561D7AD283E7D0520F1@mail.chpdm.umbc.edu> |
| Content-Type: | text/plain; charset="us-ascii" |
Some more methods of removing duplicates...
/*--------------------------------------------------------------------*/
data test;
infile cards missover;
input var1 $ var2 $ var3 $ var4 var5 $;
cards;
A A A 1 X
A A A 2 R
A A B 3 T
A A B 4 M
A A B 3 T
;
/* Let's make our dataset much larger so that we can compare
performance. */
Data new ;
Set test ;
Do i=1 to 100000 ; Drop i ;
Output ;
End ;
Run ;
/* Get a quoted, comma delimited list of the variables -- sans
opening and closing quotes. */
Proc contents data=new out=vars noPrint ;
Run ;
Proc sql noPrint ;
Select name into :allVars separated by '","' from vars ;
Quit ;
Data _null_ ;
/* Deduplicate when you don't care about the order of the output. */
Declare hash h( dataset:'new' ) ;
h.defineKey( "&allVars" ) ;
h.defineDone() ;
h.output( dataset:'deduped1' ) ;
Set new( obs=0 ) ;
Run ;
Data deduped2 ;
/* Deduplciate when you want to preserve the original order. */
If ( _N_ eq 1 ) then do ;
Declare hash h() ;
h.defineKey( "&allVars" ) ;
h.defineDone() ;
End ;
Set new ;
If not h.add() then output ;
Run ;
Proc sort data=new out=deduped3 noDupKey ;
/* Deduplicate when you want the output data sorted. */
By _all_ ;
Run ;
/*--------------------------------------------------------------------*/
The log: (I ran it twice; this is the second run.)
------------------------------------------------------------------------
52 data test;
53 infile cards missover;
54 input var1 $ var2 $ var3 $ var4 var5 $;
55 cards;
NOTE: The data set WORK.TEST has 5 observations and 5 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
cpu time 0.02 seconds
61 ;
62 /* Let's make our dataset much larger so that we can compare
63 performance. */
64 Data new ;
65 Set test ;
66 Do i=1 to 100000 ; Drop i ;
67 Output ;
68 End ;
69 Run ;
NOTE: There were 5 observations read from the data set WORK.TEST.
NOTE: The data set WORK.NEW has 500000 observations and 5 variables.
NOTE: DATA statement used (Total process time):
real time 2.97 seconds
cpu time 0.80 seconds
70
71 /* Get a quoted, comma delimited list of the variables sans
72 opening and closing quotes. */
73 Proc contents data=new out=vars noPrint ;
74 Run ;
NOTE: The data set WORK.VARS has 5 observations and 40 variables.
NOTE: PROCEDURE CONTENTS used (Total process time):
real time 0.39 seconds
cpu time 0.03 seconds
75 Proc sql noPrint ;
76 Select name into :allVars separated by '","' from vars ;
77 Quit ;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.00 seconds
cpu time 0.01 seconds
78
79 Data _null_ ;
80 /* Deduplicate when you don't care about the order of the output.
*/
81 Declare hash h( dataset:'new' ) ;
82 h.defineKey( "&allVars" ) ;
83 h.defineDone() ;
84 h.output( dataset:'deduped1' ) ;
85 Set new( obs=0 ) ;
86 Run ;
NOTE: There were 500000 observations read from the data set WORK.NEW.
NOTE: The data set WORK.DEDUPED1 has 4 observations and 5 variables.
NOTE: There were 0 observations read from the data set WORK.NEW.
NOTE: DATA statement used (Total process time):
real time 0.77 seconds
cpu time 0.64 seconds
87
88 Data deduped2 ;
89 /* Deduplciate when you want to preserve the original order. */
90 If ( _N_ eq 1 ) then do ;
91 Declare hash h() ;
92 h.defineKey( "&allVars" ) ;
93 h.defineDone() ;
94 End ;
95 Set new ;
96 If not h.add() then output ;
97 Run ;
NOTE: There were 500000 observations read from the data set WORK.NEW.
NOTE: The data set WORK.DEDUPED2 has 4 observations and 5 variables.
NOTE: DATA statement used (Total process time):
real time 0.90 seconds
cpu time 0.86 seconds
98
99 Proc sort data=new out=deduped3 noDupKey ;
100 /* Deduplicate when you want the output data sorted. */
101 By _all_ ;
102 Run ;
NOTE: There were 500000 observations read from the data set WORK.NEW.
NOTE: SAS sort was used.
NOTE: 499996 observations with duplicate key values were deleted.
NOTE: The data set WORK.DEDUPED3 has 4 observations and 5 variables.
NOTE: PROCEDURE SORT used (Total process time):
real time 1.54 seconds
cpu time 1.51 seconds
------------------------------------------------------------------------
Ed
Edward Heaton, Senior Systems Analyst,
Westat (An Employee-Owned Research Corporation),
1650 Research Boulevard, RW-4541, Rockville, MD 20850-3195
Voice: (301) 610-4818 Fax: (301) 294-3879
mailto:EdHeaton@Westat.com http://www.Westat.com
-----Original Message-----
From: owner-sas-l@listserv.uga.edu [mailto:owner-sas-l@listserv.uga.edu]
On Behalf Of Jack Clark
Sent: Wednesday, March 28, 2007 7:40 AM
To: 'Johnson, David'; SAS-L@LISTSERV.UGA.EDU
Subject: RE: Re: dropping duplicates
David,
I agree with your example related to the NODUP option, but not with the
NODUPKEY option. If you use the 5 sample observations you provided, and
sort on only the first 3 columns with the NODUPKEY option, you will end
up
with only 2 observations in the output data set.
data test;
infile cards missover;
input var1 $ var2 $ var3 $ var4 var5 $;
cards;
A A A 1 X
A A A 2 R
A A B 3 T
A A B 4 M
A A B 3 T
;
run;
proc sort data = test nodupkey out = testsrt;
by var1 var2 var3;
run;
proc print data = testsrt;
run;
Obs var1 var2 var3 var4 var5
1 A A A 1 X
2 A A B 3 T
Jack Clark
Research Analyst
Center for Health Program Development and Management
University of Maryland, Baltimore County
-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of
Johnson, David
Sent: Tuesday, March 27, 2007 10:31 PM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: dropping duplicates
Sorting by the entire data vector also overcomes an issue I was about to
remark on from a previous post.
If you use NoDups or NoDupKey on the following data but select only
the first three columns, you will still have a logical duplicate.
A A A 1 X
A A A 2 R
A A B 3 T
A A B 4 M
A A B 3 T
Since the order that is being asserted is on the first three columns,
the fourth will not figure in the sequence. Therefore, the fifth record
will still be the fifth, and the third will still be the third. Only by
adding the fourth column will the fifth record be moved into the fourth
table row, and be identified as a duplicate of the record in the third
table row.
Sorting by _ALL_ columns will achieve the same effect, however, I should
also caution that the hash table built to sequence the data will be very
large and may have implications for the availability of work space, and
the time to complete the sort. A better solution is to know and
understand the data, and set up the By clause to use the first four
columns for this data.
Kind regards
David
The Moving Finger writes; and, having writ,
Moves on: nor all your Piety nor Wit
Shall lure it back to cancel half a Line,
Nor all your Tears wash out a Word of it
Omar Khayyam (trans: Edward Fitzgerald)
-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of
Paula Sims
Sent: Wednesday, 28 March 2007 11:16 AM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: dropping duplicates
You have received many good suggestions regarding NODUPKEY but each
required listing the variables. What I found (which many probably know)
is that if you don't want to list all the variables 'cuz there are so
many, the following code works:
Proc sort data = first nodupkey; by _all_; run;
Also works for _numeric_ and _character_. By doing this you're
guaranteeing no dup records and order (for this purpose) doesn't matter.
Paula
ram.sasha@gmail.com wrote:
> I have a dataset with duplicates across 4 variables.
> have:
> id var1 var2 var3
> 1 25 18 11
> 1 25 18 11
> 1 14 16 12
> 2 21 14 11
> 2 21 14 11
> 3 18 19 12
> 3 15 11 13
> 4 11 18 12
> 4 11 18 12
>
> I want to drop the duplicates which are identical in all the 4
> variables.
>
> want:
> id var1 var2 var3
> 1 25 18 11
> 1 14 16 12
> 2 21 14 11
> 3 18 19 12
> 3 15 11 13
> 4 11 18 12
>
> Thanks
>
************** IMPORTANT MESSAGE *****************************
This e-mail message is intended only for the addressee(s) and contains
information which may be
confidential.
If you are not the intended recipient please advise the sender by return
email, do not use or
disclose the contents, and delete the message and any attachments from
your
system. Unless
specifically indicated, this email does not constitute formal advice or
commitment by the sender
or the Commonwealth Bank of Australia (ABN 48 123 123 124) or its
subsidiaries.
We can be contacted through our web site: commbank.com.au.
If you no longer wish to receive commercial electronic messages from us,
please reply to this
e-mail by typing Unsubscribe in the subject line.
**************************************************************
|