Date: Thu, 17 Nov 2011 19:55:41 -0500
Reply-To: bbser 2009 <bbser2009@GMAIL.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: bbser 2009 <bbser2009@GMAIL.COM>
Subject: Re: NODUPRECS, the proc sort option
In-Reply-To: <A83AA17321E97F4C908B780964488165F98B6F@SN2PRD0402MB110.namprd04.prod.outlook.com>
Content-Type: text/plain; charset="us-ascii"
Thanks to the contributors to this thread, I now understand the difference:
1. Both NODUPKEY and NODUPRECS sort the observations on the values of key
variables.
2. Both of them compare the observation about to be written to the output
data set only with the observation most recently written to the output
data set.
3. The difference lies in the observation-deletion criterion: NODUPKEY
discards an observation when the two observations mentioned above have the
same values of the key variables, while NODUPRECS discards it only when the
two observations have the same values across all the variables.
Since different observations could have the same values of key variables,
after applying NODUPRECS we may still get something like below in the output
data set:
A 1 3
A 1 2
A 1 3
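A minimal sketch that reproduces this situation. The data set name demo, the variable names k1, k2 and x, and the output names are made up for illustration; k1 and k2 play the role of the key variables. With the default EQUALS behavior I would expect by_recs to keep all three rows and by_key to keep only the first one, but please check it against your own log.

```sas
/* Hypothetical data: k1 and k2 are the key variables, x is not. */
data demo;
  input k1 $ k2 x;
  datalines;
A 1 3
A 1 2
A 1 3
;
run;

/* NODUPRECS compares all variables, but only against the
   observation most recently written to the output data set. */
proc sort data=demo out=by_recs noduprecs;
  by k1 k2;
run;

/* NODUPKEY compares only the BY variables k1 and k2. */
proc sort data=demo out=by_key nodupkey;
  by k1 k2;
run;

proc print data=by_recs; run;
proc print data=by_key; run;
```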
I suspect that NODUPRECS + NOEQUALS = NODUPKEY; I will start a new thread to
ask for opinions on this.
Regards, Max
(Maaxx)
-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of Zdeb,
Michael S
Sent: November-17-11 3:41 PM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: [SAS-L] NODUPRECS, the proc sort option
hi ... I don't think that anyone has mentioned that NODUPRECS only gets rid
of all duplicate observations if you sort by all the variables in a data set.
That has always made me wonder why it's needed, since NODUPKEY sorting by all
variables does the same thing (and always seems to do that task a lot
faster).
data lots;
  set sashelp.class;
  do _n_=1 to ceil(1e6*ranuni(999));
    output;
  end;
run;
NOTE: There were 19 observations read from the data set SASHELP.CLASS.
NOTE: The data set WORK.LOTS has 8124255 observations and 5 variables.
NOTE: DATA statement used (Total process time):
      real time 21.17 seconds
      cpu time 2.39 seconds

proc sort data=lots out=new nodupkey;
  by _all_;
run;
NOTE: There were 8124255 observations read from the data set WORK.LOTS.
NOTE: 8124236 observations with duplicate key values were deleted.
NOTE: The data set WORK.NEW has 19 observations and 5 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time 3.34 seconds
      cpu time 5.70 seconds

proc sort data=lots out=new noduprecs;
  by _all_;
run;
NOTE: There were 8124255 observations read from the data set WORK.LOTS.
NOTE: 8124236 duplicate observations were deleted.
NOTE: The data set WORK.NEW has 19 observations and 5 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time 23.84 seconds
      cpu time 9.79 seconds
Mike Zdeb
U@Albany School of Public Health
One University Place (Room 119)
Rensselaer, New York 12144-3456
P/518-402-6479 F/630-604-1475
________________________________________
From: SAS(r) Discussion [SAS-L@LISTSERV.UGA.EDU] on behalf of Bian, Haikuo
[HBian@FLQIO.SDPS.ORG]
Sent: Thursday, November 17, 2011 8:09 AM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: NODUPRECS, the proc sort option
From SAS help:
NODUPRECS
checks for and eliminates duplicate observations. If you specify this
option, then PROC SORT compares all variable values for each observation to
the ones for the "previous observation" that was written to the output data
set. If an exact match is found, then the observation is not written to the
output data set.
Please note the quoted term "previous observation", which means the
duplicated records have to be consecutive after sorting to be removed.
Please see a simple example:
*sample1;
data test;
input a b c;
cards;
1 2 3
1 2 4
1 2 3
;
proc sort data=test out=ndup nodup;
by a b ;
run;
proc print;run;
The first and third records are duplicates, yet neither of them has been
removed by the 'nodup' option, because they are not consecutive. Only when
the 'by' statement includes all of the variable names will 'nodup' be
equivalent to 'nodupkey'.
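The case where the BY statement covers every variable can be sketched on the same three rows as sample1. The output name ndup_all is made up for illustration; BY _ALL_ is the shorthand used elsewhere in this thread. Since sorting on all variables places identical observations next to each other, the output here should contain two observations (1 2 3 and 1 2 4).

```sas
/* Same three rows as sample1. */
data test;
  input a b c;
  cards;
1 2 3
1 2 4
1 2 3
;
run;

/* BY _ALL_ sorts on every variable, so identical observations
   become adjacent and the previous-observation check catches them. */
proc sort data=test out=ndup_all noduprecs;
  by _all_;
run;

proc print data=ndup_all; run;
```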
*sample2;
data test;
input a b c;
cards;
1 2 3
1 2 3
1 2 4
;
proc sort data=test out=ndup nodup;
by a b;
run;
proc print;run;
If they are consecutive, the duplicated record will be removed.
Regards,
Haikuo
-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of bbser
2009
Sent: Tuesday, November 15, 2011 11:37 PM
To: SAS-L@LISTSERV.UGA.EDU
Subject: NODUPRECS, the proc sort option
Hi there,
I was wondering if someone happens to know a meaningful application of this
PROC SORT option NODUPRECS.
Thanks a lot.
Regards, Max
(Maaxx)