LISTSERV at the University of Georgia
Date:         Thu, 17 Nov 2011 19:55:41 -0500
Reply-To:     bbser 2009 <bbser2009@GMAIL.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         bbser 2009 <bbser2009@GMAIL.COM>
Subject:      Re: NODUPRECS, the proc sort option
Comments: To: "Zdeb, Michael S" <mzdeb@ALBANY.EDU>
In-Reply-To:  <A83AA17321E97F4C908B780964488165F98B6F@SN2PRD0402MB110.namprd04.prod.outlook.com>
Content-Type: text/plain; charset="us-ascii"

Thanks to the contributors to this thread, I now understand the difference:

1. Both NODUPKEY and NODUPRECS sort the observations on the values of the key variables.
2. Both compare only the observation about to be written to the output data set with the observation most recently written to it.
3. The difference lies in the deletion criterion: NODUPKEY discards an observation when the two observations mentioned above have the same values of the key variables, while NODUPRECS discards it only when the two have the same values across all variables.

Since different observations can share the same key values, NODUPRECS may still leave something like the following in the output data set:

A 1 3
A 1 2
A 1 3
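
A minimal sketch of point 3, assuming the three records above are sorted on their first two columns. The data set names (demo, by_key, by_rec) and variable names (id, x, y) are made up for illustration and do not come from the thread. Going by the description above, one would expect by_key to keep a single observation and by_rec to keep all three, provided the sort preserves the input order within the BY group.

```sas
/* Hypothetical data set and variable names, for illustration only */
data demo;
  input id $ x y;
  datalines;
A 1 3
A 1 2
A 1 3
;
run;

/* NODUPKEY: only the BY variables (id, x) are compared */
proc sort data=demo out=by_key nodupkey;
  by id x;
run;

/* NODUPRECS: all variables are compared, but only against the
   observation most recently written to the output data set */
proc sort data=demo out=by_rec noduprecs;
  by id x;
run;

proc print data=by_key; run;
proc print data=by_rec; run;
```

The SAS log notes for each PROC SORT step report how many observations were deleted, which is a quick way to check the result without printing.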

I kind of think NODUPRECS + NOEQUALS = NODUPKEY; I will start a new thread to ask for opinions on this.

Regards, Max (Maaxx)

-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of Zdeb, Michael S
Sent: November-17-11 3:41 PM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: [SAS-L] NODUPRECS, the proc sort option

hi ... I don't think that anyone has mentioned that NODUPRECS only gets rid of all duplicate observations if you sort by all the variables in a data set

that has always made me wonder why it's needed since NODUPKEY sorting by all variables does the same thing (and always seems to do that task a lot faster)

data lots;
  set sashelp.class;
  do _n_=1 to ceil(1e6*ranuni(999));
    output;
  end;
run;

NOTE: There were 19 observations read from the data set SASHELP.CLASS.
NOTE: The data set WORK.LOTS has 8124255 observations and 5 variables.
NOTE: DATA statement used (Total process time):
      real time           21.17 seconds
      cpu time            2.39 seconds

proc sort data=lots out=new nodupkey;
  by _all_;
run;

NOTE: There were 8124255 observations read from the data set WORK.LOTS.
NOTE: 8124236 observations with duplicate key values were deleted.
NOTE: The data set WORK.NEW has 19 observations and 5 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           3.34 seconds
      cpu time            5.70 seconds

proc sort data=lots out=new noduprecs;
  by _all_;
run;

NOTE: There were 8124255 observations read from the data set WORK.LOTS.
NOTE: 8124236 duplicate observations were deleted.
NOTE: The data set WORK.NEW has 19 observations and 5 variables.
NOTE: PROCEDURE SORT used (Total process time):
      real time           23.84 seconds
      cpu time            9.79 seconds

Mike Zdeb
U@Albany School of Public Health
One University Place (Room 119)
Rensselaer, New York 12144-3456
P/518-402-6479 F/630-604-1475

________________________________________
From: SAS(r) Discussion [SAS-L@LISTSERV.UGA.EDU] on behalf of Bian, Haikuo [HBian@FLQIO.SDPS.ORG]
Sent: Thursday, November 17, 2011 8:09 AM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: NODUPRECS, the proc sort option

From SAS help:

NODUPRECS checks for and eliminates duplicate observations. If you specify this option, then PROC SORT compares all variable values for each observation to the ones for the "previous observation" that was written to the output data set. If an exact match is found, then the observation is not written to the output data set.

Please note the quoted term "previous observation", which means that duplicate records have to be consecutive to be removed. Please see a simple example:

*sample1;
data test;
input a b c;
cards;
1 2 3
1 2 4
1 2 3
;

proc sort data=test out=ndup nodup; by a b ; run;

proc print;run;

The first and third records are duplicates, yet neither of them has been removed by the 'nodup' option, because they are not consecutive. Only when the 'by' statement includes all of the variable names does 'nodup' become equivalent to 'nodupkey'.
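
A sketch of the point about the 'by' statement, assuming the sample1 data from above. The output data set name ndup_all is made up here. With every variable in the 'by' statement, the two 1 2 3 records should end up next to each other after sorting, so 'nodup' has a chance to catch the second one.

```sas
/* Same records as sample1 */
data test;
  input a b c;
  cards;
1 2 3
1 2 4
1 2 3
;
run;

/* All variables in the BY statement, so identical records sort together */
proc sort data=test out=ndup_all nodup;
  by a b c;
run;

proc print data=ndup_all; run;
```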

*sample2;
data test;
input a b c;
cards;
1 2 3
1 2 3
1 2 4
;

proc sort data=test out=ndup nodup; by a b; run;

proc print;run;

If the duplicate records are consecutive, the second one is removed.

Regards, Haikuo

-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of bbser 2009
Sent: Tuesday, November 15, 2011 11:37 PM
To: SAS-L@LISTSERV.UGA.EDU
Subject: NODUPRECS, the proc sort option

Hi there,

I was wondering if someone happens to know a meaningful application of this PROC SORT option NODUPRECS. Thanks a lot.

Regards, Max (Maaxx)
