Date: Wed, 26 Feb 1997 08:50:12 CST
Reply-To: Undetermined origin c/o LISTSERV maintainer <owner-LISTSERV@AKH-WIEN.AC.AT>
Sender: "SAS(r) Discussion" <SAS-L@UGA.CC.UGA.EDU>
From: Undetermined origin c/o LISTSERV maintainer <owner-LISTSERV@AKH-WIEN.AC.AT>
Subject: Re: Losing Duplicates Inconsistently (PROC SORT)
This is about the most annoying "feature" of SAS (other than not being
able to output the duplicates found in a proc sort to another file).
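NODUP only compares each observation with the one written just before
it, so identical records are dropped only when the sort happens to put
them next to each other. The way to deal with this is to sort the data
set using EVERY VARIABLE as the key. You can use the NODUP or NODUPKEY
option when you do this, and you'll lose all of your duplicates...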
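
A minimal sketch of the adjacency effect and the workaround, on a small
made-up data set. The data set names DEMO, BYAGENCY and BYALL and the
three data rows are invented for illustration; only the variable names
come from the original post. With the default sort behavior, the first
sort should delete nothing and the second should delete one observation,
but treat that as an expectation to check in your own log rather than a
guaranteed result.

```sas
/* Toy data, invented for illustration: rows 1 and 3 are identical. */
data demo;
  input agency $ copid;
  datalines;
A 1
A 2
A 1
;
run;

/* BY agency: all three rows share one BY value, so they are expected  */
/* to keep their input order (1, 2, 1). The identical rows are then    */
/* not adjacent, and NODUP compares each row only with the one         */
/* written just before it.                                             */
proc sort data=demo out=byagency nodup;
  by agency;
run;

/* BY every variable: identical rows necessarily sort next to each     */
/* other, so NODUP can see them.                                       */
proc sort data=demo out=byall nodup;
  by agency copid;
run;
```

Applied to the data set in the question, the same idea would look roughly
like this. OFFDEDUP is a hypothetical output name, used so the original
OFFMASTR is not overwritten:

```sas
/* All four kept variables appear in the BY list. */
proc sort data=offmastr out=offdedup nodup;
  by copid offname offnum agency;
run;
```

Later SAS releases accept BY _ALL_ as shorthand for listing every
variable; this sketch does not assume that 6.12 does.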
But hey, at least you're not using Foxpro or something like that...
Bruce Johnson
bjohnson@sachs.com
______________________________ Reply Separator _________________________________
Subject: Losing Duplicates Inconsistently (PROC SORT)
Author: Dan Keating <dtkeats@ibm.net> at Internet
Date: 2/25/97 5:35 PM
I am combining two datasets and then
running proc sort with "nodup" to eliminate
duplicates.
When I sort by one variable, I lose more
than a third of the records. When I sort by a
different variable, though, I don't lose any. If
I were using "nodupkey" this would make sense.
But I'm not.
(Running v6.12 on OS/2 v4.0.)
Here's the code and responses from the
output:
*******************************************
version 1 -- duplicates are eliminated:
*******************************************
649 data offmastr (keep=copid offname offnum agency);
650 set witness.officer witness.officerj;
651 run;
NOTE: The data set WORK.OFFMASTR has 12581 observations and 4 variables.
NOTE: The DATA statement used 1.56 seconds.
652
653 proc sort data=offmastr nodup;
654 by copid;
655 run;
NOTE: 4828 duplicate observations were deleted.
NOTE: The data set WORK.OFFMASTR has 7753 observations and 4 variables.
NOTE: The PROCEDURE SORT used 1.0 seconds.
***********************************
version 2 -- no duplicates eliminated:
************************************
619 data offmastr (keep=copid offname offnum agency);
620 set witness.officer witness.officerj;
621 run;
NOTE: The data set WORK.OFFMASTR has 12581 observations and 4 variables.
NOTE: The DATA statement used 1.56 seconds.
622
623 proc sort data=offmastr nodup;
624 by agency;
625 run;
NOTE: 0 duplicate observations were deleted.
NOTE: The data set WORK.OFFMASTR has 12581 observations and 4 variables.
NOTE: The PROCEDURE SORT used 1.35 seconds.
*****************************************
end of sample code
****************************************
I've rerun this several times. I've even
run the two sorts together as follows:
*************************************
version 3 -- two sorts together, one eliminates,
other doesn't:
**************************************
682
683 data offmastr (keep=copid offname offnum agency);
684 set witness.officer witness.officerj;
685 run;
NOTE: The data set WORK.OFFMASTR has 12581 observations and 4 variables.
NOTE: The DATA statement used 1.81 seconds.
686
687 proc sort data=offmastr nodup;
688 by agency;
689 run;
NOTE: 0 duplicate observations were deleted.
NOTE: The data set WORK.OFFMASTR has 12581 observations and 4 variables.
NOTE: The PROCEDURE SORT used 1.34 seconds.
690
691 proc sort data=offmastr nodup;
692 by copid;
693 run;
NOTE: 4828 duplicate observations were deleted.
NOTE: The data set WORK.OFFMASTR has 7753 observations and 4 variables.
NOTE: The PROCEDURE SORT used 1.12 seconds.
*****************************************
end of sample code
*****************************************
I apologize for the length of this
posting, but I'm trying to document what's
happening in hopes that someone can see what I'm
missing.
Any help greatly appreciated.
Dan
Dan Keating
Miami Herald
(305) 376-3476 -- phone
(305) 376-5287 -- fax
dtkeats@ibm.net