Date: Mon, 7 Mar 2011 10:57:40 -0500
Reply-To: Arthur Tabachneck <art297@ROGERS.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Arthur Tabachneck <art297@ROGERS.COM>
Subject: Re: Proc Sort nodupkey for all columns
Content-Type: text/plain; charset=ISO-8859-1
Søren,
As I mentioned earlier in this thread, I totally agree with data_null_ that
using nodupkey and _all_ is a better and safer solution. I doubt if there
are many cases where the data is presorted and one could take advantage of
any benefit that might be gained from using nodup.
However, that said, I still think that there is a measurable difference
between the times required for the two approaches. Running the following
code, I found that twice as much cpu time was needed using the nodupkey and
_all_ method.
And, while my test only included 300,000 records and 101 variables,
increasing either the number of records or number of variables ten fold
could easily make the task measurable in minute or hours rather than
seconds.
data have (drop=j k);
array x(100);
do id = 1 to 100000;
do j=1 to 3;
do k=1 to 100;
x(k) = int(20 * ranuni(0)) + 1;
end;
output;
end;
end;
run;
proc sort data=have;
by id x1-x100;
run;
proc sort data=have out=want nodup;
by id;
run;
proc sort data=have out=want nodupkey;
by _all_;
run;
Art
-------
On Fri, 4 Mar 2011 05:08:23 -0500, S=?ISO-8859-1?Q?=C3=B8ren?= Lassen
<s.lassen@POST.TELE.DK> wrote:
>Art,
>Using SORT by _ALL_ should not normally take a couple of hours processing
>time more than sorting by a shorter key value.
>
>Most of the time used by the sort procedure is spent moving
>stuff around, not comparing data - with large tables, disk I/O takes
>much more time than comparing records.
>
>If the processing time difference is measured in hours, then my guess
>is that the total processing time for the sort step is measured in
>weeks or months.
>
>But it may depend on the sort algorithm.
>
>Regards,
>Søren
>
>On Thu, 3 Mar 2011 23:14:10 -0500, Arthur Tabachneck <art297@ROGERS.COM>
>wrote:
>
>>Mark,
>>
>>I can't agree with you more that one MUST understand exactly what is going
>>on with all of the possible options and their interactions.
>>
>>Like you, I personally almost always choose nodupkey. But, if nodup will
>do
>>what one needs, and can save a couple of hours of processing time, AND ONE
>>UNDERSTANDS THE EFFECT OF ALL OF THE OPTIONS THEY DID AND DIDN'T SELECT,
>>then I would agree to that choice.
>>
>>Art
>>-------
>>On Fri, 4 Mar 2011 01:33:45 +0000, Keintz, H. Mark
>><mkeintz@WHARTON.UPENN.EDU> wrote:
>>
>>>Art:
>>>
>>>I can think of no instance in which I would use NODUPREC instead of
>>NODUPKEY. NODUPREC does not adhere to what most people would think it
>>means. I'm sure you know this, but I'm not sure everyone does.
>>>
>>>Consider this data set:
>>>
>>>data have (drop=i);
>>> do key=3 to 2 by -1;
>>> do other=1,2,3,2,1;
>>> do i=1 to other;
>>> output;
>>> end;
>>> end;
>>> end;
>>>run;
>>>
>>>It has 6 unique records, some repeated, not always consecutively. (Let
>>OTHER represent some 300 variables).
>>>
>>>Here's what I think some people might think you are suggesting:
>>>
>>>proc sort data=have out=want noduprec;
>>> by key;
>>>run;
>>>
>>>
>>>WANT has 10 observations, not 6.
>>>
>>>Why? Because PROC SORT defaults to EQUALS, it preserves original order
of
>>all records with the same value of KEY. But since some identical records
>>are not consecutive within a key, they are not detected by NODUPREC.
>>>
>>>
>>>This is why I think data _null_ suggested _ALL_ but not noduprec:
>>>
>>>proc sort data=have out=want2 nodupkey;
>>> by key _all_;
>>>run;
>>>
>>>.. which yields 6 records.
>>>
>>>Yes, when every variable is a sort var (because of _ALL_ in the BY
>>statement), either noduprec or nodupkey will work, but IMO nodupkey is
>truer
>>to its na�ve interpretation than NODUPREC. And unless you know all your
>>identical records within a key will end up consecutive after sort, I can
>>think of no situation in which someone would benefit from using NODUPREC
>>instead of NODUPKEY/_ALL_.
>>>
>>>
>>>Regards,
>>>Mark
>>>
>>>
>>>-----Original Message-----
>>>From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of
>Arthur
>>Tabachneck
>>>Sent: Thursday, March 03, 2011 6:14 PM
>>>To: SAS-L@LISTSERV.UGA.EDU
>>>Subject: Re: Proc Sort nodupkey for all columns
>>>
>>>Mark,
>>>
>>>I agree and it should take less time to sort.
>>>
>>>Art
>>>-------
>>>On Thu, 3 Mar 2011 14:36:27 -0800, Mark Miller <mdhmiller@GMAIL.COM>
>wrote:
>>>
>>>>What about using NODUPREC ?
>>>>
>>>>... mark miller
>>>>
>>>>On 3/3/2011 1:31 PM, Irin later wrote:
>>>>> Thanks a lot!
>>>>>
>>>>> --- On Thu, 3/3/11, Data _null_;<iebupdte@gmail.com> wrote:
>>>>>
>>>>>
>>>>> From: Data _null_;<iebupdte@gmail.com>
>>>>> Subject: Re: Proc Sort nodupkey for all columns
>>>>> To: "Irin later"<irinfigvam@yahoo.com>
>>>>> Cc: SAS-L@listserv.uga.edu
>>>>> Date: Thursday, March 3, 2011, 4:24 PM
>>>>>
>>>>>
>>>>> _ALL_
>>>>>
>>>>> On Thu, Mar 3, 2011 at 3:18 PM, Irin later<irinfigvam@yahoo.com>
>wrote:
>>>>>> I do a proc sort nodupkey like below to get rid of exact dups:
>>>>>>
>>>>>> Proc sort nodupkey data= CombinedFlags out=ppro.CombinedFlagsnodup;
>>>>>> by field1 field 2 ....... field300; run;
>>>>>>
>>>>>> Since I need to get rid of exact duplicate records,I have to specify
>>>each field in the record as that is what I need : to delete dups records
>>with equal values in the same columns....
>>>>>>
>>>>>> Is there any way in syntax in order I could do it without specifying
>>>hundreds of fields?
>>>>>>
>>>>>> Thank you in advance!
>>>>>>
>>>>>>
>>>>>> Irin
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
|