Date: Fri, 9 Jul 2010 13:01:59 -0500
Reply-To: Kevin Myers <KevinMyers@AUSTIN.RR.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Kevin Myers <KevinMyers@AUSTIN.RR.COM>
Subject: Re: Where Clause on Sorted Dataset (was Re: The expensive INTNX
function (was: Pass date values))
Content-Type: text/plain; format=flowed; charset="iso-8859-1";
reply-type=original
Paul -
Actually it's quite easy to verify sorted-ness during the binary search,
which also prevents any potential infinite looping. This approach does not
*guarantee* that the data is actually 100% sorted, but is adequate in my
opinion. There are other areas where SAS empoys a similar level of "trust"
in the SORTEDBY metadata (as exhibited by your SORT example), though I
believe there may be some options for controlling this behavior.
Finally, the SORTEDBY metadata is always correct if SAS software put it
there automatically. The only case where it's wrong is if the user forces
the SORTEDBY attribute onto a dataset that isn't actually sorted (as in your
example). And really, if they did that, then that's their problem, and that
shouldn't hold the rest of us back from a substantially beneficial
performance enhancement.
s/KAM
----- Original Message -----
From: "Paul Dorfman" <sashole@BELLSOUTH.NET>
To: <SAS-L@LISTSERV.UGA.EDU>; "Kevin A. Myers" <KevinMyers@AUSTIN.RR.COM>
Sent: Friday, July 09, 2010 12:36
Subject: Re: Where Clause on Sorted Dataset (was Re: The expensive INTNX
function (was: Pass date values))
> On Thu, 8 Jul 2010 12:23:10 -0500, Kevin Myers <KevinMyers@AUSTIN.RR.COM>
> wrote:
>
>>In the case of a sorted data set, I would *hope* that SAS would search for
>>the first matching record by using a *binary*, rather than sequential,
>>search. If that isn't the case, then someone needs to suggest using a
>>binary search to locate the first matching observation as a performance
>>enhancement.
>
> Kevin,
>
> One problem with this is that the only indication that a data set *might*
> be
> sorted is SORTEDBY=, but its being set in no way guarantees that the file
> is
> actually physically ordered by the key(s) listed:
>
> data test (sortedby = key) ;
> do key = 9, 1, 3, 2, 8, 5 ;
> output ;
> end ;
> run ;
>
> Look at the table TEST's properties, and the metadata plainly tells you
> the
> data set is sorted by KEY. (It reminds me of the old Bob Virgile's puzzle,
> where a file similar to the above was explicitly sorted and the ensuing
> step
> whose BY statement relied on the sort order would fail because the
> SORTEDBY=
> fooled SORT used without FORCE option into doing nothing:
>
> 8 proc sort data = test ;
> 9 by key ;
> 10 run ;
> NOTE: Input data set is already sorted, no sorting done.)
>
> So if SAS were to infer from SORTEDBY= that the file is indeed sorted and
> therefore "binary search" it, the search algorithm will either (a) enter
> an
> infinite loop, in which case you have at least some indication of the
> trouble, albeit at an ugly expense, or (b) finish perfectly well - in
> which
> case it will produce an erroneous result giving no token whatsoever that
> something wrong has happened and which is much more dangerous.
>
> Of course the underlying software can easily check beforehand if the file
> is
> in fact physically in order, but this will be tantamount to a whole pass
> through the file, thereby defeating the reason for using the binary search
> in the first place.
>
> The only way out of the SORTEDBY= trap would be for SAS to create a system
> option banning SORTEDBY from being set by anything but a successfully
> executed SORT or ORDER BY. Then if SORTEDBY= were set the binary search
> against the keys listed under it could be used securely.
>
> Kind regards
> ------------
> Paul Dorfman
> Jax, FL
> ------------
>
|