Date: Fri, 9 Jul 2010 12:20:01 -0500
Reply-To: Kevin Myers <KevinMyers@AUSTIN.RR.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Kevin Myers <KevinMyers@AUSTIN.RR.COM>
Subject: Re: Where Clause on Sorted Dataset (was Re: The expensive INTNX
function (was: Pass date values))
Content-Type: text/plain; format=flowed; charset="iso-8859-1";
reply-type=original
Thanks for taking the time to look into that Mark. I'll try to find some
time to submit an enhancement proposal to improve on this situation.
Regards,
Kevin M.
----- Original Message -----
From: "Keintz, H. Mark" <mkeintz@wharton.upenn.edu>
To: "Kevin Myers" <KevinMyers@AUSTIN.RR.COM>; <SAS-L@LISTSERV.UGA.EDU>
Sent: Friday, July 09, 2010 10:43
Subject: RE: Where Clause on Sorted Dataset (was Re: The expensive INTNX
function (was: Pass date values))
Kevin:
I think what you hope for is not the case. If it were, then finding the
last 1,000,000 (out of 100,000,000) in a sorted dataset would take about as
long as finding the first 1,000,000.
In the test below both HAVE and HAVE_S are sorted, but "sortedby" is set
only in HAVE_S.
Case 1 (first 1MM) and Case 2 (last 1MM) both read HAVE and take about the
same time (about 3.4 seconds on our Linux machine). No surprise.
Case 3 (first 1MM from HAVE_S) takes about 0.1 seconds. So it's doing what
Toby described.
BUT ... case 4 (last 1MM from HAVE_S) takes about the same time as CASE 1 or
2 - i.e. there is not an intelligent search for the first qualifying record
from a sorted data set.
data have;
do i=1 to 1e8; output; end;
run;
proc sort data=have out=have_s;
by i;
run;
/** Case 1, first 1,000,000, with sortedby off **/
data _null_;
set have;
where i between 1 and 1e6;
run;
/** Case 2, last 1,000,000, with sortedby off **/
data _null_;
set have;
where i between 99e6 and 1e8;
run;
/** Case 3, first 1,000,000, with sortedby on **/
data _null_;
set have_s;
where i between 1 and 1e6;
run;
/** Case 4, last 1,000,000, with sortedby on **/
data _null_;
set have_s;
where i between 99e6 and 1e8;
run;
> -----Original Message-----
> From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of
> Kevin Myers
> Sent: Thursday, July 08, 2010 1:23 PM
> To: SAS-L@LISTSERV.UGA.EDU
> Subject: Where Clause on Sorted Dataset (was Re: The expensive INTNX
> function (was: Pass date values))
>
> Toby wrote, in part, regarding SAS where clause processing:
>
> > Here is what the docs actually state:
>
> > Processing a WHERE expression without an index requires SAS to
> > sequentially read
> > observations in order to find the ones that match the selection >
> > criteria. Without an index,
> > SAS first checks for the sort indicator, which is stored with the
> data
> > file from a previous
> > SORT procedure or SORTEDBY= data set option. If the sort indicator is
> > validated,
> > SAS takes advantage of it and stops reading the file once it is clear
> > there are no more
> > values that satisfy the WHERE expression. For example, consider a
> data set
> > that is sorted
> > by Age, without an index. To process the expression where age le 25,
> SAS
> > stops reading
> > observations after it finds an observation that is greater than 25.
> Note
> > that while SAS can
> > determine when to stop reading observations, without an index, there
> is no
> > indication
> > where to begin, so SAS always begins with the first observation,
> which can
> > require reading
> > a lot of observations.
>
> In the case of a sorted data set, I would *hope* that SAS would search
> for
> the first matching record by using a *binary*, rather than sequential,
> search. If that isn't the case, then someone needs to suggest using a
> binary search to locate the first matching observation as a performance
> enhancement.
>
> s/KAM
>
> cc: birdies
|