Date: Fri, 9 Jul 2010 15:43:56 +0000
Reply-To: "Keintz, H. Mark" <mkeintz@WHARTON.UPENN.EDU>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "Keintz, H. Mark" <mkeintz@WHARTON.UPENN.EDU>
Subject: Re: Where Clause on Sorted Dataset (was Re: The expensive INTNX
function (was: Pass date values))
In-Reply-To: <32F65931040F4D969BD7AEC7DAD6F7A5@KAM1720>
Content-Type: text/plain; charset="us-ascii"
Kevin:
I think what you hope for is not the case. If it were, then finding the last 1,000,000 (out of 100,000,000) in a sorted dataset would take about as long as finding the first 1,000,000.
In the test below both HAVE and HAVE_S are sorted, but "sortedby" is set only in HAVE_S.
Case 1 (first 1MM) and Case 2 (last 1MM) both read HAVE and take about the same time (about 3.4 seconds on our Linux machine). No surprise.
Case 3 (first 1MM from HAVE_S) takes about 0.1 seconds. So it's doing what Toby described.
BUT ... case 4 (last 1MM from HAVE_S) takes about the same time as CASE 1 or 2 - i.e. there is not an intelligent search for the first qualifying record from a sorted data set.
data have;
do i=1 to 1e8; output; end;
run;
proc sort data=have out=have_s;
by i;
run;
/** Case 1, first 1,000,000, with sortedby off **/
data _null_;
set have;
where i between 1 and 1e6;
run;
/** Case 2, last 1,000,000, with sortedby off **/
data _null_;
set have;
where i between 99e6 and 1e8;
run;
/** Case 3, first 1,000,000, with sortedby on **/
data _null_;
set have_s;
where i between 1 and 1e6;
run;
/** Case 4, last 1,000,000, with sortedby on **/
data _null_;
set have_s;
where i between 99e6 and 1e8;
run;
> -----Original Message-----
> From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of
> Kevin Myers
> Sent: Thursday, July 08, 2010 1:23 PM
> To: SAS-L@LISTSERV.UGA.EDU
> Subject: Where Clause on Sorted Dataset (was Re: The expensive INTNX
> function (was: Pass date values))
>
> Toby wrote, in part, regarding SAS where clause processing:
>
> > Here is what the docs actually state:
>
> > Processing a WHERE expression without an index requires SAS to
> > sequentially read
> > observations in order to find the ones that match the selection >
> > criteria. Without an index,
> > SAS first checks for the sort indicator, which is stored with the
> data
> > file from a previous
> > SORT procedure or SORTEDBY= data set option. If the sort indicator is
> > validated,
> > SAS takes advantage of it and stops reading the file once it is clear
> > there are no more
> > values that satisfy the WHERE expression. For example, consider a
> data set
> > that is sorted
> > by Age, without an index. To process the expression where age le 25,
> SAS
> > stops reading
> > observations after it finds an observation that is greater than 25.
> Note
> > that while SAS can
> > determine when to stop reading observations, without an index, there
> is no
> > indication
> > where to begin, so SAS always begins with the first observation,
> which can
> > require reading
> > a lot of observations.
>
> In the case of a sorted data set, I would *hope* that SAS would search
> for
> the first matching record by using a *binary*, rather than sequential,
> search. If that isn't the case, then someone needs to suggest using a
> binary search to locate the first matching observation as a performance
> enhancement.
>
> s/KAM
>
> cc: birdies
|