LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (July 2010, week 2)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Fri, 9 Jul 2010 12:20:01 -0500
Reply-To:     Kevin Myers <KevinMyers@AUSTIN.RR.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Kevin Myers <KevinMyers@AUSTIN.RR.COM>
Subject:      Re: Where Clause on Sorted Dataset (was Re: The expensive INTNX
              function  (was: Pass date values))
Comments: To: "Keintz, H. Mark" <mkeintz@wharton.upenn.edu>
Content-Type: text/plain; format=flowed; charset="iso-8859-1";
              reply-type=original

Thanks for taking the time to look into that Mark. I'll try to find some time to submit an enhancement proposal to improve on this situation.

Regards, Kevin M.

----- Original Message ----- From: "Keintz, H. Mark" <mkeintz@wharton.upenn.edu> To: "Kevin Myers" <KevinMyers@AUSTIN.RR.COM>; <SAS-L@LISTSERV.UGA.EDU> Sent: Friday, July 09, 2010 10:43 Subject: RE: Where Clause on Sorted Dataset (was Re: The expensive INTNX function (was: Pass date values))

Kevin:

I think what you hope for is not the case. If it were, then finding the last 1,000,000 (out of 100,000,000) in a sorted dataset would take about as long as finding the first 1,000,000.

In the test below both HAVE and HAVE_S are sorted, but "sortedby" is set only in HAVE_S.

Case 1 (first 1MM) and Case 2 (last 1MM) both read HAVE and take about the same time (about 3.4 seconds on our Linux machine). No surprise.

Case 3 (first 1MM from HAVE_S) takes about 0.1 seconds. So it's doing what Toby described.

BUT ... case 4 (last 1MM from HAVE_S) takes about the same time as CASE 1 or 2 - i.e. there is not an intelligent search for the first qualifying record from a sorted data set.

data have; do i=1 to 1e8; output; end; run;

proc sort data=have out=have_s; by i; run;

/** Case 1, first 1,000,000, with sortedby off **/ data _null_; set have; where i between 1 and 1e6; run;

/** Case 2, last 1,000,000, with sortedby off **/ data _null_; set have; where i between 99e6 and 1e8; run;

/** Case 3, first 1,000,000, with sortedby on **/ data _null_; set have_s; where i between 1 and 1e6; run;

/** Case 4, last 1,000,000, with sortedby on **/ data _null_; set have_s; where i between 99e6 and 1e8; run;

> -----Original Message----- > From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of > Kevin Myers > Sent: Thursday, July 08, 2010 1:23 PM > To: SAS-L@LISTSERV.UGA.EDU > Subject: Where Clause on Sorted Dataset (was Re: The expensive INTNX > function (was: Pass date values)) > > Toby wrote, in part, regarding SAS where clause processing: > > > Here is what the docs actually state: > > > Processing a WHERE expression without an index requires SAS to > > sequentially read > > observations in order to find the ones that match the selection > > > criteria. Without an index, > > SAS first checks for the sort indicator, which is stored with the > data > > file from a previous > > SORT procedure or SORTEDBY= data set option. If the sort indicator is > > validated, > > SAS takes advantage of it and stops reading the file once it is clear > > there are no more > > values that satisfy the WHERE expression. For example, consider a > data set > > that is sorted > > by Age, without an index. To process the expression where age le 25, > SAS > > stops reading > > observations after it finds an observation that is greater than 25. > Note > > that while SAS can > > determine when to stop reading observations, without an index, there > is no > > indication > > where to begin, so SAS always begins with the first observation, > which can > > require reading > > a lot of observations. > > In the case of a sorted data set, I would *hope* that SAS would search > for > the first matching record by using a *binary*, rather than sequential, > search. If that isn't the case, then someone needs to suggest using a > binary search to locate the first matching observation as a performance > enhancement. > > s/KAM > > cc: birdies


Back to: Top of message | Previous page | Main SAS-L page