Date: Fri, 27 Feb 2004 09:00:03 -0500
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
Subject: Re: Data Step vs. Proc sql
Content-Type: text/plain; charset=ISO-8859-1
Return of the Jedi?
On Thu, 26 Feb 2004 23:36:46 +0000, Ian Whitlock <iw1junk@COMCAST.NET> wrote:
>Thanks for clarifying the discrepancy between what was asked of the two
>steps. However, I think you went too far with
> merge a(in=ina) b(where=(x <= 499));
>The value 499 was not given directly to the SQL step, although it found and
>used it. I think a still fairer comparison would be to obtain the 499 as a
>macro variable in a preparatory DATA step acting on A to find the maximum X
>needed and then using
> merge a(in=ina) b(where=(x <= &maxx));
>Since A has only 500 records I would still expect the DATA step to win.
>You also indexed B. I don't know whether that was in the original problem.
>Does it make a difference? If B is not indexed, SQL may make a hash index.
>What about the order? Was that in the original problem? If you can guarantee
>that B is in order, then
> if x > &maxx then stop ;
>may be faster than the WHERE clause.
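Ian's two-step suggestion might be sketched like this (untested; assumes A is
already sorted by X, as the BY-merge requires, and that X is numeric):

    data _null_;
      set a end=last;
      if last then call symputx('maxx', x); /* A sorted by X: last obs holds the max */
    run;

    data c;
      merge a(in=ina) b(where=(x <= &maxx));
      by x;
      if ina;
    run;

CALL SYMPUTX trims the value for you; on older releases CALL SYMPUT with
PUT and LEFT does the same job.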
>Date: Thu, 26 Feb 2004 14:47:06 -0500
>Reply-To: Glenn Heagerty <gheagerty@EARTHLINK.NET>
>From: Glenn Heagerty <gheagerty@EARTHLINK.NET>
>Subject: Re: Data Step vs. Proc sql
>Comments: To: rdporto1 <rdporto1@TERRA.COM.BR>
>Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>Changed your code as follows:
> a. added index to a and b
> b. ran PROC SQL first
> c. switched a and b order in merge statement
> d. added where= data set option to b in merge step
> e. order of x is different in C than in D
>The merge used one less second of real time. I don't know if this is what you
>were looking for, but I think the data step in this case can be fed more
>information to make it run as quickly.
> data a(index=(x));
>   do a1=1 to 500;
>     x=a1; /* assumed: consistent with the 500 obs in A and 503-row join in the log */
>     output;
>   end;
> run;
>
> data b (index=(x));
>   do b1=1 to 5000000;
>     if b1 in (10 20 30) then x=4;
>     else x=5000000-b1;
>     output;
>   end;
> run;
>
> proc sql;
>   create table d as
>   select *
>   from a as a, b as b
>   where a.x=b.x;
> quit;
>
> data c;
>   merge a(in=ina) b(where=(x <= 499));
>   by x;
>   if ina;
> run;
>36 proc sql;
>37 create table d as
>38 select *
>39 from a as a, b as b
>40 where a.x=b.x;
>WARNING: Variable x already exists on file WORK.D.
>NOTE: Table WORK.D created, with 503 rows and 10 columns.
>NOTE: PROCEDURE SQL used (Total process time):
> real time 0.15 seconds
> user cpu time 0.05 seconds
> system cpu time 0.00 seconds
>44 data c;
>45 merge a(in=ina) b(where=(x <= 499));
>46 by x;
>47 if ina;
>NOTE: There were 500 observations read from the data set WORK.A.
>NOTE: There were 503 observations read from the data set WORK.B.
> WHERE x<=499;
>NOTE: The data set WORK.C has 503 observations and 10 variables.
>NOTE: DATA statement used (Total process time):
> real time 0.14 seconds
> user cpu time 0.04 seconds
> system cpu time 0.02 seconds
>> So, I think everybody is convinced that proc sql CAN be faster than the data
>> step. My former question was if there is a way to make the data step as smart
>> as proc sql in this special situation (using macro? set set?...). I've tried
>> some alternative codes unsuccessfully.
>> From: "SAS(r) Discussion" SAS-L@LISTSERV.UGA.EDU
>> Date: Thu, 26 Feb 2004 09:09:27 -0500
>> Subject: Re: Data Step vs. Proc sql
>>>Correction: In my earlier post, I meant to say that SQL "only reads the
>>>rows it needs from B".
>>>Now here's a test where there is no index and PROC SQL and the DATA step
>>>MERGE should be doing more or less the same thing; that is, sequential reads.
>>>do b1=1 to 5000000;
>>>x = b1;
>>>proc datasets library=work;
>>>modify b (sortedby = x); quit;
>>>do a1=1 to 500;
>>>proc sort data=a;
>>>create table d as
>>>from a as a, b as b
>>>merge b a (in=k);
>>>proc compare data=c compare=d;
>>>On my system (obviously much less powerful than rdporto1's), the SQL step
>>>took 3 seconds and the DATA step took 8 seconds.
>>>These results depend very much on the values of X in the small table A. I
>>>generated the data so that these values would fall across the whole range
>>>of X values found in B. If the X values in A are all at the low end, SQL is
>>>smart enough to stop processing after A is exhausted, whereas the DATA step
>>>keeps reading from B.
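One way to give the DATA step that same early exit (a sketch, assuming both
tables are sorted by X; the end= flag tells us when A is exhausted):

    data c;
      merge a(in=ina end=lasta) b;
      by x;
      if lasta and not ina then stop; /* past A's largest X: no more matches possible */
      if ina;
    run;

The STOP has to come before the subsetting IF, or the step keeps deleting
unmatched B rows to the end of the file instead of quitting.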
>>>Incidental points: Compression was counterproductive in this case, so I
>>>removed that option. I used the same seed value in all of the RANUNI calls,
>>>because in fact within a DATA step, the pseudorandom number functions
>>>maintain only one stream.
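That single-stream behavior is easy to see (sketch): the seed on any call
after the first is ignored, so changing it changes nothing.

    data _null_;
      do i=1 to 3;
        x=ranuni(111);
        y=ranuni(999); /* 999 is ignored: both calls draw from the stream seeded by 111 */
        put x= y=;
      end;
    run;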
>>>On Thu, 26 Feb 2004 06:13:02 -0500, ben.powell@CLA.CO.UK wrote:
>>>>(And you should use quit not run to terminate proc datasets).
>>>>So which is faster?