Date: Fri, 27 Feb 2004 09:00:03 -0500
Reply-To: ben.powell@CLA.CO.UK
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: ben.powell@CLA.CO.UK
Subject: Re: Data Step vs. Proc sql
Content-Type: text/plain; charset=ISO-8859-1
Return of the Jedi?
On Thu, 26 Feb 2004 23:36:46 +0000, Ian Whitlock <iw1junk@COMCAST.NET>
wrote:
>Glenn,
>
>Thanks for clarifying the discrepancy between what was asked of the two
steps.
>
>However, I think you went to far with
>
> merge a(in=ina) b(where=(x <= 499));
>
>The value 499 was not given directly to the SQL step, although it found it
and
>used it. I think a still fairer comparison would be to obtain the 499 as a
>macro variable in a preparatory DATA step acting on A to find the maximum X
>needed and then using
>
> merge a(in=ina) b(where=(x <= &maxx));
>
>Since A has only 500 records I would still expect the DATA step to win,
but by
>less.
>
>You also indexed B. I don't know whether that was in the original problem.
>Does it make a difference? If B is not indexed, SQL may make a hash index.
>
>What about the order? was that in the original problem? If you can
assume B
>is in order then
>
> if x > &maxx then stop ;
>
>may be faster than the WHERE clause.
>
>Ian_Whitlock@comcast.net
>
>================================================
>
>Date: Thu, 26 Feb 2004 14:47:06 -0500
>Reply-To: Glenn Heagerty <gheagerty@EARTHLINK.NET>
>From: Glenn Heagerty <gheagerty@EARTHLINK.NET>
>Subject: Re: Data Step vs. Proc sql
>Comments: To: rdporto1 <rdporto1@TERRA.COM.BR>
>In-Reply-To: <200402261008.1aWpVLCx3NZFmQ1@tanager>
>Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
>Hello rdporto1,
>
>Changed your code as follows:
>
> a. added index to a and b
> b. ran PROC SQL first
> c. switched a and b order in merge statement
> d. added where= data set option to b in merge step
> e. order of x is different in c then d
>
>The merge used one less second of real time. I don't know if this is what
you
>were looking for, but I think the data step in this case can be fed more
>information to make it run as quick.
>
>Very interesting,
>
>Glenn
>
>
> data a(index=(x));
> do a1=1 to 500;
> x=500-a1;
> a2=ranuni(87);
> a3='XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX';
> output;
> end;
> run;
>
> data b (index=(x));
> do b1=1 to 5000000;
> if b1 in (10 20 30) then x=4;
> else x=5000000-b1;
> b2=ranuni(13);
> b3=ranuni(14);
> b4=ranuni(15);
> b5=ranuni(16);
> b6=ranuni(17);
> output;
> end;
> run;
>
> proc sql;
> create table d as
> select *
> from a as a, b as b
> where a.x=b.x;
> quit;
>
> data c;
> merge a(in=ina) b(where=(x <= 499));
> by x;
> if ina;
> run;
>
>Log:
>
>36 proc sql;
>37 create table d as
>38 select *
>39 from a as a, b as b
>40 where a.x=b.x;
>WARNING: Variable x already exists on file WORK.D.
>NOTE: Table WORK.D created, with 503 rows and 10 columns.
>
>41 quit;
>NOTE: PROCEDURE SQL used (Total process time):
> real time 0.15 seconds
> user cpu time 0.05 seconds
> system cpu time 0.00 seconds
>
>44 data c;
>45 merge a(in=ina) b(where=(x <= 499));
>46 by x;
>47 if ina;
>48 run;
>
>NOTE: There were 500 observations read from the data set WORK.A.
>NOTE: There were 503 observations read from the data set WORK.B.
> WHERE x<=499;
>NOTE: The data set WORK.C has 503 observations and 10 variables.
>NOTE: DATA statement used (Total process time):
> real time 0.14 seconds
> user cpu time 0.04 seconds
> system cpu time 0.02 seconds
>
>rdporto1 wrote:
>> So, I think everybody is convinced that proc sql CAN be faster than data
step.
>>
>> My former question was if there is a way to make data step so smart as
proc sql on this special situation (using macro? set set?...) I've tried
some alternative codes unsuccessfully.
>>
>>
>> De:"SAS(r) Discussion" SAS-L@LISTSERV.UGA.EDU
>>
>> Para:SAS-L@LISTSERV.UGA.EDU
>>
>> Cópia:
>>
>> Data:Thu, 26 Feb 2004 09:09:27 -0500
>>
>> Assunto:Re: Data Step vs. Proc sql
>>
>>
>>
>>
>>>Correction: In my earlier post, I meant to say that SQL "only reads the
>>>rows it needs from B".
>>>
>>>Now here's a test where there is no index and PROC SQL and the DATA step
>>>MERGE should be doing more or less the same thing; that is, sequential
>>>matching:
>>>
>>>data b;
>>>do b1=1 to 5000000;
>>>x = b1;
>>>b2=ranuni(1);
>>>b3=ranuni(1);
>>>b4=ranuni(1);
>>>b5=ranuni(1);
>>>b6=ranuni(1);
>>>output;
>>>end;
>>>run;
>>>
>>>proc datasets library=work;
>>>modify b (sortedby = x); quit;
>>>run;
>>>
>>>data a;
>>>do a1=1 to 500;
>>>x=ceil(5000000*ranuni(1));
>>>a2=ranuni(1);
>>>a3='XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX';
>>>output;
>>>end;
>>>run;
>>>
>>>proc sort data=a;
>>>by x;
>>>run;
>>>
>>>proc sql;
>>>create table d as
>>>select *
>>>from a as a, b as b
>>>where a.x=b.x;
>>>quit;
>>>
>>>data c;
>>>merge b a (in=k);
>>>by x;
>>>if k;
>>>run;
>>>quit;
>>>
>>>proc compare data=c compare=d;
>>>run;
>>>
>>>On my system (obviously much less powerful than rdporto1's), the SQL step
>>>took 3 seconds and the DATA step took 8 seconds.
>>>
>>>These results depend very much on the values of X in the small table
(A). I
>>>generated the data so that these values would fall across the whole range
>>>of X values found in B. If the X values in A are all at the low end, SQL
is
>>>smart enough to stop processing after A is exhausted, whereas the DATA
step
>>>keeps reading from B.
>>>
>>>Incidental points: Compression was counterproductive in this case, so I
>>>removed that option. I used the same seed value in all of the RANUNI
calls,
>>>because in fact within a DATA step, the pseudorandom number functions
>>>maintain only one stream.
>>>
>>>On Thu, 26 Feb 2004 06:13:02 -0500, ben.powell@CLA.CO.UK wrote:
>>>
>>>
>>>>(And you should use quit not run to terminate proc datasets).
>>>>So which is faster?
>>
>>
|