LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous (more recent) messageNext (less recent) messagePrevious (more recent) in topicNext (less recent) in topicPrevious (more recent) by same authorNext (less recent) by same authorPrevious page (February 2004, week 4)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Fri, 27 Feb 2004 09:00:03 -0500
Reply-To:     ben.powell@CLA.CO.UK
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         ben.powell@CLA.CO.UK
Subject:      Re: Data Step vs. Proc sql
Content-Type: text/plain; charset=ISO-8859-1

Return of the Jedi?

On Thu, 26 Feb 2004 23:36:46 +0000, Ian Whitlock <iw1junk@COMCAST.NET> wrote:

>Glenn, > >Thanks for clarifying the discrepancy between what was asked of the two steps. > >However, I think you went to far with > > merge a(in=ina) b(where=(x <= 499)); > >The value 499 was not given directly to the SQL step, although it found it and >used it. I think a still fairer comparison would be to obtain the 499 as a >macro variable in a preparatory DATA step acting on A to find the maximum X >needed and then using > > merge a(in=ina) b(where=(x <= &maxx)); > >Since A has only 500 records I would still expect the DATA step to win, but by >less. > >You also indexed B. I don't know whether that was in the original problem. >Does it make a difference? If B is not indexed, SQL may make a hash index. > >What about the order? was that in the original problem? If you can assume B >is in order then > > if x > &maxx then stop ; > >may be faster than the WHERE clause. > >Ian_Whitlock@comcast.net > >================================================ > >Date: Thu, 26 Feb 2004 14:47:06 -0500 >Reply-To: Glenn Heagerty <gheagerty@EARTHLINK.NET> >From: Glenn Heagerty <gheagerty@EARTHLINK.NET> >Subject: Re: Data Step vs. Proc sql >Comments: To: rdporto1 <rdporto1@TERRA.COM.BR> >In-Reply-To: <200402261008.1aWpVLCx3NZFmQ1@tanager> >Content-Type: text/plain; charset=ISO-8859-1; format=flowed > >Hello rdporto1, > >Changed your code as follows: > > a. added index to a and b > b. ran PROC SQL first > c. switched a and b order in merge statement > d. added where= data set option to b in merge step > e. order of x is different in c then d > >The merge used one less second of real time. I don't know if this is what you >were looking for, but I think the data step in this case can be fed more >information to make it run as quick. > >Very interesting, > >Glenn > > > data a(index=(x)); > do a1=1 to 500; > x=500-a1; > a2=ranuni(87); > a3='XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'; > output; > end; > run; > > data b (index=(x)); > do b1=1 to 5000000; > if b1 in (10 20 30) then x=4; > else x=5000000-b1; > b2=ranuni(13); > b3=ranuni(14); > b4=ranuni(15); > b5=ranuni(16); > b6=ranuni(17); > output; > end; > run; > > proc sql; > create table d as > select * > from a as a, b as b > where a.x=b.x; > quit; > > data c; > merge a(in=ina) b(where=(x <= 499)); > by x; > if ina; > run; > >Log: > >36 proc sql; >37 create table d as >38 select * >39 from a as a, b as b >40 where a.x=b.x; >WARNING: Variable x already exists on file WORK.D. >NOTE: Table WORK.D created, with 503 rows and 10 columns. > >41 quit; >NOTE: PROCEDURE SQL used (Total process time): > real time 0.15 seconds > user cpu time 0.05 seconds > system cpu time 0.00 seconds > >44 data c; >45 merge a(in=ina) b(where=(x <= 499)); >46 by x; >47 if ina; >48 run; > >NOTE: There were 500 observations read from the data set WORK.A. >NOTE: There were 503 observations read from the data set WORK.B. > WHERE x<=499; >NOTE: The data set WORK.C has 503 observations and 10 variables. >NOTE: DATA statement used (Total process time): > real time 0.14 seconds > user cpu time 0.04 seconds > system cpu time 0.02 seconds > >rdporto1 wrote: >> So, I think everybody is convinced that proc sql CAN be faster than data step. >> >> My former question was if there is a way to make data step so smart as proc sql on this special situation (using macro? set set?...) I've tried some alternative codes unsuccessfully. >> >> >> De:"SAS(r) Discussion" SAS-L@LISTSERV.UGA.EDU >> >> Para:SAS-L@LISTSERV.UGA.EDU >> >> Cópia: >> >> Data:Thu, 26 Feb 2004 09:09:27 -0500 >> >> Assunto:Re: Data Step vs. Proc sql >> >> >> >> >>>Correction: In my earlier post, I meant to say that SQL "only reads the >>>rows it needs from B". >>> >>>Now here's a test where there is no index and PROC SQL and the DATA step >>>MERGE should be doing more or less the same thing; that is, sequential >>>matching: >>> >>>data b; >>>do b1=1 to 5000000; >>>x = b1; >>>b2=ranuni(1); >>>b3=ranuni(1); >>>b4=ranuni(1); >>>b5=ranuni(1); >>>b6=ranuni(1); >>>output; >>>end; >>>run; >>> >>>proc datasets library=work; >>>modify b (sortedby = x); quit; >>>run; >>> >>>data a; >>>do a1=1 to 500; >>>x=ceil(5000000*ranuni(1)); >>>a2=ranuni(1); >>>a3='XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'; >>>output; >>>end; >>>run; >>> >>>proc sort data=a; >>>by x; >>>run; >>> >>>proc sql; >>>create table d as >>>select * >>>from a as a, b as b >>>where a.x=b.x; >>>quit; >>> >>>data c; >>>merge b a (in=k); >>>by x; >>>if k; >>>run; >>>quit; >>> >>>proc compare data=c compare=d; >>>run; >>> >>>On my system (obviously much less powerful than rdporto1's), the SQL step >>>took 3 seconds and the DATA step took 8 seconds. >>> >>>These results depend very much on the values of X in the small table (A). I >>>generated the data so that these values would fall across the whole range >>>of X values found in B. If the X values in A are all at the low end, SQL is >>>smart enough to stop processing after A is exhausted, whereas the DATA step >>>keeps reading from B. >>> >>>Incidental points: Compression was counterproductive in this case, so I >>>removed that option. I used the same seed value in all of the RANUNI calls, >>>because in fact within a DATA step, the pseudorandom number functions >>>maintain only one stream. >>> >>>On Thu, 26 Feb 2004 06:13:02 -0500, ben.powell@CLA.CO.UK wrote: >>> >>> >>>>(And you should use quit not run to terminate proc datasets). >>>>So which is faster? >> >>


Back to: Top of message | Previous page | Main SAS-L page