Date: Wed, 26 Sep 2001 10:53:48 -0400
Reply-To: "Diskin, Dennis" <Dennis.Diskin@PHARMA.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "Diskin, Dennis" <Dennis.Diskin@PHARMA.COM>
Subject: Re: Speed of =: vs. SUBSTR()...
Content-Type: text/plain; charset="iso-8859-1"
FYI, I ran the same test under 6.12 UNIX with the following results:
77 data _null_;
78 DXLSF='29000';
79 do i=1 to 1e7;
80 if substr(DXLSF,1,3) in ('290','295','297','298') then d=1;
81 end;
82 run;
NOTE: DATA statement used:
real time 12.72 seconds
cpu time 12.66 seconds
83
84
85
86
87 data _null_;
88 DXLSF='29000';
89 do i=1 to 1e7;
90 if DXLSF in: ('290','295','297','298') then d=1;
91 end;
92 run;
NOTE: DATA statement used:
real time 8.27 seconds
cpu time 8.20 seconds
The substr still takes longer but not so dramatically.
Then I said (to myself), this may not be fair because the argument always
matches the first item. So I ran:
93 data _null_;
94 DXLSF='29800';
95 do i=1 to 1e7;
96 if substr(DXLSF,1,3) in ('290','295','297','298') then d=1;
97 end;
98 run;
NOTE: DATA statement used:
real time 33.39 seconds
cpu time 23.95 seconds
99
100
101
102
103 data _null_;
104 DXLSF='29800';
105 do i=1 to 1e7;
106 if DXLSF in: ('290','295','297','298') then d=1;
107 end;
108 run;
NOTE: DATA statement used:
real time 17.44 seconds
cpu time 17.38 seconds
While the results are similar, this points out the fact that major
optimization can be achieved by putting the list of codes in frequency
order, either through common sense/experience or a quick freq on a sample.
hth,
Dennis Diskin
> -----Original Message-----
> From: kviel [SMTP:kviel@GMCF.ORG]
> Sent: Wednesday, September 26, 2001 10:29 AM
> To: SAS-L@LISTSERV.UGA.EDU
> Subject: Re: Speed of =: vs. SUBSTR()...
>
> Laurel,
>
> I ran this test using V8.1 on Windows NT:
>
> 309 data _null_;
> 310 DXLSF='29000';
> 311 do i=1 to 1e7;
> 312 if substr(DXLSF,1,3) in ('290','295','297','298') then d=1;
> 313 end;
> 314 run;
>
> NOTE: DATA statement used:
> real time 4.82 seconds
> cpu time 4.82 seconds
>
>
> 315
> 316 data _null_;
> 317 DXLSF='29000';
> 318 do i=1 to 1e7;
> 319 if DXLSF in: ('290','295','297','298') then d=1;
> 320 end;
> 321 run;
>
> NOTE: DATA statement used:
> real time 1.39 seconds
> cpu time 1.39 seconds
>
> I *think* that because we are interested in the first m bytes
> (1-m)
> instead of bytes n-m (where n ne 1), that the colon is faster. I am not
> sure that the colon could be used otherwise (if fact, I think not).
> Basically, you have coded a series of OR's. As I understand it, once the
> first true (1) is encounter the execution ends. I believe if you search
> the
> archives, Dale McLerran has posted on this topic this year (2001).
> Therefore, I would change the substr()'s to in: and test it on a
> subset of your data. It appears as if this will reduce your time.
>
> Having just read Dennis' reply. I want to doubly emphasize a test
> (the results of which you could post). I performed my test using a data
> step. I was interested in the speed of the function. I am unsure whether
> SQL will have the same results. If not, perhaps you could write a data
> step
> instead.
>
> Regards,
>
> Kevin
>
> Kevin Viel
> Georgia Medical Care Foundation
> 57 Executive Park South, NE
> suite 200
> Atlanta, GA 30329-2224
>
>
>
>
> -----Original Message-----
> From: Laurel Copeland [mailto:lacop@UMICH.EDU]
> Sent: Wednesday, September 26, 2001 9:49 AM
> To: SAS-L@LISTSERV.UGA.EDU
> Subject: Speed of =: vs. SUBSTR()...
>
>
> What would be fastest for a set of substring values? E.g., still with the
> 5-character diagnostic codes, I want to pull off cases with any diagnosis
> beginning with, say, 290, 295, 297, or 298. Running through 10 fields on
> a
> tape containing a few million records, I could use (SAS 6.09, MVS):
>
> Proc SQL;
> CREATE table work.REG88a
> AS SELECT PM.SCRSSN,
> PM.DXLSF,PM.DXF2,PM.DXF3,PM.DXF4,PM.DXF5,
> PM.DXF6,PM.DXF7,PM.DXF8,PM.DXF9,PM.DXF10
> FROM pm88.pm88 AS pm
> WHERE substr(PM.DXLSF,1,3) IN('290','295','297','298')
> OR substr(PM.DXF2,1,3) IN('290','295','297','298')
> OR substr(PM.DXF3,1,3) IN('290','295','297','298')
> OR substr(PM.DXF4,1,3) IN('290','295','297','298')
> OR substr(PM.DXF5,1,3) IN('290','295','297','298')
> OR substr(PM.DXF6,1,3) IN('290','295','297','298')
> OR substr(PM.DXF7,1,3) IN('290','295','297','298')
> OR substr(PM.DXF8,1,3) IN('290','295','297','298')
> OR substr(PM.DXF9,1,3) IN('290','295','297','298')
> OR substr(PM.DXF10,1,3) IN('290','295','297','298');
>
> Is this my best bet?
> Laurel Copeland
>
> At 09:24 AM 9/26/01 -0400, kviel wrote:
>
> Wim,
>
> I apologize. I would like to modify my previous suggestion. If,
> indeed, we are
> interested in this effect:
>
> if substr(c(i),1,1)='8' then d=1;
>
> Then this should really interest us:
>
> 87 data _null_;
> 88 z01='80000';
> 89 do i=1 to 1e7;
> 90 if substr(z01,1,1)='8' then d=1;
> 91 end;
> 92 run;
>
> NOTE: DATA statement used:
> real time 3.98 seconds
> cpu time 3.98 seconds
>
>
> 93
> 94 data _null_;
> 95 z01='80000';
> 96 do i=1 to 1e7;
> 97 if z01=:'8' then d=1;
> 98 end;
> 99 run;
>
> NOTE: DATA statement used:
> real time 0.27 seconds
> cpu time 0.27 seconds
>
> In other words, on a Windows NT machine running v8.1, the substr()
> function
> was not as fast as the colon.
|