LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (September 2001, week 4)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Wed, 26 Sep 2001 10:53:48 -0400
Reply-To:     "Diskin, Dennis" <Dennis.Diskin@PHARMA.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         "Diskin, Dennis" <Dennis.Diskin@PHARMA.COM>
Subject:      Re: Speed of =: vs. SUBSTR()...
Comments: To: kviel <kviel@GMCF.ORG>
Content-Type: text/plain; charset="iso-8859-1"

FYI, I ran the same test under 6.12 UNIX with the following results:

77 data _null_; 78 DXLSF='29000'; 79 do i=1 to 1e7; 80 if substr(DXLSF,1,3) in ('290','295','297','298') then d=1; 81 end; 82 run;

NOTE: DATA statement used: real time 12.72 seconds cpu time 12.66 seconds

83 84 85 86 87 data _null_; 88 DXLSF='29000'; 89 do i=1 to 1e7; 90 if DXLSF in: ('290','295','297','298') then d=1; 91 end; 92 run;

NOTE: DATA statement used: real time 8.27 seconds cpu time 8.20 seconds

The substr still takes longer but not so dramatically.

Then I said (to myself), this may not be fair because the argument always matches the first item. So I ran:

93 data _null_; 94 DXLSF='29800'; 95 do i=1 to 1e7; 96 if substr(DXLSF,1,3) in ('290','295','297','298') then d=1; 97 end; 98 run;

NOTE: DATA statement used: real time 33.39 seconds cpu time 23.95 seconds

99 100 101 102 103 data _null_; 104 DXLSF='29800'; 105 do i=1 to 1e7; 106 if DXLSF in: ('290','295','297','298') then d=1; 107 end; 108 run;

NOTE: DATA statement used: real time 17.44 seconds cpu time 17.38 seconds

While the results are similar, this points out the fact that major optimization can be achieved by putting the list of codes in frequency order, either through common sense/experience or a quick freq on a sample.

hth, Dennis Diskin

> -----Original Message----- > From: kviel [SMTP:kviel@GMCF.ORG] > Sent: Wednesday, September 26, 2001 10:29 AM > To: SAS-L@LISTSERV.UGA.EDU > Subject: Re: Speed of =: vs. SUBSTR()... > > Laurel, > > I ran this test using V8.1 on Windows NT: > > 309 data _null_; > 310 DXLSF='29000'; > 311 do i=1 to 1e7; > 312 if substr(DXLSF,1,3) in ('290','295','297','298') then d=1; > 313 end; > 314 run; > > NOTE: DATA statement used: > real time 4.82 seconds > cpu time 4.82 seconds > > > 315 > 316 data _null_; > 317 DXLSF='29000'; > 318 do i=1 to 1e7; > 319 if DXLSF in: ('290','295','297','298') then d=1; > 320 end; > 321 run; > > NOTE: DATA statement used: > real time 1.39 seconds > cpu time 1.39 seconds > > I *think* that because we are interested in the first m bytes > (1-m) > instead of bytes n-m (where n ne 1), that the colon is faster. I am not > sure that the colon could be used otherwise (if fact, I think not). > Basically, you have coded a series of OR's. As I understand it, once the > first true (1) is encounter the execution ends. I believe if you search > the > archives, Dale McLerran has posted on this topic this year (2001). > Therefore, I would change the substr()'s to in: and test it on a > subset of your data. It appears as if this will reduce your time. > > Having just read Dennis' reply. I want to doubly emphasize a test > (the results of which you could post). I performed my test using a data > step. I was interested in the speed of the function. I am unsure whether > SQL will have the same results. If not, perhaps you could write a data > step > instead. > > Regards, > > Kevin > > Kevin Viel > Georgia Medical Care Foundation > 57 Executive Park South, NE > suite 200 > Atlanta, GA 30329-2224 > > > > > -----Original Message----- > From: Laurel Copeland [mailto:lacop@UMICH.EDU] > Sent: Wednesday, September 26, 2001 9:49 AM > To: SAS-L@LISTSERV.UGA.EDU > Subject: Speed of =: vs. SUBSTR()... > > > What would be fastest for a set of substring values? E.g., still with the > 5-character diagnostic codes, I want to pull off cases with any diagnosis > beginning with, say, 290, 295, 297, or 298. Running through 10 fields on > a > tape containing a few million records, I could use (SAS 6.09, MVS): > > Proc SQL; > CREATE table work.REG88a > AS SELECT PM.SCRSSN, > PM.DXLSF,PM.DXF2,PM.DXF3,PM.DXF4,PM.DXF5, > PM.DXF6,PM.DXF7,PM.DXF8,PM.DXF9,PM.DXF10 > FROM pm88.pm88 AS pm > WHERE substr(PM.DXLSF,1,3) IN('290','295','297','298') > OR substr(PM.DXF2,1,3) IN('290','295','297','298') > OR substr(PM.DXF3,1,3) IN('290','295','297','298') > OR substr(PM.DXF4,1,3) IN('290','295','297','298') > OR substr(PM.DXF5,1,3) IN('290','295','297','298') > OR substr(PM.DXF6,1,3) IN('290','295','297','298') > OR substr(PM.DXF7,1,3) IN('290','295','297','298') > OR substr(PM.DXF8,1,3) IN('290','295','297','298') > OR substr(PM.DXF9,1,3) IN('290','295','297','298') > OR substr(PM.DXF10,1,3) IN('290','295','297','298'); > > Is this my best bet? > Laurel Copeland > > At 09:24 AM 9/26/01 -0400, kviel wrote: > > Wim, > > I apologize. I would like to modify my previous suggestion. If, > indeed, we are > interested in this effect: > > if substr(c(i),1,1)='8' then d=1; > > Then this should really interest us: > > 87 data _null_; > 88 z01='80000'; > 89 do i=1 to 1e7; > 90 if substr(z01,1,1)='8' then d=1; > 91 end; > 92 run; > > NOTE: DATA statement used: > real time 3.98 seconds > cpu time 3.98 seconds > > > 93 > 94 data _null_; > 95 z01='80000'; > 96 do i=1 to 1e7; > 97 if z01=:'8' then d=1; > 98 end; > 99 run; > > NOTE: DATA statement used: > real time 0.27 seconds > cpu time 0.27 seconds > > In other words, on a Windows NT machine running v8.1, the substr() > function > was not as fast as the colon.


Back to: Top of message | Previous page | Main SAS-L page