Date: Wed, 3 Sep 2003 13:16:28 0400
ReplyTo: Glenn Heagerty <gheagerty@ADELPHIA.NET>
Sender: "SAS(r) Discussion" <SASL@LISTSERV.UGA.EDU>
From: Glenn Heagerty <gheagerty@ADELPHIA.NET>
Organization: Consumer Solutions, Inc.
Subject: Re: An efficiency question
InReplyTo: <BAY1F117AagmQ5WTaY00014850@hotmail.com>
ContentType: text/plain; charset=usascii; format=flowed
Hi all,
I really should be working on a project, but since this came across last night
I've been hooked.
I generated a test data set as follows:
options compress=no;
data test(drop=i j);
array vars{30} $4 var1var30;
do i=1 to 3000000;
do j=1 to 30;
vars{j} = put(min(10000*round(ranuni(13), 0.0001), 9999), z4.);
end;
output;
end;
run;
I don't know how many unique subcategory codes Chris has, but the test data set
should suffice. The bummer is that despite all the different ways of setting the
flags (double loops, using OR, using index, using indexw, using direct
addressing, making assumptions about the categories, formats, etc.) nothing beat
the original code's processing time. Is this a feature of the test data set? I
don't know. Maybe using call peek would help, but I'm not familiar with this
function.
Now based on additional information provided by Chris and everyone's suggestions
I think we can quantitify the problem as follows:
1) Chris has 30 input variables containing subcategory codes
2) Chris needs to set over 2500 indicator variables
3) Chris has lots of subcategory codes  at least 2,500 maybe
4) Subcategory codes can belong to one or more categories
It seems like there is no other choice but to check each variable against the
list of subcategory codes for a given category and subcategory, any additional
statements added to check to see if the indicator variable has already been set
just adds to the time, and that it comes down to defining at least one array for
the input variables. So the task is to enter the subcategory code lists most
efficiently and use within a category (pick one):
* ifthenelse statements
* select statements
* formats
As Paul Choate suggested, if the frequency of subcategory codes is skewed
towards a few codes or categories have a dependence on other categories, than
that would drive the order of the lists.
Well, thanks for following along. Hopefully, it made some sense.
Glenn
Christopher Sadler wrote:
> Many thanks to everyone for the responses, but I made an embarrassing
> mistake in the example. That second line should have been 'else if' as
> I am
> already using else statements. This, of course, does improve the speed.
> However, I did a terrible job of providing an example of the true nature of
> why I'm looking for more ways to improve efficiency. The volume of
> comparisions is huge, as there are over 50 categories (more to come) and
> all
> their respective subcategories that need to be evaluated (again, this is to
> be run on datasets of at least 3 million records). Some categories
> overlap
> and some will end up being blank, having no value for any of the flag
> variables (so I think adding an 'until' clause as Paul suggested is not
> really an option). I was hoping there might be some means to eliminate
> the
> loop and evaluate the entire array against an 'in' at the same time (?)
>
> Again, many thanks,
> Chris
>
> A better example is this:
>
> data new;
> set old;
> array _vars_(30) var1var30;
>
> do i = 1 to 30;
> if _vars_(i) in ('3244','5664','6756') then category_a_1=1;
> else if _vars_(i) in ('6546','4433') then category_a_2=1;
> continue with about 50 more 'else if' statements for category a's
> subcategories.........
>
> if _vars_(i) in ('3423','5533,'5543') then category_b_1=1
> else if _vars_(i) in ('5344','3244','6443') then category_b_2=1;
> continue with about 50 more 'else if' statements for category b's
> subcategories.....
>
> continue with about 50 more categories.......
> end;
>
> run;
>
>
>
>
>> From: "Paul Dorfman" <paul_dorfman@hotmail.com>
>> ReplyTo: sashole@bellsouth.net
>> To: cjsadler@HOTMAIL.COM, SASL@LISTSERV.UGA.EDU
>> Subject: Re: An efficiency question
>> Date: Wed, 03 Sep 2003 06:16:08 +0000
>>
>> Chris,
>>
>> You are searching sequentially but not stopping short when the
>> condition is
>> satisfied and keep going all the way to i=30. This should improve the
>> loop
>> performance about twice *on the average*:
>>
>> do i = 1 to 30 until ( flag1 and flag2 ) ;
>> if _vars_(i) in ('3244','5664','6756') then flag1=1;
>> if _vars_(i) in ('6546','4433') then flag2=1;
>> end;
>>
>> Kind regards,
>> =================
>> Paul M. Dorfman
>> Jacksonville, FL
>> =================
>>
>
> _________________________________________________________________
> MSN 8: Get 6 months for $9.95/month http://join.msn.com/?page=dept/dialup
>
