Date: Fri, 5 Sep 2003 08:30:10 -0400
Reply-To: Ian Whitlock <WHITLOI1@WESTAT.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Ian Whitlock <WHITLOI1@WESTAT.COM>
Subject: Re: An efficiency question
Content-Type: text/plain
In a private message, John agreed that SUBCAT should not be part of the
primary index. He pointed out that his main contribution to the code was
filling the array.
He also pointed out my fallacy in claiming that CATEGORY_A=number could
store the information in CATEGORY_A_1, CATEGORY_A_2, etc. If there were
only one VAR this would be true, but the loop is done 30 times so in
principle there could be 30 different category A's set to 1. While this
information might be bit mapped into one variable, it is not what I was
suggesting. Strikeout the comment beginning "One more thing". I now know
why I didn't raise the issue earlier.
IanWhitlock@westat.com
-----Original Message-----
From: Ian Whitlock
Sent: Thursday, September 04, 2003 3:44 PM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: An efficiency question
John,
I strongly agree with you that Dennis' approach is the right one and that it
is important to separate the control data from the code.
As a minor point you say "Make the primary key all three fields." I think
only ID and CAT should be in the primary index. Otherwise the system would
not guarantee exclusion of mistakes like
id cat subcat
3566 A 1
3566 A 2
Or to put it another way SUBCAT should be a function of ID and CAT.
I wonder about the complexity. Dennis kept it very simple by using one
large rectangular array with possible many empty entries. You have saved
memory at the cost of more complex code. Have you done any tests to show
that the added complexity actually does save execution TIME?
In any case, your code is instructive and the question with various answers
provides a very good mini-tutorial on SAS programming.
One more thing that has bothered me throughout is the use of
Category_a_1 = 1 ... Category_b_3 = 1 ...
This is a horrible way to store the informationally equivalent (note the
else's in Chris' code)
Category_a = number Category_b = number ...
I assume it is part of something that will eventually consume binary
variables. But I wonder if the reduction from 2500 binary variables to 50
numeric indexes would not make it pay to do this construction at a later
point in the system.
IanWhitlock@westat.com
-----Original Message-----
From: John M. Wildenthal [mailto:john_m_wildenthal@BANKONE.COM]
Sent: Thursday, September 04, 2003 2:28 PM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: An efficiency question
Dennis Diskin has the best approach in his last post - direct matrix
lookups. I will assume, as he did, that your ids are all essentially
numbers. If not, you can convert them to numbers on the fly as an exercise
;)
The coding need not be that involved. This email will demonstrate a
straightforward solution to that problem. Code is UNTESTED.
First off, do you really want to maintain the list of 50+ categories, each
with 50+ subcategories, with each of those 2500 elements having multiple
triggering conditions, inside your code? If there are 3 conditions (on
average) for each different subcategory, that would be 7500 relationships. I
wouldn't want to be responsible for maintaining that list, particularly
within production source code. Let us assume you have access to an RDBMS
from SAS (at least ODBC, right?). You can place an appropriate table out
there, something like (monospace font highly recommended)
id cat subcat
3566 A 1
4863 ZZ 33
9327 YS 56
3566 U 23
And so on. Make the primary key all three fields.
Having the data in an RDBMS like this means you can have others maintain all
of the mappings via an ASP page or some other simple access mechanism. You
know they will eventually want to change them, so let's make the future
comfortable.
Now to the coding. We need to build a BIG matrix that says what is to be
set by what values. It will require (COUNT(DISTINCT cat)) x (MAX(id)
-MIN(id)+1) x 8 bytes of memory. For a range of 10,000 id values and 50
categories, that would be about 4MB, well within any reasonable limits.
Multiplying the range by 10 (100,000 difference between MIN and MAX) means
40MB, which any modern PC should easily handle. Make the difference
1,000,000 and you will need a nice PC (>512MB RAM) to prevent paging.
Let's get on with it. I will assume the above table is called
orcldata.cat_table:
PROC SQL STIMER;
SELECT COUNT(DISTINCT cat), MAX(id), MIN(id)
INTO :catcount, :maxarray, :minarray
FROM orcldata.cat_table; %* this is needed to build the big lookup
matrix;
SELECT cat, MAX(subcat)
INTO :catlist SEPARATED BY ' ', :arraysize SEPARATED BY ' '
FROM orcldata.cat_table
GROUP BY cat; %* this is needed to build an ARRAY for each category;
QUIT;
%MACRO arraylist(cats, sizes); %* create the ARRAY statements for each
category;
%LOCAL i token lval;
%LET i = 1;
%LET token = %SCAN(&cats,&i);
%DO %WHILE (%LENGTH(&token));
%LET lval = %SCAN(&sizes,&i);
ARRAY category_&token.[0:&lval] category_&token._0-category_&token._&lval
(%EVAL(&lval+1)*0);
%LET i = %EVAL(&i + 1);
%LET token = %SCAN(&cats,&i);
%END;
%MEND;
%MACRO indexcalc(cats); %* perform the mapping from cat to the column
number in the lookup matrix;
%LOCAL i token;
%LET i = 1;
%LET token = %SCAN(&cats,&i);
%DO %WHILE (%LENGTH(&token));
WHEN ("&token") ival = &i;
%LET i = %EVAL(&i + 1);
%LET token = %SCAN(&cats,&i);
%END;
%MEND;
%MACRO arrayassign(cats); %* do the actual lookup and assign the appropriate
subcategory value, with bounds checking;
%LOCAL i token;
%LET i = 1;
%LET token = %SCAN(&cats,&i);
%DO %WHILE (%LENGTH(&token));
IF &minarray LE id LE &maxarray THEN
category_&token.[lookups[&i,id]] = 1;
%LET i = %EVAL(&i + 1);
%LET token = %SCAN(&cats,&i);
%END;
%MEND;
%MACRO droplist(cats); %* drop unneeded variables ;
%LOCAL i token;
%LET i = 1;
%LET token = %SCAN(&cats,&i);
%DO %WHILE (%LENGTH(&token));
IF &minarray LE id LE &maxarray THEN
category_&token._0
%LET i = %EVAL(&i + 1);
%LET token = %SCAN(&cats,&i);
%END;
%MEND;
DATA old;
ARRAY lookups[&catcount,&minarray:&maxarray] _TEMPORARY_ (%EVAL
(&catcount*(&maxarray - &minarray + 1))*0); %* initialize all values to zero
;
%arraylist(&catlist,&arraysize);
SET new;
eof = 0;
IF (_N_ EQ 1) THEN
DO WHILE (NOT eof);
SET orcldata.cat_table END=eof;
SELECT (cat);
%indexcalc(&catlist)
OTHERWISE
ABORT ABEND; %* table must have been edited between PROC SQL
execution and now;
END;
lookups(ival,id) = subcat;
END;
DO i = 1 TO 30;
%arrayassign(&catlist)
END;
DROP eof ival i %droplist(&catlist);
RUN;
This transmission may contain information that is privileged, confidential
and/or exempt from disclosure under applicable law. If you are not the
intended recipient, you are hereby notified that any disclosure, copying,
distribution, or use of the information contained herein (including any
reliance thereon) is STRICTLY PROHIBITED. If you received this transmission
in error, please immediately contact the sender and destroy the material in
its entirety, whether in electronic or hard copy format. Thank you.