LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (September 2003, week 1)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Fri, 5 Sep 2003 08:30:10 -0400
Reply-To:     Ian Whitlock <WHITLOI1@WESTAT.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Ian Whitlock <WHITLOI1@WESTAT.COM>
Subject:      Re: An efficiency question
Comments: cc: "john_m_wildenthal@BANKONE.COM"
          <IMCEAMAILTO-john+5Fm+5Fwildenthal+40BANKONE+2ECOM@WESTAT.com>
Content-Type: text/plain

In a private message, John agreed that SUBCAT should not be part of the primary index. He pointed out that his main contribution to the code was filling the array.

He also pointed out my fallacy in claiming that CATEGORY_A=number could store the information in CATEGORY_A_1, CATEGORY_A_2, etc. If there were only one VAR this would be true, but the loop is done 30 times so in principle there could be 30 different category A's set to 1. While this information might be bit mapped into one variable, it is not what I was suggesting. Strikeout the comment beginning "One more thing". I now know why I didn't raise the issue earlier.

IanWhitlock@westat.com

-----Original Message----- From: Ian Whitlock Sent: Thursday, September 04, 2003 3:44 PM To: SAS-L@LISTSERV.UGA.EDU Subject: Re: An efficiency question

John,

I strongly agree with you that Dennis' approach is the right one and that it is important to separate the control data from the code.

As a minor point you say "Make the primary key all three fields." I think only ID and CAT should be in the primary index. Otherwise the system would not guarantee exclusion of mistakes like

id cat subcat 3566 A 1 3566 A 2

Or to put it another way SUBCAT should be a function of ID and CAT.

I wonder about the complexity. Dennis kept it very simple by using one large rectangular array with possible many empty entries. You have saved memory at the cost of more complex code. Have you done any tests to show that the added complexity actually does save execution TIME?

In any case, your code is instructive and the question with various answers provides a very good mini-tutorial on SAS programming.

One more thing that has bothered me throughout is the use of

Category_a_1 = 1 ... Category_b_3 = 1 ...

This is a horrible way to store the informationally equivalent (note the else's in Chris' code)

Category_a = number Category_b = number ...

I assume it is part of something that will eventually consume binary variables. But I wonder if the reduction from 2500 binary variables to 50 numeric indexes would not make it pay to do this construction at a later point in the system.

IanWhitlock@westat.com

-----Original Message----- From: John M. Wildenthal [mailto:john_m_wildenthal@BANKONE.COM] Sent: Thursday, September 04, 2003 2:28 PM To: SAS-L@LISTSERV.UGA.EDU Subject: Re: An efficiency question

Dennis Diskin has the best approach in his last post - direct matrix lookups. I will assume, as he did, that your ids are all essentially numbers. If not, you can convert them to numbers on the fly as an exercise ;)

The coding need not be that involved. This email will demonstrate a straightforward solution to that problem. Code is UNTESTED.

First off, do you really want to maintain the list of 50+ categories, each with 50+ subcategories, with each of those 2500 elements having multiple triggering conditions, inside your code? If there are 3 conditions (on average) for each different subcategory, that would be 7500 relationships. I wouldn't want to be responsible for maintaining that list, particularly within production source code. Let us assume you have access to an RDBMS from SAS (at least ODBC, right?). You can place an appropriate table out there, something like (monospace font highly recommended)

id cat subcat 3566 A 1 4863 ZZ 33 9327 YS 56 3566 U 23

And so on. Make the primary key all three fields.

Having the data in an RDBMS like this means you can have others maintain all of the mappings via an ASP page or some other simple access mechanism. You know they will eventually want to change them, so let's make the future comfortable.

Now to the coding. We need to build a BIG matrix that says what is to be set by what values. It will require (COUNT(DISTINCT cat)) x (MAX(id) -MIN(id)+1) x 8 bytes of memory. For a range of 10,000 id values and 50 categories, that would be about 4MB, well within any reasonable limits. Multiplying the range by 10 (100,000 difference between MIN and MAX) means 40MB, which any modern PC should easily handle. Make the difference 1,000,000 and you will need a nice PC (>512MB RAM) to prevent paging.

Let's get on with it. I will assume the above table is called orcldata.cat_table:

PROC SQL STIMER; SELECT COUNT(DISTINCT cat), MAX(id), MIN(id) INTO :catcount, :maxarray, :minarray FROM orcldata.cat_table; %* this is needed to build the big lookup matrix; SELECT cat, MAX(subcat) INTO :catlist SEPARATED BY ' ', :arraysize SEPARATED BY ' ' FROM orcldata.cat_table GROUP BY cat; %* this is needed to build an ARRAY for each category; QUIT;

%MACRO arraylist(cats, sizes); %* create the ARRAY statements for each category; %LOCAL i token lval; %LET i = 1; %LET token = %SCAN(&cats,&i); %DO %WHILE (%LENGTH(&token)); %LET lval = %SCAN(&sizes,&i); ARRAY category_&token.[0:&lval] category_&token._0-category_&token._&lval (%EVAL(&lval+1)*0); %LET i = %EVAL(&i + 1); %LET token = %SCAN(&cats,&i); %END; %MEND;

%MACRO indexcalc(cats); %* perform the mapping from cat to the column number in the lookup matrix; %LOCAL i token; %LET i = 1; %LET token = %SCAN(&cats,&i); %DO %WHILE (%LENGTH(&token)); WHEN ("&token") ival = &i; %LET i = %EVAL(&i + 1); %LET token = %SCAN(&cats,&i); %END; %MEND;

%MACRO arrayassign(cats); %* do the actual lookup and assign the appropriate subcategory value, with bounds checking; %LOCAL i token; %LET i = 1; %LET token = %SCAN(&cats,&i); %DO %WHILE (%LENGTH(&token)); IF &minarray LE id LE &maxarray THEN category_&token.[lookups[&i,id]] = 1; %LET i = %EVAL(&i + 1); %LET token = %SCAN(&cats,&i); %END; %MEND;

%MACRO droplist(cats); %* drop unneeded variables ; %LOCAL i token; %LET i = 1; %LET token = %SCAN(&cats,&i); %DO %WHILE (%LENGTH(&token)); IF &minarray LE id LE &maxarray THEN category_&token._0 %LET i = %EVAL(&i + 1); %LET token = %SCAN(&cats,&i); %END; %MEND;

DATA old; ARRAY lookups[&catcount,&minarray:&maxarray] _TEMPORARY_ (%EVAL (&catcount*(&maxarray - &minarray + 1))*0); %* initialize all values to zero ; %arraylist(&catlist,&arraysize); SET new; eof = 0; IF (_N_ EQ 1) THEN DO WHILE (NOT eof); SET orcldata.cat_table END=eof; SELECT (cat); %indexcalc(&catlist) OTHERWISE ABORT ABEND; %* table must have been edited between PROC SQL execution and now; END; lookups(ival,id) = subcat; END; DO i = 1 TO 30; %arrayassign(&catlist) END; DROP eof ival i %droplist(&catlist); RUN;

This transmission may contain information that is privileged, confidential and/or exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any disclosure, copying, distribution, or use of the information contained herein (including any reliance thereon) is STRICTLY PROHIBITED. If you received this transmission in error, please immediately contact the sender and destroy the material in its entirety, whether in electronic or hard copy format. Thank you.


Back to: Top of message | Previous page | Main SAS-L page