LISTSERV at the University of Georgia
Date:         Sat, 10 Nov 2007 21:20:59 -0700
Reply-To:     Alan Churchill <savian001@GMAIL.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Alan Churchill <savian001@GMAIL.COM>
Subject:      Re: pass-through data set once, and once only. Why
Comments: To: "sophe88@yahoo.com" <sophe88@YAHOO.COM>
In-Reply-To:  <1194753641.679178.265720@c30g2000hsa.googlegroups.com>
Content-Type: text/plain; charset="iso-8859-1"

Here's my opinion on this:

1. Carrying lots of variables is a really, really bad idea (did I emphasize really enough?). Each one expands the dataset by roughly that variable's width * the number of records (people can clarify the actual system impact, but that is a general way of looking at it). SAS is much better at going deep than wide, as are most databases.

2. Generally speaking, passing through the data once is a good idea, but not at the cost of massive complexity. What you seem to have here is a general idea taken to an extreme where it no longer makes sense and is detrimental.

3. As you note, there is no restart logic possible. That is bad.

4. The model will only utilize multiple CPUs if you architect it that way. It is more than feasible to make that happen either in Base SAS or with MP CONNECT (part of SAS/CONNECT).

5. Look at your system architecture. Make sure the reads and writes happen on separate volumes and that WORK is allocated on a scratch disk. There are loads of good papers that SAS has published in this area. I have found that many of the minor option tweaks won't get you nearly as much as better processing logic.

6. Get some system measurements. My guess is that you are thrashing your systems.
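
To put rough numbers on point 1, here is a back-of-the-envelope sketch. It is written in Python purely for the arithmetic; it is not part of the SAS job. The big assumption is that every variable is an 8-byte numeric stored uncompressed, which will not hold exactly for real data. The function name `estimated_bytes` is made up for this illustration. The counts (5219, 303, 37 variables, 220 mm records) come from the original post.

```python
# Rough, uncompressed size estimate: rows x variables x bytes per value.
# ASSUMPTION: every variable is an 8-byte numeric. Real data sets mix
# character and numeric variables and may be compressed.
BYTES_PER_VALUE = 8
ROWS = 220_000_000  # ">220 mm records" from the original post


def estimated_bytes(n_vars: int, rows: int = ROWS, width: int = BYTES_PER_VALUE) -> int:
    """Return the approximate uncompressed size in bytes (hypothetical helper)."""
    return n_vars * rows * width


if __name__ == "__main__":
    cases = [
        ("all carried variables", 5219),
        ("variables the models need", 303),
        ("single most complex model", 37),
    ]
    for label, n_vars in cases:
        gb = estimated_bytes(n_vars) / 1e9
        print(f"{label:28s} {n_vars:5d} vars  ~{gb:,.0f} GB")
```

Under that assumption the three cases come to roughly 9.2 TB, 533 GB and 65 GB, so carrying all 5219 variables moves about 17 times more data than the 303 the models actually need.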

Hope that helps. Some other performance whiz on here will chime in but that is my experience.

Alan

Alan Churchill Savian www.savian.net

-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of sophe88@yahoo.com
Sent: Saturday, November 10, 2007 9:01 PM
To: SAS-L@LISTSERV.UGA.EDU
Subject: pass-through data set once, and once only. Why

My friend Steve left the company. His boss, who is also my boss, asked me to take over his group and combine it with mine. I began to review all the subgroups he has.

One group (a business analyst, a SAS programmer, and a senior statistician) scores 32 models each month against >220 mm records and produces all the reports, QC, etc.

After reading all the ~1000 original variables (only 303 are needed for the models, but they input all variables anyway), they use 32 %include statements to run all models in one data step. To avoid variable collisions, they recoded so that no model uses a name that another model uses. FICO is a good example: each model uses FICO, so there are ~40 variants (FICO, FICOX, FICO2, ...). Before the KEEP statement is finally imposed, 5219 variables are carried in the data step. So they run it over 250 computers in a grid fashion, each scoring ~1 mm records. Eventually they set the data sets back together into one big file.

Apparently 5219 variables * >220 mm records calls for those 250 computers. But if one piece fails, the whole process has to restart from scratch, which takes ~7 calendar days. Besides, I need to justify the cost of 250 computers.

The chief defender is the mega-programmer who engineered the whole thing. She argued that to achieve high efficiency running SAS programs, one must pass through data sets as few times as possible. Given the >220 mm records, she had no better choice than to pass through the data set just once (over 250 computers that is 250 times, but OK).

I took the scoring program to a Unix box and test-ran the most complex model, just 1 model using 37 variables. I kept only the variables needed and finished the job in 2 hours 21 minutes. I noticed I was only able to use 1 CPU although the Unix server has 6 (why?).

Now if I ask people to build the scoring in 32 modules, with only the 303 variables input at the center (like a hub-and-spoke or star structure), that does not need much work on the %include files. That will probably eliminate 1-2 jobs, 250 computers... It will certainly pass through the data sets many times. I do not have a clear and complete picture of this "pass-through" efficiency argument. How respectable is it? Anyone I could consult internally on the subject would be suspected of carrying some political bias, regardless of what they have to say. So I thought maybe this is a good place to ask. They are all hard workers with good skills. I don't want to lose them because of my lack of knowledge of their design. Any feedback is greatly appreciated.
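
One possible shape for the modular idea, sketched in Python purely for illustration (the real jobs would be SAS programs, and nothing here reflects the team's actual code). Each model runs as its own step and is recorded in a small checkpoint file once it finishes, so a failure only costs the model that failed. The names `run_models`, `score_fn` and the JSON state file are all hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, List


def run_models(models: List[str], score_fn: Callable[[str], None], state_file: Path) -> List[str]:
    """Score each model once, skipping any already recorded in state_file.

    Returns the names scored during this call. If score_fn raises, the
    exception propagates, but models completed earlier stay recorded.
    """
    done = set(json.loads(state_file.read_text())) if state_file.exists() else set()
    ran = []
    for name in models:
        if name in done:
            continue  # finished in an earlier run; do not redo it
        score_fn(name)  # ASSUMPTION: scores one model from only the variables it needs
        done.add(name)
        # Persist progress after every model so a restart resumes from here.
        state_file.write_text(json.dumps(sorted(done)))
        ran.append(name)
    return ran


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        state = Path(tmp) / "scored_models.json"
        models = [f"model_{i:02d}" for i in range(1, 33)]  # 32 models, as in the post
        first = run_models(models, lambda name: None, state)
        second = run_models(models, lambda name: None, state)
        print(len(first), "models scored;", len(second), "left to redo")
```

The trade-off is the one raised above: the input is read once per model instead of once overall, in exchange for restartability and steps that can run independently on separate CPUs or machines.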

PD

