Date: Sat, 10 Nov 2007 21:20:59 -0700
Reply-To: Alan Churchill <savian001@GMAIL.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Alan Churchill <savian001@GMAIL.COM>
Subject: Re: pass-through data set once, and once only. Why
Here's my opinion on this:
1. Lots of variables is a really, really bad idea (did I emphasize really
enough?). Each one grows the dataset by that variable's length times the
number of records (people can clarify the actual system impact, but that is
a general way of looking at it). SAS is much better at going deep than wide,
as are most databases.
2. Generally speaking, passing through the data once is a good idea, but not
at the cost of massive complexity. What you seem to have here is a general
idea taken to an extreme where it no longer makes sense and is detrimental.
3. As you note, there is no restart logic possible. That is bad.
4. The model will only utilize multiple CPUs if you architect it that way.
It is more than feasible to make that happen in either Base SAS or MP CONNECT.
5. Look at your system architecture. Make sure the reads and writes happen on
separate volumes and that WORK is allocated on a scratch disk. There are
loads of good papers that SAS has published in this area. I have found that
many of the minor option tweaks won't get you nearly as much as better
processing logic.
6. Get some system measurements. My guess is that you are thrashing your
systems.
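To put rough numbers on point 1, here is a back-of-envelope sketch in Python.
It uses the counts quoted in the original message (5219 variables carried,
303 needed, 37 in the single test model, roughly 220 million records) and
assumes every variable is an uncompressed 8-byte numeric, which is the
default SAS numeric length. Character variables, compression, and page
overhead would all change the figures, and the function name is made up for
illustration.

```python
# Back-of-envelope data volume: variables x records x bytes per value.
BYTES_PER_NUMERIC = 8    # assumption: default SAS numeric length, uncompressed
RECORDS = 220_000_000    # ">220 mm" records, taken as exactly 220 million


def row_store_bytes(n_vars, n_records=RECORDS, width=BYTES_PER_NUMERIC):
    """Approximate bytes moved in one full pass over the data."""
    return n_vars * n_records * width


SCENARIOS = [
    ("all carried variables", 5219),
    ("variables the models need", 303),
    ("single most complex model", 37),
]

for label, n_vars in SCENARIOS:
    terabytes = row_store_bytes(n_vars) / 1e12
    print(f"{label:28s} {n_vars:5d} vars  ~{terabytes:6.3f} TB per pass")
```

Under those assumptions, one pass over the wide data moves about 17 times
as many bytes as one pass over the 303 needed variables, so several passes
over narrow data can still cost less I/O than a single pass over wide data.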
Hope that helps. Some other performance whiz on here will chime in but that
is my experience.
Alan
Alan Churchill
Savian
www.savian.net
-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of
sophe88@yahoo.com
Sent: Saturday, November 10, 2007 9:01 PM
To: SAS-L@LISTSERV.UGA.EDU
Subject: pass-through data set once, and once only. Why
My friend Steve left the company. His boss, who is also my boss, asked
me to take over his group and combine it with mine. I began to review
all the subgroups he has.
One group, made up of a business analyst, a (SAS) programmer, and a senior
statistician, scores 32 models each month against >220 mm
records and produces all the reports, QC...
After reading all the ~1000 original variables (only 303 are needed for
the models, but they input all variables anyway), they use 32 %include
statements to run all models in one data step. To avoid variable name
collisions, they recoded so that no model uses a name that another model
uses. FICO is a good example. Each model uses FICO, so there are ~40
FICO variables: FICO, FICOX, FICO2... Before the KEEP statement is finally
applied, 5219 variables are carried in the data step. So they run it over
250 computers in a grid fashion, each scoring ~1 mm records. Eventually
they set the data sets back together into one big file.
Apparently 5219 variables * >220 mm records calls for those 250 computers.
But if one piece fails, the whole process has to restart from the
beginning, which takes ~7 calendar days. Besides, I need to justify the
cost of 250 computers.
The chief defender is the mega-programmer who engineered the whole
thing. She argued that to achieve high efficiency running SAS
programs, one must make sure to pass through data sets as few times as
possible. Given the >220 mm records, she had no better choice
than passing through the data set just once (over 250 computers that
is 250 times, but OK).
I took the scoring program to a Unix box and test-ran the most complex
model, just 1 model using 37 variables. I kept only the variables
needed, and the job finished in 2 hours 21 minutes. I noticed I was
only able to use 1 CPU although the Unix server has 6 (why?).
Now if I ask people to build the scoring as 32 modules, with only the 303
variables input at the center (like a spoke or star structure),
that does not need much work on the %include files. That will probably
eliminate 1-2 jobs, 250 computers... It will certainly pass through
the data sets many times. I do not have a clear and complete picture of
this "pass-through" efficiency argument. How respectable is it? Anyone who
can consult with me internally on the subject is suspected of carrying some
political bias, regardless of what they have to say. So I thought maybe
this is a good place to ask. They are all hard workers with good skills.
I don't want to lose them because of my limited knowledge of their
design. Any feedback is greatly appreciated.
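One thing the modular layout would make possible is restart logic. The
following is only a sketch of that idea in Python, not the team's design:
each of the 32 models runs as its own job, a marker file records
completion, and a rerun skips whatever already finished. The scoring
function is a stand-in (in practice each job would launch a SAS program),
and every name here is hypothetical.

```python
import tempfile
from pathlib import Path


def run_models(model_ids, score_one, marker_dir):
    """Run each model as its own job; skip models with a done-marker."""
    marker_dir = Path(marker_dir)
    marker_dir.mkdir(parents=True, exist_ok=True)
    ran, failed = [], []
    for model_id in model_ids:
        marker = marker_dir / f"model_{model_id:02d}.done"
        if marker.exists():
            continue  # finished in an earlier run, nothing to redo
        try:
            score_one(model_id)
        except Exception:
            failed.append(model_id)  # leave no marker so a rerun retries it
            continue
        marker.touch()
        ran.append(model_id)
    return ran, failed


def make_flaky_scorer(fail_once):
    """Stand-in scorer that fails the first time for the given model ids."""
    pending = set(fail_once)

    def score_one(model_id):
        if model_id in pending:
            pending.discard(model_id)
            raise RuntimeError(f"model {model_id} failed")

    return score_one


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        scorer = make_flaky_scorer({7})
        first_ran, first_failed = run_models(range(1, 33), scorer, tmp)
        second_ran, second_failed = run_models(range(1, 33), scorer, tmp)
        print("first run failed:", first_failed)
        print("second run reran only:", second_ran)
```

With this structure a failure in one model costs a rerun of that one model
rather than the whole ~7-day process.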
PD