Date: Tue, 29 Jul 2003 15:12:11 -0400
Reply-To: Sigurd Hermansen <HERMANS1@WESTAT.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Sigurd Hermansen <HERMANS1@WESTAT.COM>
Subject: Re: How to improve SAS performance
Content-Type: text/plain
I think of efficiency of SAS programs as something dependent on but distinct
from good database design. You are preaching to another member of the choir
when you advocate appropriate uses of 'stack data structures' (A.K.A. 'fact
tables' in data warehousing). If you look up 'Relational Schemes' and '4GL'
in Lex Jansen's excellent SUGI archives (see concurrent thread), you'll find
prior SUGI papers on that subject.
Relational databases and fact tables have interesting properties that have
nothing directly to do with intuitive appeal and simplicity of programming,
but do tie into Dale's complaints about premature collapsing of data
structures. Since a relational database scheme preserves critical
information from data entry forms (DEF's), a database programmer can
transform that information into any data structure an analyst might need.
Sig
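
The point above, that a relational layout keeps enough detail to be reshaped into whatever an analyst needs, could be sketched roughly as follows. This is only an illustration: the FACTS table, its columns (id, item, value), and the sample rows are hypothetical and do not come from the thread.

```sas
/* Hypothetical fact table: one row per person and form item. */
data facts;
  input id item $ value;
  datalines;
1 height 170
1 weight 65
2 height 182
2 weight 80
;
run;

/* BY-group processing in PROC TRANSPOSE needs the data ordered by id. */
proc sort data=facts;
  by id item;
run;

/* Reshape to one analysis-ready row per person, one column per item. */
proc transpose data=facts out=wide (drop=_name_);
  by id;
  id item;
  var value;
run;
```

Because the long table is left untouched, the same rows could just as easily be summarized or reshaped another way later, which is harder once the data have been collapsed.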
-----Original Message-----
From: Aaron Moynahan [mailto:aaron.moynahan@VERIZON.NET]
Sent: Monday, July 28, 2003 4:59 PM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: How to improve SAS performance
In your measurement of efficiency, are you including all the hours that would
be required to understand and safely modify code that was written to take
advantage of all the arcane, cryptic, undocumented quirks that exist in SAS
and that over 99.99 in 100 SAS programmers are not aware of? If you are, then
I might take that bet.
I actually wanted to see if anyone wanted to talk about using stack data
structures in dataset programming where you have to compute measures for
individuals based on setting events and summing, counting, or creating
binary flags for other events within some time frame from an index event. In
my opinion, using these structures is both simple and efficient, since you
can usually do your processing in a single pass, and your initializing
between individuals just involves setting the top marker to zero and zeroing
out your metrics. The single pass produces a one-record-per-individual
dataset that you can then use standard SAS procedures on to conduct further
analysis and data reduction. A single pass over the large dataset with
simple-minded code that's easy to maintain and has no cryptic quirks would
be my entry into the efficiency competition.
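
A minimal sketch of the single-pass stack approach described above, under stated assumptions: a hypothetical EVENTS dataset sorted by person and date, one row per event, with event_type 'INDEX' marking the index event. The dataset and variable names, the 30-day window, and the stack capacity of 100 are all illustrative and not taken from the original posts.

```sas
/* Hypothetical input: one row per event, already sorted by id and date. */
data events;
  input id event_date :date9. event_type $ amount;
  format event_date date9.;
  datalines;
1 01JAN2003 INDEX 0
1 15JAN2003 VISIT 100
1 20MAR2003 VISIT 50
2 05FEB2003 VISIT 30
2 10FEB2003 INDEX 0
2 20FEB2003 VISIT 70
;
run;

data person (keep=id index_date n_visits total_amount any_visit);
  set events;
  by id;

  /* Temporary arrays keep their values across observations. */
  array s_date   {100} _temporary_;
  array s_amount {100} _temporary_;
  retain top 0 index_date;
  format index_date date9.;

  /* New individual: reset the top marker and the index date. */
  if first.id then do;
    top = 0;
    index_date = .;
  end;

  if event_type = 'INDEX' then index_date = event_date;
  else if top < dim(s_date) then do;
    /* Push the event onto the stack (assumes at most 100 per person). */
    top + 1;
    s_date{top}   = event_date;
    s_amount{top} = amount;
  end;

  /* Last row for the individual: pop the stack and write one record. */
  if last.id then do;
    n_visits = 0;
    total_amount = 0;
    do while (top > 0);
      /* Count only events 0 to 30 days after the index event. */
      if index_date ne . and 0 <= s_date{top} - index_date <= 30 then do;
        n_visits + 1;
        total_amount + s_amount{top};
      end;
      top = top - 1;
    end;
    any_visit = (n_visits > 0);
    output;
  end;
run;
```

The large dataset is read once, and PERSON holds one record per individual that can go straight into PROC MEANS or PROC FREQ. Holding events on the stack until the last row lets events that precede the index event be judged against it without a second pass.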
"Sigurd Hermansen" <HERMANS1@WESTAT.COM> wrote in message
news:9B501B3774931C469BCCCC021BE537228EE620@remailnt2-re01.westat.com...
> I'll bet a lot more than $1 that, given any interesting SAS program
> performance problem, Paul can find and correct more inefficiencies in
> SAS 'coding' than you or anyone else on the 'L. Master Ian, Michael,
> many others, and I have tried to best him in the past. About the best
> we can do is write SAS solutions that do not perform substantially
> worse than Paul's. I'll post another in a series of examples of that
> in the near future.
>
> As for the 'too eager consultant' taunt, I don't see that sticking to
> Paul. If anything, Paul offers his outstanding solutions to SAS
> programming problems too freely on SAS-L. In fact, I chided him not
> long back for completing what amounted to a several-day consulting
> assignment gratis in an hour. A more cunning consultant would offer
> only hints and suggestions.
>
> Sig
>
> -----Original Message-----
> From: Aaron Moynahan [mailto:aaron.moynahan@VERIZON.NET]
> Sent: Friday, July 25, 2003 5:22 PM
> To: SAS-L@LISTSERV.UGA.EDU
> Subject: Re: How to improve SAS performance
>
>
> I would bet you either $1 or a soda that the biggest bang for the
> buck lies in finding inefficiencies in the SAS coding.
>
>
>
>
> "Jack Hamilton" <JackHamilton@FIRSTHEALTH.COM> wrote in message
> news:sf214552.056@SLCM02.firsthealth.com...
> > It appears to me that Paul *did* look at I/O first.
> >
> > You didn't give enough information for him, or anyone else, to say
> > much more than he did without going waaay beyond his data.
> >
> > I don't understand this recent trend to insult people who are giving
> > answers. If you don't like what you hear, clarify what you want, or
> > just ignore the answer.
> >
> >
> >
> >
> > --
> > JackHamilton@FirstHealth.com
> > Manager, Technical Development
> > Metrics Department, First Health
> > West Sacramento, California USA
> >
> > >>> "Aaron Moynahan" <aaron.moynahan@VERIZON.NET> 07/25/2003 1:41 PM >>>
> > Paul Dorfman you are a meathead.
> >
> > The problem as it was stated, without details, claimed that whatever
> > the unknown parameters are, this process is taking days to run. If
> > the first thing you do is something other than look at file I/O,
> > then you understand neither SAS nor computing in general.
> >
> > Two jobs ago I fixed a problem like this for a large investment
> > company that was running a process to evaluate their direct
> > marketing programs. The process had expanded and was taking 5-6 days
> > to run. What had happened was that a statistician wrote a macro that
> > computed metrics one at a time for a small dataset. The macro worked
> > fine on a little sample, but it just did not scale very well. For
> > this application I solved the problem by understanding that the
> > desired statistics could be computed with summary data and also that
> > you could compute multiple metrics in a single procedure call. This
> > simple fix cut the process down to about 1 day.
> >
> > I wonder what the consultant from hell would have come up with.
> >
> >
> >
> > > > > I'm having problems with SAS performance because it takes days
> > > > > to finish some processes. Is this common with large datasets?
> > > > >
> > > > > I'm working with datasets of about 20 million rows and many
> > > > > columns on a windows NT platform.
> >
> >
> >
> > "Paul Dorfman" <paul_dorfman@HOTMAIL.COM> wrote in message
> > news:BAY2-F60icL4UdmWrPc00004dec@hotmail.com...
> > > >From: Aaron Moynahan <aaron.moynahan@VERIZON.NET>
> > > >
> > > >20 million records is a lot for a PC.
> > >
> > > Aaron,
> > >
> > > Unconstrained, this statement is about as informative as "the
> > > average body temperature across the hospital is 37.0 Centigrade".
> > > Much depends on the record length, what is to be done, and system
> > > capacity and/or configuration. For example, I started this
> > > response, then submitted the stuff shown in the log excerpt below,
> > > and kept typing. I could not progress too far before the job was
> > > finished (admittedly, I am a lousy typist <g>). All this is done on
> > > a 2*(933MHz PIII) PC clone under XP Pro with 1 GB of RAM running
> > > SAS V9.1. In other words, quite a slouch by today's measures. The
> > > only performance-enhancing thingy here is I/O to a physically
> > > separate 30GB disk.
> > >
> > > 74   libname user 'h:\' ;
> > > NOTE: Libref USER was successfully assigned as follows:
> > >       Engine:        V9
> > >       Physical Name: h:\
> > > 75   data a ;
> > > 76      retain a 1 b 2 c 3 d 4 ;
> > > 77      do id = 1 to 10 ;
> > > 78         do a = 1 to 2e6 ;
> > > 79            output ;
> > > 80         end ;
> > > 81      end ;
> > > 82   run ;
> > >
> > > NOTE: The data set USER.A has 20000000 observations and 5 variables.
> > > NOTE: DATA statement used (Total process time):
> > >       real time           26.17 seconds
> > >       user cpu time       4.62 seconds
> > >       system cpu time     9.75 seconds
> > >       Memory              88k
> > >
> > > 83   proc means sum ;
> > > 84      class id ;
> > > 85   run ;
> > >
> > > NOTE: There were 20000000 observations read from the data set USER.A.
> > > NOTE: PROCEDURE MEANS used (Total process time):
> > >       real time           40.09 seconds
> > >       user cpu time       44.26 seconds
> > >       system cpu time     9.76 seconds
> > >       Memory              6360k
> > >
> > > This indicates that a simple analysis on a 20-million strong file
> > > is not beyond the capabilities provided by an average PC equipped
> > > with at least two physical drives, one being reserved strictly for
> > > SAS I/O. I would even assert that the latter is a must for any PC
> > > anticipated to do a fair amount of pure I/O.
> > >
> > > Of course, even with the extra drive, the picture would be quite
> > > different if I had 200 variables in the file, and/or had to reorder
> > > the file incessantly back and forth (which would indicate too much
> > > stream-of-consciousness programming and/or poor design), or if the
> > > files were heavily shared in a multi-user system. Besides,
> > > intensive I/O is not what most preconfigured PCs are sold for. In
> > > other words, when you then go on to say
> > >
> > > >The best advice that I can give you is to try to minimize the
> > > >number of times that SAS has to read and write the data. If, for
> > > >instance, you are computing metrics for individual people,
> > > >depending on the problem you might benefit by initially sorting
> > > >the data by person id and date. If you do this you can probably
> > > >design programs using array processing that don't require you to
> > > >do so many passes of the entire dataset. Once you get a
> > > >1-record-per-person dataset you can do your data analysis with
> > > >standard SAS procedures. In addition, another approach is to make
> > > >smaller datasets that meet very specific criteria and then
> > > >compute your metrics and string together these data sets into one
> > > >or more datasets for your final analysis.
> > >
> > > I strongly concur on ALL of these succinctly presented points,
> > > every single one hitting the respective nail right on the head.
> > >
> > > Kind regards,
> > > ===============
> > > Paul M. Dorfman
> > > Jacksonville, FL
> > > ===============
> > > >
> > > >"Paolo ORIFICI" <Orifici@LACAJA.COM.AR> wrote in message
> > > >news:sf20ff1b.001@lacaja.com.ar...
> > > > > HI list members,
> > > > >
> > > > > I'm having problems with SAS performance because it takes days
> > > > > to finish some processes. Is this common with large datasets?
> > > > >
> > > > > I'm working with datasets of about 20 million rows and many
> > > > > columns on a windows NT platform.
> > > > >
> > > > > My question is what is the best and most performing
> > > > > architecture to sort and work with large datasets. I was asking
> > > > > myself if using Unix or Linux is better than windows NT, if it
> > > > > depends on the server (we doubled the hard disk memory but it
> > > > > didn't make a substantial difference), if we need more
> > > > > processor, etc.
> > > > >
> > > > > Any experience sharing or suggestion is very welcomed.
> > > > >
> > > > > TIA,
> > > > >
> > > > > Paolo
> > >