LISTSERV at the University of Georgia
Date:         Tue, 29 Jul 2003 15:12:11 -0400
Reply-To:     Sigurd Hermansen <HERMANS1@WESTAT.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Sigurd Hermansen <HERMANS1@WESTAT.COM>
Subject:      Re: How to improve SAS performance
Comments: To: Aaron Moynahan <aaron.moynahan@VERIZON.NET>
Content-Type: text/plain

I think of efficiency of SAS programs as something dependent on but distinct from good database design. You are preaching to another member of the choir when you advocate appropriate uses of 'stack data structures' (A.K.A. 'fact tables' in data warehousing). If you look up 'Relational Schemes' and '4GL' in Lex Jansen's excellent SUGI archives (see concurrent thread), you'll find prior SUGI papers on that subject.

Relational databases and fact tables have interesting properties that have nothing directly to do with intuitive appeal and simplicity of programming, but do tie into Dale's complaints about premature collapsing of data structures. Since a relational database scheme preserves critical information from data entry forms (DEF's), a database programmer can transform that information into any data structure an analyst might need.
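As a minimal sketch of that last point, assume a hypothetical fact table RESPONSES with one row per person and form item (the dataset, its variables ID, ITEM, and VALUE, and the sample values are illustrative assumptions, not data from this thread). The same rows can be reshaped into a one-record-per-person layout or summarized by item without going back to the forms:

```sas
/* Hypothetical fact table: one row per person and form item. */
data responses;
   input id item $ value;
   datalines;
1 height 170
1 weight 65
2 height 182
2 weight 80
;
run;

proc sort data=responses;
   by id;
run;

/* View 1: a wide, one-record-per-person layout for an analyst. */
proc transpose data=responses out=wide (drop=_name_);
   by id;
   id item;      /* values of ITEM become the new column names */
   var value;
run;

/* View 2: the same facts summarized by item. */
proc sql;
   create table item_means as
   select item, mean(value) as mean_value, count(*) as n
   from responses
   group by item;
quit;
```

Going the other way, from a prematurely collapsed wide table back to the individual facts, is possible only if nothing was dropped or aggregated when the table was collapsed.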

Sig

-----Original Message-----
From: Aaron Moynahan [mailto:aaron.moynahan@VERIZON.NET]
Sent: Monday, July 28, 2003 4:59 PM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: How to improve SAS performance

In your measurement of efficiency, are you including all the hours that would be required to understand and safely modify code that was written to take advantage of all the arcane, cryptic, undocumented quirks in SAS that more than 99.99 in 100 SAS programmers are not aware of? If you are, then I might take that bet.

I actually wanted to see if anyone wanted to talk about using stack data structures in dataset programming where you have to compute measures for individuals based on setting events and summing, counting, or creating binary flags for other events within some time frame from an index event. In my opinion, using these structures is both simple and efficient, since you can usually do your processing in a single pass, and your initializing between individuals just involves setting the top marker to zero and zeroing out your metrics. The single pass produces a one-record-per-individual dataset that you can then use standard SAS procedures on to conduct further analysis and data reduction. A single pass over the large dataset, with simple-minded code that's easy to maintain and has no cryptic quirks, would be my entry in the efficiency competition.
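A minimal sketch of the single-pass stack approach described above, under stated assumptions: a hypothetical dataset EVENTS with variables ID, EVENT_DATE, and EVENT_TYPE, an index event marked as 'INDEX', a 30-day window on either side of the index date, and at most 1000 events per person. None of these names or limits come from the original post.

```sas
/* Hypothetical input: one row per event, several rows per person. */
data events;
   input id event_date :date9. event_type $;
   format event_date date9.;
   datalines;
1 01JAN2003 VISIT
1 15JAN2003 INDEX
1 20JAN2003 VISIT
1 01APR2003 VISIT
2 10FEB2003 INDEX
2 12FEB2003 VISIT
;
run;

proc sort data=events;
   by id event_date;
run;

/* Single pass: push each person's event dates onto a stack, */
/* then compute the metrics when the person's last row is read. */
data person (keep=id index_date n_near any_near);
   set events;
   by id;
   array stack[1000] _temporary_;   /* assumed upper bound per person */
   retain top 0 index_date;
   format index_date date9.;

   if first.id then do;             /* reset between individuals */
      top = 0;
      index_date = .;
   end;

   /* If a person has several INDEX rows, the latest one wins. */
   if event_type = 'INDEX' then index_date = event_date;
   else do;
      top + 1;                      /* push */
      stack[top] = event_date;
   end;

   if last.id then do;
      n_near = 0;                   /* events within 30 days of index */
      if not missing(index_date) then do i = 1 to top;
         if abs(stack[i] - index_date) <= 30 then n_near + 1;
      end;
      any_near = (n_near > 0);      /* binary flag */
      output;                       /* one record per individual */
   end;
run;
```

The stack is needed here because events that precede the index event cannot be classified until the index date is known; holding them in a temporary array avoids a second pass over the data.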

"Sigurd Hermansen" <HERMANS1@WESTAT.COM> wrote in message news:9B501B3774931C469BCCCC021BE537228EE620@remailnt2-re01.westat.com...
> I'll bet a lot more than $1 that, given any interesting SAS program performance problem, Paul can find and correct more inefficiencies in SAS 'coding' than you or anyone else on the 'L. Master Ian, Michael, many others, and I have tried to best him in the past. About the best we can do is write SAS solutions that do not perform substantially worse than Paul's.
> I'll post another in a series of examples of that in the near future.
>
> As for the 'too eager consultant' taunt, I don't see that sticking to Paul. If anything, Paul offers his outstanding solutions to SAS programming problems too freely on SAS-L. In fact, I chided him not long back for completing what amounted to a several-day consulting assignment gratis in an hour. A more cunning consultant would offer only hints and suggestions.
>
> Sig
>
> -----Original Message-----
> From: Aaron Moynahan [mailto:aaron.moynahan@VERIZON.NET]
> Sent: Friday, July 25, 2003 5:22 PM
> To: SAS-L@LISTSERV.UGA.EDU
> Subject: Re: How to improve SAS performance
>
> I would bet you either a $1 or a soda that the biggest bang for the buck lies in finding inefficiencies in the SAS coding.
>
> "Jack Hamilton" <JackHamilton@FIRSTHEALTH.COM> wrote in message news:sf214552.056@SLCM02.firsthealth.com...
> > It appears to me that Paul *did* look at I/O first.
> >
> > You didn't give enough information for him, or anyone else, to say much more than he did without going waaay beyond his data.
> >
> > I don't understand this recent trend to insult people who are giving answers. If you don't like what you hear, clarify what you want, or just ignore the answer.
> >
> > --
> > JackHamilton@FirstHealth.com
> > Manager, Technical Development
> > Metrics Department, First Health
> > West Sacramento, California USA
> >
> > >>> "Aaron Moynahan" <aaron.moynahan@VERIZON.NET> 07/25/2003 1:41 PM >>>
> > Paul Dorfman you are a meathead.
> >
> > The problem, as it was stated without details, claimed that whatever the unknown parameters are, this process is taking days to run. If the first thing you do is something other than look at file I/O, then you understand neither SAS nor computing in general.
> >
> > Two jobs ago I fixed a problem like this for a large investment company that was running a process to evaluate their direct marketing programs. The process had expanded and was taking 5-6 days to run. What had happened was that a statistician wrote a macro that computed metrics one at a time for a small dataset. The macro worked fine on a little sample, but it just did not scale very well. For this application I solved the problem by understanding that the desired statistics could be computed with summary data, and also that you could compute multiple metrics in a single procedure call. This simple fix cut the process down to about 1 day.
> >
> > I wonder what the consultant from hell would have come up with.
> >
> > > > > I'm having problems with SAS performance because it takes days to finish some processes. Is this common with large datasets?
> > > > >
> > > > > I'm working with datasets of about 20 million rows and many columns on a windows NT platform.
> >
> > "Paul Dorfman" <paul_dorfman@HOTMAIL.COM> wrote in message news:BAY2-F60icL4UdmWrPc00004dec@hotmail.com...
> > > >From: Aaron Moynahan <aaron.moynahan@VERIZON.NET>
> > > >
> > > >20 million records is a lot for a PC.
> > >
> > > Aaron,
> > >
> > > Unconstrained, this statement is about as informative as "the average body temperature across the hospital is 37.0 Centigrade". Much depends on the record length, what is to be done, and system capacity and/or configuration. For example, I started this response, then submitted the stuff shown in the log excerpt below, and kept typing. I could not progress too far before the job was finished (admittedly, I am a lousy typist <g>). All this is done on a 2*(933MHz PIII) PC clone under XP Pro with 1 GB of RAM running SAS V9.1. In other words, quite a slouch by today's measures. The only performance-enhancing thingy here is I/O to a physically separate 30GB disk.
> > >
> > > 74   libname user 'h:\' ;
> > > NOTE: Libref USER was successfully assigned as follows:
> > >       Engine:        V9
> > >       Physical Name: h:\
> > > 75   data a ;
> > > 76      retain a 1 b 2 c 3 d 4 ;
> > > 77      do id = 1 to 10 ;
> > > 78         do a = 1 to 2e6 ;
> > > 79            output ;
> > > 80         end ;
> > > 81      end ;
> > > 82   run ;
> > >
> > > NOTE: The data set USER.A has 20000000 observations and 5 variables.
> > > NOTE: DATA statement used (Total process time):
> > >       real time           26.17 seconds
> > >       user cpu time       4.62 seconds
> > >       system cpu time     9.75 seconds
> > >       Memory              88k
> > >
> > > 83   proc means sum ;
> > > 84      class id ;
> > > 85   run ;
> > >
> > > NOTE: There were 20000000 observations read from the data set USER.A.
> > > NOTE: PROCEDURE MEANS used (Total process time):
> > >       real time           40.09 seconds
> > >       user cpu time       44.26 seconds
> > >       system cpu time     9.76 seconds
> > >       Memory              6360k
> > >
> > > This indicates that a simple analysis on a 20-million strong file is not beyond the capabilities provided by an average PC equipped with at least two physical drives, one being reserved strictly for SAS I/O. I would even assert that the latter is a must for any PC anticipated to do a fair amount of pure I/O.
> > >
> > > Of course, even with the extra drive, the picture would be quite different if I had 200 variables in the file, and/or had to reorder the file incessantly back and forth (which would indicate too much stream-of-consciousness programming and/or poor design), or if the files were heavily shared in a multi-user system. Besides, intensive I/O is not what most preconfigured PCs are sold for. In other words, when you then go on to say
> > >
> > > >The best advice that I can give you is to try to minimize the number of times that SAS has to read and write the data. If, for instance, you are computing metrics for individual people, then depending on the problem you might benefit by initially sorting the data by person id and date. If you do this you can probably design programs using array processing that don't require you to do so many passes of the entire dataset. Once you get a 1 record per person dataset you can do your data analysis with standard SAS procedures. In addition, another approach is to make smaller datasets that meet very specific criteria and then compute your metrics and string together these datasets into one or more datasets for your final analysis.
> > >
> > > I strongly concur on ALL of these succinctly presented points, every single one hitting the respective nail right on the head.
> > >
> > > Kind regards,
> > > ===============
> > > Paul M. Dorfman
> > > Jacksonville, FL
> > > ===============
> > >
> > > >"Paolo ORIFICI" <Orifici@LACAJA.COM.AR> wrote in message news:sf20ff1b.001@lacaja.com.ar...
> > > > > Hi list members,
> > > > >
> > > > > I'm having problems with SAS performance because it takes days to finish some processes. Is this common with large datasets?
> > > > >
> > > > > I'm working with datasets of about 20 million rows and many columns on a windows NT platform.
> > > > >
> > > > > My question is what is the best-performing architecture for sorting and working with large datasets. I was asking myself if using Unix or Linux is better than windows NT, if it depends on the server (we doubled the hard disk memory but it didn't make a substantial difference), if we need more processor, etc.
> > > > >
> > > > > Any experience sharing or suggestion is very welcome.
> > > > >
> > > > > TIA,
> > > > >
> > > > > Paolo
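
The fix described in the quoted thread above, computing several metrics in one procedure call rather than one macro call per metric, might look roughly like the following. This is only a sketch: the dataset DETAIL, its variables, and the choice of statistics are hypothetical and do not reconstruct the job described in the thread.

```sas
/* Hypothetical detail data: three measures per id. */
data detail;
   input id b c d;
   datalines;
1 10 1 5
1 20 2 7
2 30 3 9
2 40 4 11
;
run;

/* One call, one pass over the data: sums and means of */
/* every measure for every id, written to one dataset. */
proc means data=detail noprint nway;
   class id;
   var b c d;
   output out=stats (drop=_type_ _freq_) sum= mean= / autoname;
run;
```

With AUTONAME, the output variables are named after the analysis variable and statistic (for example b_Sum and b_Mean), so adding a metric means adding a keyword to the OUTPUT statement rather than adding another pass over the data.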

