Date: Fri, 20 Oct 2000 04:34:48 -0400
Reply-To: Gerhard Hellriegel <ghellrieg@T-ONLINE.DE>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Gerhard Hellriegel <ghellrieg@T-ONLINE.DE>
Subject: Re: Any MVS efficiency tips (for large files, etc.)?
On Fri, 20 Oct 2000 02:52:23 GMT, Brad <b_branford@HOTMAIL.COM> wrote:
>Hi,
>
>I'd like to compile a list of MVS efficiencies which allow for faster
>file processing when dealing with HUGE datasets. I'd appreciate it if
>the people on this newsgroup, with their extensive experience, would
>share their tips on various aspects of using SAS with large datasets.
>Once I have a list I'll post it for everybody's benefit.
>
>Thanks for sharing.
>
>Brad
There are plenty of them!
First of all, there is a SAS book with a lot of tips. It was written when
V6 was current, but most of the tips still apply to V8.
There are many things you can do with program logic, and that is often more
efficient than all the other tuning tips. It depends on what you have to do.
Sometimes you have to try: e.g. it is sometimes more efficient to use PROC
APPEND instead of DATA ...; SET file1 file2;
On the other hand, you can possibly avoid a sort with the second method.
Which one is more efficient depends on the logic and the data.
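A minimal sketch of the two variants, to make the trade-off concrete. The
dataset names (work.file1, work.file2, work.all) and the BY variable id are
made up for illustration, and the second variant assumes both inputs are
already sorted by id:

```sas
/* Variant 1: PROC APPEND adds the obs of file2 to the end of file1 */
/* without reading file1 again.                                     */
proc append base=work.file1 data=work.file2;
run;

/* Variant 2: the DATA step reads both inputs completely, but with  */
/* a BY statement it interleaves them, so the result is already in  */
/* id order and needs no PROC SORT afterwards.                      */
data work.all;
  set work.file1 work.file2;
  by id;
run;
```

If file1 is huge and file2 is small, variant 1 tends to win; if the combined
data must be in sorted order anyway, variant 2 can save the extra sort.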
All in all: there are no "universal" tips. Everything depends on the data,
on what you do with the data and on where you do it.
The main thing is: CPU operations are fast, I/O operations are slow. So if
you can avoid I/O operations, that is worth much more than saving CPU
time.
The first thing is the BLKSIZE for SAS datasets. Depending on your device
and your data, half-track is usually the best choice (27648 for 3390). For
special applications another value may be better. Depending on your data it
is worth experimenting, e.g. with a multiple of the record length below
half-track.
If you have more than one dataset, put each of them in a separate library
on a separate device (have a look at the volumes in question: how much
traffic is there?).
If you can, use only a primary allocation; allocating extents costs the OS
time.
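As an illustration only, the allocation tips above might look like this in
LIBNAME statements. The dataset names, volume serials and space figures are
placeholders, and which host options are accepted depends on your SAS
release and site setup, so check the MVS companion before relying on it:

```sas
/* Two libraries on two different volumes (names are hypothetical). */
/* The input library already exists, so only DISP= is given.        */
libname in1 'USERID.BIG.INPUT1' disp=shr;

/* The new output library gets a half-track BLKSIZE for 3390 and a  */
/* primary space allocation only (no secondary quantity requested). */
libname out1 'USERID.BIG.OUTPUT1' disp=(new,catlg)
             unit=3390 volser=VOL002
             space=(cyl,(500)) blksize=27648;
```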
If you can, use hiperspace in expanded memory for temporary datasets!
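A hedged sketch of what that could look like, assuming the HIP host option
of the LIBNAME statement is available in your release and that your site
lets the job use hiperspaces. The libref fast and the dataset names are
made up:

```sas
/* temporary library held in a hiperspace instead of on DASD */
libname fast '&temp' hip;

/* intermediate results go to the hiperspace library */
data fast.step1;
  set in1.raw;
run;
```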
Try working with compressed datasets. You can transfer more data in
one EXCP that way (depending on the compression rate). The cost is
higher CPU consumption.
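For example (dataset names are placeholders), compression can be requested
for a single dataset or for everything created afterwards:

```sas
/* compress one large dataset only */
data big.trans(compress=yes);
  set big.raw;
run;

/* or switch compression on for all datasets created from here on */
options compress=yes;
```

The log of the DATA step reports how much the dataset shrank (or grew), so
it is easy to check whether the extra CPU time is worth it for your data.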
You can experiment with the BUFNO option to get more buffers, but you will
not see big effects. Between 4 and 10 buffers it may be a bit faster
(elapsed time); with bigger numbers it gets slower again, because the
overhead for the buffer handling increases.
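A small sketch of such an experiment, using the BUFNO= dataset option on
both input and output. The dataset names and the value 6 are arbitrary
starting points, not recommendations:

```sas
/* try a few values between 4 and 10 and compare elapsed time and EXCPs */
data big.out(bufno=6);
  set big.in(bufno=6);
run;
```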
Most of that applies to DASD devices like 3380, 3390, ..., but not to
hiperspace in memory.
For big datasets and sorts you should always use external sort utilities
like SYNCSORT, DFSORT, ... These utilities can also use hiperspace for their
temporary buffers (the sort work datasets!).
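A sketch of how PROC SORT can be pointed at the host sort utility, assuming
one is installed and that your site has not already made it the default.
Dataset and variable names are placeholders:

```sas
/* let PROC SORT call the host sort utility instead of the SAS sort */
options sortpgm=host;

proc sort data=big.trans out=big.sorted;
  by custno;
run;
```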
Give the SAS region plenty of memory. That avoids I/O operations for loading
SAS modules and keeps more data in memory. SAS makes use of it, and many
PROCs become more efficient!
There are too many programming tips to list them all. They are almost the
same as in other programming environments: use CPU, avoid I/O, use the
available memory...
In SAS that means:
- throw away everything you don't need as early as possible
(KEEP lists for input datasets, WHERE instead of IF, ...)
- avoid unnecessary steps (sorts, DATA steps, ...)
- avoid "mighty" PROCs if you can do the job with a small DATA step:
some aggregations are much faster in a DATA step than with PROC
SUMMARY.
- keep the code in loops (remember: each DATA step IS a kind of loop!) as
small as possible. For example,
%let n=0;
data a.b;
  set x.y;
  call symput("n",_n_);
  ....
run;
gives you (besides the other things the step does) the number of obs in a.b
in the macro variable &n.
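Better:
%let n=0;
data a.b;
  set x.y;
  /* call symput("n",_n_); */
  ....
run;
data _null_;
  /* NOBS= is filled in at compile time, so the SET never has to */
  /* execute; this way it also works when a.b has 0 obs          */
  if 0 then set a.b nobs=n;
  call symput("n",n);
  stop;
run;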
because in the first solution the CALL SYMPUT is executed once for every
obs in a.b. OK, you could force it to be executed only once:
%let n=0;
data a.b;
  set x.y nobs=n;
  if _n_=1 then
    call symput("n",n);
  ....
run;
but in this case the test _n_=1 is still executed for every obs. (Note
that this variant returns the obs count of x.y, which matches a.b only if
the step does not drop any obs.)
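The advice to throw away unneeded data as early as possible can be sketched
in the same way. The dataset big.customers and its variables custno, sales
and country are invented for this example:

```sas
/* slower: every variable of every obs is read into the step, */
/* and most of it is thrown away afterwards                   */
data work.de_slow;
  set big.customers;
  if country = 'DE';
  keep custno sales;
run;

/* usually cheaper: KEEP= limits the variables that are read, */
/* and WHERE= filters the obs before the step processes them  */
data work.de_fast;
  set big.customers(keep=custno sales country
                    where=(country = 'DE'));
  drop country;
run;
```

Both steps produce the same output; the second one just moves less data
through the DATA step.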
OK, you may say you wanted MVS tips for big datasets, not thoughts about
programming. In my opinion, looking only at MVS tuning is not the whole
story!
The sort problem is a good example: a quicksort or mergesort is always more
efficient than a simple bubblesort. Always? No, only for big datasets and
only if you do it more than once! To sort a dataset once, I'd always prefer
the bubblesort which I can write in 5 minutes, even if I have to wait 1
hour until it's finished. If I need 2 hours to build a sort which sorts my
dataset in 5 minutes, that is only worthwhile for me if I can use it more
than once!
When I have to do something in SAS, I always use the "quick and dirty" way
first. If I have to do the same thing often, I try to optimize it a bit. If
I have to do it very often and it is important for me that it runs fast, I
invest time to optimize it more! But not if I'm paid for producing
results and it's cheaper for my site to buy a bigger machine than to pay me
for reducing the resource consumption!
That is the light in which I see the tips above: e.g. using the half-track
size for DASD, without experiments to squeeze out another millisecond, is
fine. Using WHERE instead of IF is fine too. It helps to keep a few such
rules in the back of your mind, so that your programs are not unnecessarily
inefficient.
A borderline case for me is, for example: do I use the slow SASHELP views,
or do I use a utility PROC (PROC CATALOG, DATASETS, ...) to get some
information? First I ALWAYS use the SASHELP view. If I need that program in
production and it runs every day, I replace the information extraction with
a faster solution (if I have time). In a program which runs once a month,
I'll do that only if someone tells me that my program has only 5 minutes to
run but needs 10. So you see: always be careful with the expensive
resources, not with the cheap ones!
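To make the SASHELP example concrete, here is a rough sketch of both ways to
get the number of obs of a dataset a.b into a macro variable. It assumes the
SASHELP.VTABLE view with its LIBNAME, MEMNAME and NOBS columns; work.cont is
just a scratch dataset name:

```sas
/* quick and dirty: the SASHELP view may have to look at every */
/* allocated library before it returns the one row we want     */
data _null_;
  set sashelp.vtable(where=(libname='A' and memname='B'));
  call symput('n', trim(left(put(nobs, 12.))));
run;

/* for production: ask a utility PROC about this one dataset only */
proc contents data=a.b out=work.cont(keep=nobs) noprint;
run;
data _null_;
  set work.cont(obs=1);
  call symput('n', trim(left(put(nobs, 12.))));
run;
```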
Gerhard