Date: Wed, 18 Dec 2002 14:03:12 -0800
Reply-To: Cassell.David@EPAMAIL.EPA.GOV
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "David L. Cassell" <Cassell.David@EPAMAIL.EPA.GOV>
Subject: Re: Indexing SAS datasets
Content-type: text/plain; charset=us-ascii
"John J Genzano, III" <jgenzano@GENZANO.COM> sagely replied [in part]:
> My "rules" for indexing SAS datasets are as follows:
>
> 1 -- Data update/replacement to read/reporting ratio is small
> 2 -- Indexed field is often/mostly used to subset the data
> 3 -- "Normal" selection is for 1/3 of dataset or less
Excellent advice. But...
Sometimes these are extremely conservative estimates. But it can
depend on the size and shape of the data. There are SUGI papers
(including
one of my own) on this very subject, and they do not always agree on the
cutoffs. They're rules of thumb (as John knows), not axioms. I have
worked with some good-sized data sets (1.6 to 5 Gigs) with about 13
sort-keys, and indexing, even when reading the entire dataset, saved
time. Improving the index/sort key by optimal packing (see my paper)
saved even more time.
> Don't forget, using the index as above will only save I/O's, not CPU
> (but that should decrease wall time), but at the price of disk space.
So
> if you are in an environment where you are charged for resources, you
> need to make sure the trade-off is worth it. I was teaching an
> efficiencies seminar in an organization that charged for disk space
but
> not I/O's, so indexing cost them money and saved them none.
Unfortunately true. One has to ask *what* needs to be optimized.
CPU usage? Disk space? Wallclock time? Programmer time?
David
--
David Cassell, CSC
Cassell.David@epa.gov
Senior computing specialist
mathematical statistician
|