Date: Fri, 1 May 2009 13:56:28 -0400
Reply-To: Michael Raithel <michaelraithel@WESTAT.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Michael Raithel <michaelraithel@WESTAT.COM>
Subject: Re: basic questions about index a large SAS file
Content-Type: text/plain; charset="us-ascii"
Christine posted the following interesting question:
> Hello all,
> I have a very large dataset (20 million obs, 101 variables)
> with information about states, counties within a state, and
> indviduals within the counties. I am frequently running proc
> means with a class statement. Sometimes I run the means by
> state (using the class not the by statement) and sometimes I
> run the means by state and county (again using the class statement).
> My first question is would indexing the dataset by state and
> county improve the speed at which SAS runs the means with the
> class option?
> If indexing is the way to go, then is this the proper SAS syntax:
> proc datasets library=mylib;
> modify MYFILENAME ;
> index create state;
> index create county;
> Note that you need both the state and county variables to
> unqiuely indentify a county.
Christine, from what you have written, I would bet that indexing by STATE and COUNTY would be a very effective option. But, it is important to note that one of the laws of how effective an index is revolves around its degree of being discriminant. Quoting from my award-winning book, The Complete Guide to SAS Indexes, Chapter 3, section: Proposed Index Key Variable Discriminant, Page 28, first paragraph --with the author's permission, of course:
"Index key variables should be discriminant. A discriminant variable is one with values very specific to a small set of observations within a data set."
So, if specifying STATE/COUNTY leads to relatively small sets of observations in your 20 million obs SAS data set, you are definitely barking up the right tree with indexes.
I would suggest considering a composite index, if you will always be using STATE and COUNTY together in your WHERE expressions. This might look like this:
proc datasets library=mylib;
modify MYFILENAME ;
index create state_county=(state county);
This will create the state_county index for the MYFILENAME SAS data set in the MYLIB SAS data library.
Then, your PROC SUMMARY might look something like this:
proc summary nway data=mylib.myfilename(where=(state="MD" and county = "Montgomery));
class state county;
var pop1 pop2 pop3;
output out=sum_state_county sum=;
SAS will recognize that there is a composite index built on mylib.myfilename and determine if using it to return observations is less costly than a sequential read of the entire SAS data set. If so, then presto-chango, ALAKAZAM! SAS uses your composite index and saves you computer resources.
Christine, best of luck in all of your SAS endeavors!
I hope that this suggestion proves helpful now, and in the future!
Of course, all of these opinions and insights are my own, and do not reflect those of my organization or my associates. All SAS code and/or methodologies specified in this posting are for illustrative purposes only and no warranty is stated or implied as to their accuracy or applicability. People deciding to use information in this posting do so at their own risk.
Michael A. Raithel
"The man who wrote the book on performance"
Author: Tuning SAS Applications in the MVS Environment
Author: Tuning SAS Applications in the OS/390 and z/OS Environments, Second Edition
Author: The Complete Guide to SAS Indexes
The point of philosophy is to start with something so simple
as not to seem worth stating, and to end with something so
paradoxical that no one will believe it. - Bertrand Russell