Date: Mon, 19 Jun 2006 13:29:41 -0400
Reply-To: Sigurd Hermansen <HERMANS1@WESTAT.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Sigurd Hermansen <HERMANS1@WESTAT.COM>
Subject: Re: Another question about indexes
Content-Type: text/plain; charset="us-ascii"
Erico:
Guess that you are referring to what I said about indexes in this
thread:
'As for 1., indexes on subsetting variables may speed up selection of
subsets, but will probably not be needed on the date and foo variables
...'
While I agree that indexing can in some situations greatly improve
efficiency of data access, I don't see that indexing has a very good
chance of speeding up searches in the situation given:
1. five different datasets that would have to be indexed individually or
read row by row into a dataset prior to indexing;
2. three search constraints on three different keys;
3. filters can be applied as the five datasets are being read initially.
If the three subsetting indexes already exist on the five datasets, then
SAS will likely use them as an alternative to sorting on three different
orders; nonetheless, creating the indexes for one application will
likely cost more than scanning and filtering of the five datasets.
Should the proportion of rows that elude the filter turn out to greater
than 20%, SAS will likely not use indexes in subsetting operations. I'd
start by using WHERE statements to subset each of the five datasets
individually, then combine the yields.
When datasets have many columns and an application defined on them
requires far fewer columns, then a KEEP option or SELECT clause in a
UNION of queries may also reduce I/O, sometimes dramatically. Only when
searches on subsidiary keys occur often would indexing of many datasets
improve efficiency.
Sig
-----Original Message-----
From: owner-sas-l@listserv.uga.edu [mailto:owner-sas-l@listserv.uga.edu]
On Behalf Of Eric Eyre
Sent: Sunday, June 18, 2006 4:16 AM
To: sas-l@uga.edu
Subject: Re: Another question about indexes
Are you sure about that SH?
If the input is a sas data set (as opposed to a raw data file) as in
data two;
set one;
where x<5;
The WHERE will select qualifying logical obs for movement to the PDV
after the data has reached the page buffer in memory, thus ultimately
saving cpu resources, but no i/o.
Backing up a little bit to the i/o department, if DATA ONE residing in
disk storage has an index x on var x, and sas ends up using the index,
then sas will use the index to move only the PAGES of data that contain
qualifying logical obs into the page buffer in memory, thus conserving
i/o, since sas has fewer pages to load. This is commensurate with the
fact that sas data sets are arranged internally in pages and in sas page
size is conceptually and often literally aligned with buffer size.
So with judicious index use you can often get a 2fer - cpu savings from
the where, i/o savings from the index
Erico
"Scott Bass" <sas_l_739@yahoo.com.au> wrote in message
news:449161a9$0$7147$afc38c87@news.optusnet.com.au...
> Hi,
>
> Say I have datasets data_2001 data_2002 data_2003 data_2004 data_2005.
> I index them on the more commonly used subsetting variables. They are
> simple (not composite) indexes.
>
> 1. If a where clause is something like:
>
> where date between start and end and foo in ("foo", "bar", "blah") and
> code=5
>
> should I assume that the indexes will greatly speed up the where
> clause processing? Sorry I know this is a bit vague - does anyone
> know where to look in the doc for details on how indexes are used with
> where clauses?
>
> 2. If I create a data step view such as:
>
> data total / view=total;
> set data_2001 data_2002 data_2003 data_2004 data_2005; run;
>
> then use the same where clause as above, but on the DSV, will the
> indexes help performance?
>
> Thanks,
> Scott
>