Date: Wed, 10 Aug 2005 12:36:56 -0400
Reply-To: "Dorfman, Paul" <paul.dorfman@FCSO.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "Dorfman, Paul" <paul.dorfman@FCSO.COM>
Subject: Re: small slices of big data sets
N,
600m (and with - I dare to assume - a pretty wide row) read sequentially
from DB2 in 3 hours? That's pretty fast!!!
Now seriously, the structure if the data base should not be a mystery. Ask
the 'experts' and they should tell you what it is, for without knowing it
you cannot query the data base efficiently. Specifically, ask about the
index over SE_START_DATE. I cannot conceive that the data base is not
indexed on SE_START_DATE. If it is not, be sure to point it out to your
management, for this is called wasting system resources. However, it is
far more likely that not only the column is indexed but the data base is
in addition partitioned by it, and the view the DBAs have created for you
is merely a UNION ALL all the partitions together. Such a view ought to
honor the index if its usage is deemed optimal by the DB2 optimizer. That
in turn depends on how the query is passed to DB2 by the SAS DB2 libname
engine. Ideally, the where clause DB2 receives on her end is thus:
where le_start_date between '2005-01-01' and '2005-01-31'
To check if this will work much faster than you have observed (as it
should if there is an index, let alone partitioning, on le_start_date),
try using the Pass-Thru instead of the libname engine. This will pass the
query to DB2 in its native form. In other words, code:
proc sql ;
connect to DB2 (database=xxx user=yyy pw=zzz... <ape from libname>) ;
create table extract as
select * from connection to db2 (
select * from THE_TABLE
where le_start_date between '2005-01-01' and '2005-01-31'
) ;
disconnect from DB2 ;
quit ;
Since you tell the optimizer you want 1/25 of the data base, it will in
all likelihood use the index (unless there is none, and again, this is
very hard to believe).
That having been said, methinks the libname engine should be passing the
query (as written by you) to DB2 in an optimizable form. Heed what options
are included on the LIBNAME statement, specifically that DBINDEX=YES;
however, the latter may or may not have an effect depending on other
factors (DB2 may use an index even if DBINDEX=NO is set).
As to the cheating, you may try this one. Create a SAS data set:
data date_range ;
do le_start_date = '01jan2005'd to '31dec2005'd ;
output ;
end ;
run ;
proc sql ;
create table extract as
select *
from <DB2LIBREF>.THE _TABLE (dbkey = le_start_date) db2
, date_range sas
where db2.le_start_date = sas.le_start_date
;
quit ;
This *usually* prompts DB2 to search the index for each value of
le_start_date coming from the SAS "driver" file.
Crying is not an option. Good luck!
Kind regards
----------------
Paul M. Dorfman
Jacksonville, FL
----------------
On Wed, 10 Aug 2005 09:32:21 -0600, nyiann <nwy@UALBERTA.CA> wrote:
>Hi all,
>
>The data system I am working one consists of SAS views on top of db2
>tables, or something like that. The 'expert' database managers designed
>this system, and it's struvcture is a mystery. What I know is that
>indexing data is not an option, and that using it is so slow (compared to
>the old flat file method) I want to cry.
>
>Technically, the primary table is a data set of 600 million records (25
>years of data). I usually make extracts of a year, or a few months. The
>current method of querying is as follows:
>
>where input("20050101",yymmdd8.) le se_start_date le
> input("20050131",yymmdd8.);
>where input("20050101",yymmdd8.) le se_start_date le
> input("20051231",yymmdd8.);
>
>whether in sql or in a data step. The problem is, whether I am selecting
>one month, or 5 years, it takes 3 hours to pass through the data. So
>here's the question--are there any programming options that can allow me
>to cheat this system? I thought 'firstobs' and 'obs' might do the trick
>(since the data are sorted by time, and I can guess roughly how many
>records there are mer month) but this doesn't save time. Any other
>thoughts? In theory, hashing would work perfectly, but I can't think of
>a way to do this given the data's current configuration--as a view that
>apparently (the administrators tell me) can't be indexed.
>
>Thanks in advance,
>
>N
|