LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (August 2005, week 2)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Wed, 10 Aug 2005 12:36:56 -0400
Reply-To:     "Dorfman, Paul" <paul.dorfman@FCSO.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         "Dorfman, Paul" <paul.dorfman@FCSO.COM>
Subject:      Re: small slices of big data sets
Comments: To: nwy@UALBERTA.CA

N,

600m (and with - I dare to assume - a pretty wide row) read sequentially from DB2 in 3 hours? That's pretty fast!!!

Now seriously, the structure if the data base should not be a mystery. Ask the 'experts' and they should tell you what it is, for without knowing it you cannot query the data base efficiently. Specifically, ask about the index over SE_START_DATE. I cannot conceive that the data base is not indexed on SE_START_DATE. If it is not, be sure to point it out to your management, for this is called wasting system resources. However, it is far more likely that not only the column is indexed but the data base is in addition partitioned by it, and the view the DBAs have created for you is merely a UNION ALL all the partitions together. Such a view ought to honor the index if its usage is deemed optimal by the DB2 optimizer. That in turn depends on how the query is passed to DB2 by the SAS DB2 libname engine. Ideally, the where clause DB2 receives on her end is thus:

where le_start_date between '2005-01-01' and '2005-01-31'

To check if this will work much faster than you have observed (as it should if there is an index, let alone partitioning, on le_start_date), try using the Pass-Thru instead of the libname engine. This will pass the query to DB2 in its native form. In other words, code:

proc sql ; connect to DB2 (database=xxx user=yyy pw=zzz... <ape from libname>) ; create table extract as select * from connection to db2 ( select * from THE_TABLE where le_start_date between '2005-01-01' and '2005-01-31' ) ; disconnect from DB2 ; quit ;

Since you tell the optimizer you want 1/25 of the data base, it will in all likelihood use the index (unless there is none, and again, this is very hard to believe).

That having been said, methinks the libname engine should be passing the query (as written by you) to DB2 in an optimizable form. Heed what options are included on the LIBNAME statement, specifically that DBINDEX=YES; however, the latter may or may not have an effect depending on other factors (DB2 may use an index even if DBINDEX=NO is set).

As to the cheating, you may try this one. Create a SAS data set:

data date_range ; do le_start_date = '01jan2005'd to '31dec2005'd ; output ; end ; run ;

proc sql ; create table extract as select * from <DB2LIBREF>.THE _TABLE (dbkey = le_start_date) db2 , date_range sas where db2.le_start_date = sas.le_start_date ; quit ;

This *usually* prompts DB2 to search the index for each value of le_start_date coming from the SAS "driver" file.

Crying is not an option. Good luck!

Kind regards ---------------- Paul M. Dorfman Jacksonville, FL ----------------

On Wed, 10 Aug 2005 09:32:21 -0600, nyiann <nwy@UALBERTA.CA> wrote:

>Hi all, > >The data system I am working one consists of SAS views on top of db2 >tables, or something like that. The 'expert' database managers designed >this system, and it's struvcture is a mystery. What I know is that >indexing data is not an option, and that using it is so slow (compared to >the old flat file method) I want to cry. > >Technically, the primary table is a data set of 600 million records (25 >years of data). I usually make extracts of a year, or a few months. The >current method of querying is as follows: > >where input("20050101",yymmdd8.) le se_start_date le > input("20050131",yymmdd8.); >where input("20050101",yymmdd8.) le se_start_date le > input("20051231",yymmdd8.); > >whether in sql or in a data step. The problem is, whether I am selecting >one month, or 5 years, it takes 3 hours to pass through the data. So >here's the question--are there any programming options that can allow me >to cheat this system? I thought 'firstobs' and 'obs' might do the trick >(since the data are sorted by time, and I can guess roughly how many >records there are mer month) but this doesn't save time. Any other >thoughts? In theory, hashing would work perfectly, but I can't think of >a way to do this given the data's current configuration--as a view that >apparently (the administrators tell me) can't be indexed. > >Thanks in advance, > >N


Back to: Top of message | Previous page | Main SAS-L page