Date: Fri, 9 Nov 2007 11:38:06 -0500
Reply-To: Sigurd Hermansen <HERMANS1@WESTAT.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Sigurd Hermansen <HERMANS1@WESTAT.COM>
Subject: Re: Which is quicker: view on view or several lookups?
In-Reply-To: <1194608493.926273.77270@o3g2000hsb.googlegroups.com>
Content-Type: text/plain; charset="us-ascii"
Sander:
I don't have any detail on how DI Studio works. Views in SAS PROC SQL
run behind the scene, so much depends on sources of data (Oracle, OLEdb,
DB2 SAS/Access engines as well as SAS datasets). Judging from what I
have seen from the outside, the SAS SQL compiler determines how data
pipe through a series of stored and in-line views. Views execute in
sequence, much like nested sub-queries, so SAS caches data in a way that
fits within an execution plan.
In SAS SQL the LEFT JOIN almost always forces sorting operations. Hash
object equivalents almost always work much quicker, often dramatically
so. Why not test what you are trying to do in SAS PROC SQL? The
invaluable SAS-L Archives contain many postings on SQL performance,
including the LEFT JOIN, hash object look-up and summarization, and data
transformations. Some of the implementations of declarative programs
come very close to the theoretical O(N) limit of performance on current
platforms.
If you are depending on DI to optimize declarative programs, you'll
pretty much have to settle for whatever execution plan DI writes. SAS
PROC SQL tests could show whether it makes sense to depend on DI, or
whether intervening might improve performance significantly.
S
-----Original Message-----
From: owner-sas-l@listserv.uga.edu [mailto:owner-sas-l@listserv.uga.edu]
On Behalf Of Sander Burggraaff
Sent: Friday, November 09, 2007 6:42 AM
To: sas-l@uga.edu
Subject: Which is quicker: view on view or several lookups?
Example:
We have one dataset containing 10 million records. We also have 3 other
datasets that contain product information. Data of each of those 3
datasets has to be added to the 10 million records by not only a single
key variable but a date as well. So we have to check if a certain date
is between 2 dates in each record of those 3 datasets. Which method
would be better for using in DI Studio?
We can create 3 views with SQL left joins. View 2 would be based on view
1 and view 3 would be based on view 2. The idea behind this is that we
can reduce I/O operations by using views. SQL joins don't have a great
performance however.
Another way would be to use the lookup transformation in DI Studio which
uses hashing. This is a lot quicker than SQL join but the downside is
that we would be reading those 10 million records three times instead of
only once (!?!) when using the 3 views.
Now it's that last statement that has got us thinking. Will those 10
million records really be read only once when using views? We have our
doubts...
Unfortunately we haven't been able to find any information that
clarifies this point. We have found a thread in this newsgroup and we
read the SUGI 19-27 paper and we found out about the spill file but
we're still not convinced. Is there anyone with a definite answer to
this question?