Date: Sun, 1 Nov 2009 15:54:24 -0500
Reply-To: Sigurd Hermansen <HERMANS1@WESTAT.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Sigurd Hermansen <HERMANS1@WESTAT.COM>
Subject: Re: Cartesian Join Efficiency
Content-Type: text/plain; charset="utf-8"
I have to wonder why your projects would require a Cartesian product. I often find that in linking large relational datasets on weak identifying keys a SQL compiler cannot improve a query and resorts to computing a Cartesian product in the background as a first step; nonetheless, as a rule I have a more limited objective in mind and hope the compiler can find a more efficient way to meet the objective.
SAS SQL offers a variety of ways to constrain a solution path to avoid a massive Cartesian product. The SAS-L Archives include many postings during the past decade by Paul Dorfman, Howard Schreier, Ian Whitlock, and Richard DeVenezia, to name a few, who have invented ways to extend or bypass the SAS SQL compiler and avoid its failures to optimize queries.
A solution that features all possible pairings of two relations (a Cartesian product along the lines of a coordinate system) may frame a problem but doesn't contribute analytic insights. Likely you have some form of reduction of a Cartesian product in mind as a solution. I'd recommend that you define what you are trying to achieve as an end result. From there you may be able to focus on a more efficient method for combining and selecting data in a way that will help you achieve that result.
The SAS macroprogram that you propose to substitute for a Cartesian product query in SAS SQL has a simpler implementation as a series of SAS Data step views with firstobs= and obs= dataset options on the LHS of SQL FROM clauses embedded in a UNION ALL query. A better solution for a database programming problem would limit the search space for a solution to something much smaller than a Cartesian product. For example, instead of joining all rows in R1 (1,000,000 tuple dataset) and R2 (100,000 tuple dataset) to yield R (100,000,000,000 tuple dataset), and then selecting tuples with R1.weight>130kg and R2.height>2m, an improved solution would subset R1 to tuples with those with the attribute weight>130kg and R2 to those with the attribute height>2m before joining the subsets. If one includes the subsetting operations in a WHERE clause, most versions of SAS SQL will generate an execution plan that promotes the subsetting operations to take effect during the reads of the dataset and thusly implement the improved solution. More complex query optimization problems, such as the same general problem as before but with a R1.weight>130kg OR R2.height>2m condition, require more complex optimization methods.
The SAS SQL compiler combines logical analysis with indexing and other means of placing bounds on solutions. SAS-L and other forums have a wealth of explanations and examples of query optimizations techniques. Take a look and ask questions about techniques that seem mysterious.
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of pigpigpig
Sent: Saturday, October 31, 2009 8:08 PM
Subject: Re: Cartesian Join Efficiency
On 10月30日, 下午11时05分, art...@NETSCAPE.NET (Arthur Tabachneck) wrote:
> It would help if you post some samle data for the two files and what
> you want the third (new) file to look like.
> On Oct 30, 2:10 pm, pigpigpig <pigzhu...@gmail.com> wrote:
> > Cartesian Join is quite frequently used in my projects.. � It is an
> > inefficient join, but I have to do it…because I have to match the data
> > in table a with the data in table b. � If table a has 10000 and table b
> > has 5000, then the table generated by the catesion join will have
> > 5000*10000=50,000,000 records.
> > This is very time consuming.. To save time, � what I have done is to
> > cut the observations of one table, and run the join parallelly in a
> > few SAS sessions (open like 5 SAS window, one of the window runs for
> > %cartesian_join � ( Low=0, high=1000); another one run ( Low=1000,
> > high=2000), .. etc..
> > I am posting my solution here to share with you all, feel free to
> > comment , I am not sure if this is a good solution..... I believe
> > there must be a better way to do this.. Any suggestion??
> > Thanks!!!!!
> > For example:
> > %macro � cartesian_join ( Low=0, high=1000)
> > Data partial_a;
> > set a;
> > if &low <_n_<=&high then output;
> > run;
> > Proc sql;
> > create table c_&high as select partial_a.* , b. * from partial_a, b;
> > quit;
> > proc append base=c_1 � data=c_&high ;
> > run;
> > %mend;
> > %cartesian_join � ( Low=0, high=1000);
> > %cartesian_join � ( Low=1000, high=2000);
> > %cartesian_join � ( Low=2000, high=3000);
> > %cartesian_join � ( Low=3000, high=4000);
> > %cartesian_join � ( Low=4000, high=5000);
> > %cartesian_join � ( Low=5000, high=6000);
> > %cartesian_join � ( Low=6000, high=7000);
> > %cartesian_join � ( Low=7000, high=8000);
> > %cartesian_join � ( Low=8000, high=9000);
> > %cartesian_join � ( Low=9000, high=10000);
> > The one of the best things in the world is sharing.................
Eg Table A --->
INPUT var1_A $ var2_A $ var3_A $;
1 Jane Jack
2 Kate Kitt
3 Jean Jim
Table b -->
INPUT var1_B $ var2_B $;
CREATE TABLE TEST AS SELECT A.*, B.* FROM A, B;
Data test is the final table that I want. Each row in A join with all
the observations in B once.
Such join takes very long time when i deal with big tables
(eg.million records in both A and B)
Therefore, I divide table a (as above ) into a few subsets, open a few
SAS session and run the job parallelly
eg. one sas session run subset from 0-1000:
> > %cartesian_join ( Low=0, high=1000);
Another sas session run subset from :
> > %cartesian_join ( Low=1000, high=2000);
so on so forth..
I am not sure if this is the best way of doing this.. I am wondering
if anyone has better idea run the Cartesian
Product join more efficiently.
Thanks a lot! Hope I am making myself clearer this time