```Date: Thu, 18 Mar 2004 15:52:03 -0500 Reply-To: Howard Schreier Sender: "SAS(r) Discussion" From: Howard Schreier Subject: Re: Super Size Dataset from Cartesian Join One can custom-code all sorts of things in the DATA step. My preference with this type of problem is to stick with SQL and use a trick or two to give the optimizer something to work with. I used Richard's code (found elsewhere in this thread) to generate test data: %let n = 60000; data points; do id = 1 to &n; x = ranuni(1); y = ranuni(1); output; end; run; %let maxdistance = 0.05; %let maxdistance_squared = %sysevalf (&maxdistance**2); Then I transformed the points into a grid with integer coordinates: data gridded; set points; xgrid = round(round(x,&maxdistance)/&maxdistance); ygrid = round(round(y,&maxdistance)/&maxdistance); run; Note that this grid has nothing to do with the grid mentioned in the initial problem statement. Here's the basic SQL solution: proc sql; create table sqlslow as select distinct come.id as i, go.id as j, (come.x-go.x)**2 + (come.y-go.y)**2 as distance_squared from gridded as come, gridded as go where come.id < go.id and calculated distance_squared <= &maxdistance_squared order by i, j; quit; The problem of course is that it has to form and evaluate about 1.8 billion point pairs. Now for the tricks. Set up a simple adjacency table: data offsets; do xoffset = -1 to 1; do yoffset = -1 to 1; output; end; end; run; Two grid squares are considered adjacent if their coordinates differ by no more than the the respective X and Y offset values. This definition of adjacency includes coincidence as well as diagonal adjacency (squares touching at corners only). The size of the squares was contrived so that any pair of points satisfying the distance condition must be in adjacent squares. That requirement can be rolled into the query: proc sql _method; create table sqlfast as select distinct come.id as i, go.id as j, (come.x-go.x)**2 + (come.y-go.y)**2 as distance_squared from (select * from gridded , offsets)as come , gridded as go where come.id < go.id and calculated distance_squared <= &maxdistance_squared and come.xgrid + xoffset = go.xgrid and come.ygrid + yoffset = go.ygrid order by i, j; quit; What does this do? It inserts a third table (OFFSETS) into the join, making the theoretical size of the join 9 times was it would otherwise be. It inserts two *redundant* conditions (the ones involving the XGRID and YGRID columns) into the WHERE clause. Nevertheless, it speeds things up considerably, because the added conditions are equalities, which is what the optimizer knows how to exploit. Setting &N to 6000, I found that it reduced processing time by about 75%. Richard's DATA step does better (about 88%). But considering the fact that he was coding a solution to the given problem while I was using a somewhat generic technique, that's more or less to be expected. On Thu, 11 Mar 2004 16:16:16 -0500, Howard Schreier wrote: >A couple of references for Sig's suggestion: > >"SQL Joins -- The Long and The Short of It" by Paul Kent >http://support.sas.com/techsup/technote/ts553.html >Search for "Nearest Neighbors problem" > >"Picking Up Where the SQL Optimizer Leaves Off" by Howard Schreier >http://www.nesug.org/Proceedings/nesug03/at/at008.pdf > >On Thu, 11 Mar 2004 15:58:56 -0500, Sigurd Hermansen >wrote: > >>Dewen: >>One of SQL's nice features allows a database programmer to constrain the >>number of rows being joined (Cartesian or otherwise), and do so on >something >>other than an exact match (as in a SAS Data step MERGE BY GROUP). One trick >>in query optimization is to find a rough measure of something (distance in >>your case), and use it to segment the look-up matrix. >> >>Unfortunately I don't have a prepackaged solution for you. I have done >>something similar for screening calendar dates (a one-dimensional problem), >>but have not worked on the two-dimensional extension. I'd guess that you >>could define a checkerboard of mile-square blocks and number them so a >>difference of x between two block numbers would guarantee that instances in >>two blocks would exceed one mile. You would then join all cells >> WHERE t1.blockNumber-t2..block < x >>This condition eliminates pairs that cannot have a distance apart smaller >>than one mile. >> >>In general the number of pairs of an NXN product has order O(N**2). You'll >>have to pare down the solution set to make the problem solvable. >>Sig >> >>-----Original Message----- >>From: Dewen Hou [mailto:Dewen.Hou@BLOCKBUSTER.COM] >>Sent: Thursday, March 11, 2004 2:38 PM >>To: SAS-L@LISTSERV.UGA.EDU >>Subject: Super Size Dataset from Cartesian Join >> >> >>Dear All, >> >>I have to use a Cartesian join on a dataset, which represents a grid area, >>to set up a look-up matrix table for each cell in the grid against all >other >>cells in the same grid file. Then I calculate the distance among all the >>cells, just to keep those with distance smaller than 1 mile. Please look at >>the codes below. >> >>The problem is if the nop.newgrid dataset has 60,000 cells, the Cartesian >>join table nop.dist_matrix_all will have 3.6 billion records (60,000 X >>60,000). I saw a temp dataset file as big as 78 GB in my local drive. None >>of my machines can reach the final nop.dist_matrix_all dataset. >> >>So, is there a way to work around this? >> >>Thank you so much! >> >>David Hou >> >>/********************************************************************/ >>/*Create a new distance for all &maxcells against all cells*/ >>/*******************************************************************/ >>proc sql; >>create table nop.dist_matrix_all as >>select a.cellid as cellid_m, a.X as X1, a.Y as Y1, >> b.cellid as cellid_all, b.X as X2, b.Y as Y2 >>from nop.newgrid. a, nop.newgrid. b; >>quit; >> >>/********************************************************************/ >>/*Calculate distance >>*/ >>/********************************************************************/ >>data nop.dist_matrix_all; >>set nop.dist_matrix_all; >>iLon1 = X1*3.14159/180; >>iLon2 = X2*3.14159/180; >>iLat1 = Y1*3.14159/180; >>iLat2 = Y2*3.14159/180; >> >>iDiffLon = iLon2 - iLon1; >>iDiffLat = iLat2 - iLat1; >> >>minNum = sqrt(sin(iDifflat/2)*sin(iDifflat/2) + cos(iLat1)* >> cos(iLat2)*sin(iDiffLon/2)*sin(iDiffLon/2)); >>result = 2*ArSin(min(1, minNum))*6370997; >>distance = result/1609.344; >>if distance < 1 and distance > 0; >>Keep cellid_m cellid_all distance; >>run; ```

Back to: Top of message | Previous page | Main SAS-L page