Date: Tue, 1 Jul 2003 08:38:37 -0700
Reply-To: Daniel Milkove <dmilkove@ERS.USDA.GOV>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Daniel Milkove <dmilkove@ERS.USDA.GOV>
Organization: http://groups.google.com/
Subject: SQL joins
Content-Type: text/plain; charset=ISO-8859-1
The program code below works, but I am looking for a way to improve
its execution speed and to learn a little more SQL (or at least
something about what SQL cannot do). To make this concrete, records
in dataset DL contain information about the number of loans made by a
bank in a particular county. File SL2 has records for each county in
which a bank has a branch office. Banks can make loans in counties
where they have no branches, potentially at great distances by using
mail, phone, or the Internet. So for each county where a bank made
loans, the program finds the distance to the nearest county containing
a branch of that bank. The formula to compute the distance variable
(DistanceH) is actually much more complicated, involving latitude and
longitude for the two counties. I replaced it by a short ,
meaningless formula that still involves variables from both input
files.
My initial approach was fairly straightforward but very brute force.
SQL created a large intermediate file (1-4 million records) by
matching each county in which a bank made loans with every county
containing a branch of the bank. Then I used Data steps and Proc Sort
to compute distance between county pairs, to sort by bank, loan
county, and distance to bank branch county, and to select the nearest
bank branch county.
My next attempt, with code following, uses SQL to successfully do
everything in one step. That is, it takes a bank/county loan record,
finds all bank branch/county records for that bank, computes
distances, selects the record having minimum distance, and writes an
output record. I assumed that this approach would avoid creating and
sorting a large intermediate file. However, this program only saved a
little real time. Using Windows Explorer to monitor the Sas temporay
directory showed that two large files were created during execution of
the program. Evidently, the full outer join takes place and the file
is sorted as before. I had hoped that the Group By statement would
allow Sas to operate on one bank/loan county record at a time, but
clearly that is wrong. Is there a way to achieve this? That is, read
a record from DL, find all relevant records in SL2, compute distances,
identify the smallest distance, write an output record, and move on to
the next record in DL, thus large intermediate files? I looked a
little at subqueries but thought that is not appropriate here since
the formula for DistanceH uses information from both intermediate
files. Is that true? Any suggestions?
Thanks.
Dan Milkove
Proc SQL ;
Create Table DLSL2 As Select *, DL.FipscdD - SL2.FipscdS AS DistanceH
From DL Left Join SL2
On DL.IDRSSD = SL2.SUMD9000
Group By IDRSSD, FipscdD
Having DistanceH = Min(DistanceH)
;
Quit ;
|