LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (July 2003, week 1)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Tue, 1 Jul 2003 08:38:37 -0700
Reply-To:     Daniel Milkove <dmilkove@ERS.USDA.GOV>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Daniel Milkove <dmilkove@ERS.USDA.GOV>
Organization: http://groups.google.com/
Subject:      SQL joins
Content-Type: text/plain; charset=ISO-8859-1

The program code below works, but I am looking for a way to improve its execution speed and to learn a little more SQL (or at least something about what SQL cannot do). To make this concrete, records in dataset DL contain information about the number of loans made by a bank in a particular county. File SL2 has records for each county in which a bank has a branch office. Banks can make loans in counties where they have no branches, potentially at great distances by using mail, phone, or the Internet. So for each county where a bank made loans, the program finds the distance to the nearest county containing a branch of that bank. The formula to compute the distance variable (DistanceH) is actually much more complicated, involving latitude and longitude for the two counties. I replaced it by a short , meaningless formula that still involves variables from both input files.

My initial approach was fairly straightforward but very brute force. SQL created a large intermediate file (1-4 million records) by matching each county in which a bank made loans with every county containing a branch of the bank. Then I used Data steps and Proc Sort to compute distance between county pairs, to sort by bank, loan county, and distance to bank branch county, and to select the nearest bank branch county.

My next attempt, with code following, uses SQL to successfully do everything in one step. That is, it takes a bank/county loan record, finds all bank branch/county records for that bank, computes distances, selects the record having minimum distance, and writes an output record. I assumed that this approach would avoid creating and sorting a large intermediate file. However, this program only saved a little real time. Using Windows Explorer to monitor the Sas temporay directory showed that two large files were created during execution of the program. Evidently, the full outer join takes place and the file is sorted as before. I had hoped that the Group By statement would allow Sas to operate on one bank/loan county record at a time, but clearly that is wrong. Is there a way to achieve this? That is, read a record from DL, find all relevant records in SL2, compute distances, identify the smallest distance, write an output record, and move on to the next record in DL, thus large intermediate files? I looked a little at subqueries but thought that is not appropriate here since the formula for DistanceH uses information from both intermediate files. Is that true? Any suggestions? Thanks. Dan Milkove

Proc SQL ; Create Table DLSL2 As Select *, DL.FipscdD - SL2.FipscdS AS DistanceH From DL Left Join SL2 On DL.IDRSSD = SL2.SUMD9000 Group By IDRSSD, FipscdD Having DistanceH = Min(DistanceH) ; Quit ;


Back to: Top of message | Previous page | Main SAS-L page