Date: Fri, 11 Jul 2003 09:26:18 -0700
Reply-To: Daniel Milkove <dmilkove@ERS.USDA.GOV>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Daniel Milkove <dmilkove@ERS.USDA.GOV>
Subject: Re: SQL joins
Content-Type: text/plain; charset=ISO-8859-1
Input file DL has 17,950 records for 950 banks. Some banks make loans
in just 1 county and one bank has loans in 1,174 counties. Input file
SL2 has 25,699 records for 9,741 banks, 5,813 having offices in a
single county and one has offices in 574 counties. NPaths = 1,091,930
in this case (somewhere around 3.5 million for a second version of DL
that I also use).
Your "speculative quick plan of attack" is logical and looks
promising. The first part is very fast, finding 6,272 cases (out of
17,950) where the minimum distance is zero. My implementation of the
second part has not worked yet, but I will try some more. Somehow the
query is still working with all cases, not just the subset where the
minimum distance is positive.
"Richard A. DeVenezia" <firstname.lastname@example.org> wrote in message news:<bef9l3$4fa00$1@ID-168040.news.dfncis.de>...
> "Daniel Milkove" <email@example.com> wrote in message
> > Richard,
> > Thanks for your suggestion. I am always amazed how frequent
> > contributors to this group are willing to dig through increasingly
> > convoluted threads. Your code is easier to read due to the use of
> > longer, more descriptive names for datasets and variables.
> > Unfortunately when applied to my actual data your program took just as
> > long to execute since it is essentially the same code as I provided at
> > the end of my original post. Thus SQL still creates a large (300MB)
> > temporary file that must be sorted.
> > This did motivate me to try something that I should have done in the
> > first place, namely ignore the majority of variables in the Select
> > statement and then merge them back in later with additional SQL Select
> > statements. By limiting variables, the temporary file size was
> > reduced to around 50MB and this part of the program took less than 2
> > minutes rather than over 10 minutes.
> > If I had done this earlier I would probably not have been motivated to
> > investigate alternative methods. But now I am curious and will ask
> > one more time whether a subquery can do the following without
> > involving a full-scale outer join. Let the outer query provide one
> > record at a time. Given a record from the loan file, the subquery
> > should provide all records for the same bank that appear in the bank
> > branch file. Compute the distance from the loan county to each branch
> > county and output one record containing information from the loan
> > record and from the branch record corresponding to the smallest
> > distance. If this can be done, then no large temporary files are
> > created or sorted.
> > Thanks.
> > Dan
> The outer join _is_ restricted to only match bankIds, but apparently there
> are too many BankIds with large numbers of combinations of LoanCountyId and
> BranchCountyId (which are grouped by LoanBankId and LoanCountyId for
> selected min distance); or just too many BankIds
> Sounds like the data comes from input files... Are you indexing the SAS
> tables created from the data files ?
> Can you provide any 'rough' data metrics ? With these metrics we can better
> understand the domain of this problem.
> number of banks having loans in numbers of counties
> select nCounties, count(*) as nBankIds from
> ( select BankId, count(*) as nCounties from DL group by BankId )
> group by nCounties
> number of banks having branches in numbers of counties
> select nBranches, count(*) as nBankIds from
> ( select BankId, count(*) as nBranches from SL2 group by BankId)
> group by nBranches
> number of distances needing computation
> select count (*) as nPaths from DL,SL2 where DL.BankId = SL2.BankId
> Here is a speculative quick plan of attack if a sizeable portion of banks
> have loans and branches in the same county.
> Create list of banks with zero distance.
> Use same SQL with addtional "and Loans.CountyId = Branches.CountyId"
> Then do the problematic SQL again with additional "and Loans.BankId not in
> (select BankId from ZeroDistanceBanks)"
> Then merge ZeroDistanceBanks and NonZeroDistanceBanks.
> I have no idea if this will take longer or shorter, bigger or smaller.