Date: Wed, 7 Oct 2009 19:00:49 -0400
Reply-To: Paul Dorfman <sashole@BELLSOUTH.NET>
Sender: "SAS(r) Discussion" <SASL@LISTSERV.UGA.EDU>
From: Paul Dorfman <sashole@BELLSOUTH.NET>
Subject: Re: Better way to merge these datasets??
Andy,
Change the method of combining the data sets. First, index the original
full-set files by the surrogate key.
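For instance, the full sets could be indexed with PROC DATASETS along these
lines (the library, data set, and index names here are placeholders; drop
UNIQUE if the key can repeat within a full set):

```sas
proc datasets library = work nolist ;
  modify set_b ;
    index create surr_key / unique ;
  modify set_c ;
    index create surr_key / unique ;
quit ;
```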
Then, to merge a given subset to its corresponding full sets, use an SQL join
instead of MERGE. The latter, even in the presence of indexes, reads every
record from every file (such is the nature of the ordered-list merge
algorithm), but the former will use the index on a full set when joining a
small subset with it.
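As a sketch, such a join might look like this (variable names other than
surr_key are placeholders; with the full sets indexed as above, the SQL
optimizer can elect an index join when the subset is small):

```sas
proc sql ;
  create table out as
  select a.*
       , b.var_b
       , c.var_c
  from subset_a a
  inner join set_b b on a.surr_key = b.surr_key
  inner join set_c c on a.surr_key = c.surr_key
  ;
quit ;
```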
Alternatively, instead of using SQL, use a DATA step where the subset is
read sequentially but the full sets are read directly via the KEY= option.
That would look something like this (assuming you are performing the
equivalent of an inner join of subset_a with "its" full sets):
data out ;
  set subset_a ;
  set set_b key = surr_key / unique ;
  if _iorc_ ne 0 then do ; _error_ = 0 ; delete ; end ;
  set set_c key = surr_key / unique ;
  if _iorc_ ne 0 then do ; _error_ = 0 ; delete ; end ;
  ...
  * and so on for each remaining full set ;
run ;
If you know beforehand that a subset is 30 percent or more of the size of
some of the full sets being combined with it, you can instead aggregate those
full sets in a MERGE and use direct keyed reads only against those full sets
that are more than 70 percent larger than the subset. For example, if
subset_a above is more than 30 percent of the size of sets E and H, say, we
would have
data out ;
  merge subset_a (in = a) set_e (in = e) set_h (in = h) ;
  by surr_key ;
  if a and e and h ;
  set set_b key = surr_key / unique ;
  if _iorc_ ne 0 then do ; _error_ = 0 ; delete ; end ;
  set set_c key = surr_key / unique ;
  if _iorc_ ne 0 then do ; _error_ = 0 ; delete ; end ;
  ...
  * and so on for each remaining full set ;
run ;
However, I would still use SQL, because the optimizer is likely to take all
this into consideration automatically, and it is less fussy to program.
Kind regards

Paul Dorfman
Jax, FL

On Wed, 7 Oct 2009 15:22:19 -0400, Andy A <neal@FOLKARTINBOTTLES.COM> wrote:
>Hi all,
>
>I have a crazy program that I'm hoping you can help make more efficient.
>I can't go into the logic as to what it's doing w/o writing a novel so I'm
>going to try to keep it short.
>
>Essentially I have 10 datasets (A-J) that I need to merge together. Each
>is a simple merge by primary/foreign keys that are simple auto_increment
>fields. The datasets come from Oracle journal tables so some of them are
>QUITE large.
>
>Before I merge them together, I have to subset each dataset by the journal
>time. However I have to merge each subset dataset back in with the full
>pre-subset dataset (haha, lost you there I bet!). So I'm merging: (subset
>of A) -> B -> C -> D... I then take A -> (subset B) -> C -> D... then A
>-> B -> (subset C) -> D... I do this for all 10 tables. In the end I set
>them all together, remove any possible duplicates, and some other logic
>not necessarily related to this post.
>
>Where I'm running into problems is that table J (or any table) may only
>have 1 or 2 records after being subsetted. However since J is the last to
>be merged in, the program is going through and merging all of the full
>datasets, thus creating a large dataset that takes a while to process
>(400,000+ records). If tables H, I, and J only have 1 or 2 records in the
>subset, the program is processing those 400,000+ records through each
>iteration. If there are 0 observations in a subsetted dataset, I do
>bypass the entire merge sequence.
>
>Any ideas on a better way to merge these datasets? It seems as if I
>could merge H -> I -> J -> G -> F... -> A then I -> J -> H -> G...
>
>I'm also merging via DATA step, meaning I'm having to sort those 400,000
>records before merging in each time. Would SQL be more efficient?
>
>Thanks in advance!
