Date:         Wed, 7 Oct 2009 19:00:49 -0400
Reply-To:     Paul Dorfman <sashole@BELLSOUTH.NET>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Paul Dorfman <sashole@BELLSOUTH.NET>
Subject:      Re: Better way to merge these datasets??
Comments: To: Andy A <neal@FOLKARTINBOTTLES.COM>

Andy,

Change the method of combining the data sets. First, index the original full-set files by the surrogate key.
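As a minimal sketch (using the hypothetical names set_b, set_c, etc. that appear below, and assuming surr_key is unique within each full set), the indexes could be built with PROC DATASETS:

   proc datasets lib = work nolist ;
      modify set_b ;
         index create surr_key / unique ;
      modify set_c ;
         index create surr_key / unique ;
      * ... and so on for each remaining full set ;
   quit ;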

Then, to merge a given subset to the corresponding full sets, use an SQL join instead of MERGE. The latter, even in the presence of indexes, reads every record from every file (such is the nature of the ordered-list merge algorithm), whereas the former will use the index on a full set when joining a small subset with it.
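For instance, a sketch only (the variables pulled from set_b and set_c are hypothetical; substitute your own):

   proc sql ;
      create table out as
      select a.*
           , b.var_b          /* hypothetical variable from set_b */
           , c.var_c          /* hypothetical variable from set_c */
      from subset_a a
           inner join set_b b on a.surr_key = b.surr_key
           inner join set_c c on a.surr_key = c.surr_key
      ;
   quit ;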

Alternatively, instead of using SQL, use a DATA step where the subset is read sequentially but the full sets are read directly via the KEY= option. That would look something like the following (assuming you are performing the equivalent of an inner join of subset_a with "its" full sets):

   data out ;
      set subset_a ;
      set set_b key = surr_key / unique ;
      if _iorc_ ne 0 then do ;
         _error_ = 0 ;   * reset before DELETE, which returns to the top of the step ;
         delete ;
      end ;
      set set_c key = surr_key / unique ;
      if _iorc_ ne 0 then do ;
         _error_ = 0 ;
         delete ;
      end ;
      * ... and so on for each remaining full set ;
   run ;

If you know beforehand that a subset is 30 per cent or larger relative to some of the full sets being combined with it, you can instead aggregate those full sets in a MERGE and use direct keyed reads only against the full sets of which the subset makes up less than 30 per cent. For example, if subset_a above is more than 30 per cent of sets E and H, say, we would have

   data out ;
      merge subset_a (in=a) set_e (in=e) set_h (in=h) ;
      by surr_key ;
      if a and e and h ;
      set set_b key = surr_key / unique ;
      if _iorc_ ne 0 then do ;
         _error_ = 0 ;
         delete ;
      end ;
      set set_c key = surr_key / unique ;
      if _iorc_ ne 0 then do ;
         _error_ = 0 ;
         delete ;
      end ;
      * ... and so on for each remaining full set ;
   run ;

However, I would still use SQL, because the optimizer is likely to take all this into consideration automatically, and it is less fussy to program.
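If in doubt whether the optimizer actually picks up the indexes, you can ask it to show its hand: the _METHOD option on the PROC SQL statement prints the access methods it selects (SQXJNDX denotes a join via index lookup). A quick check, using the same hypothetical names:

   proc sql _method ;
      create table out as
      select a.*
           , b.var_b          /* hypothetical variable from set_b */
      from subset_a a
           inner join set_b b on a.surr_key = b.surr_key
      ;
   quit ;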

Kind regards
------------
Paul Dorfman
Jax, FL
------------

On Wed, 7 Oct 2009 15:22:19 -0400, Andy A <neal@FOLKARTINBOTTLES.COM> wrote:

>Hi all,
>
>I have a crazy program that I'm hoping you can help make more efficient.
>I can't go into the logic as to what it's doing w/o writing a novel, so
>I'm going to try to keep it short.
>
>Essentially I have 10 datasets (A-J) that I need to merge together. Each
>is a simple merge by primary/foreign keys that are simple auto_increment
>fields. The datasets come from Oracle journal tables, so some of them
>are QUITE large.
>
>Before I merge them together, I have to subset each dataset by the
>journal time. However, I have to merge each subset dataset back in with
>the full pre-subset dataset (haha, lost you there I bet!). So I'm
>merging: (subset of A) -> B -> C -> D... I then take A -> (subset B) ->
>C -> D..., then A -> B -> (subset C) -> D... I do this for all 10
>tables. In the end I set them all together, remove any possible
>duplicates, and apply some other logic not necessarily related to this
>post.
>
>Where I'm running into problems is that table J (or any table) may have
>only 1 or 2 records after being subsetted. However, since J is the last
>to be merged in, the program is going through and merging all of the
>full datasets, thus creating a large dataset that takes a while to
>process (400,000+ records). If tables H, I, and J have only 1 or 2
>records in the subset, the program is processing those 400,000+ records
>through each iteration. If there are 0 observations in a subsetted
>dataset, I do bypass the entire merge sequence.
>
>Any ideas on a better way to merge these datasets? It seems I could
>merge H -> I -> J -> G -> F... -> A, then I -> J -> H -> G...
>
>I'm also merging via data step, meaning I'm having to sort those
>400,000 records before merging in each time. Would SQL be more
>efficient?
>
>Thanks in advance!

