LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (February 2003, week 1)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Fri, 7 Feb 2003 19:00:52 GMT
Reply-To:     julierog@ix.netcom.com
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Roger Lustig <trovato@BELLATLANTIC.NET>
Subject:      Re: Merging datasets ...
Content-Type: text/plain; charset=us-ascii; format=flowed

Roger Lustig wrote: > Next: a pointer lookup with both files sorted. >

Here it is--minus the pointer. I modified Ian's code using the assumption that both input files were sorted. (My code is below.)

I reran the format-style solution with 10 million numbers and 1.26 million intervals. Timing results: --Prepare CNTLIN file and make format: 47.46/16.49 (total/CPU secs.) --Apply format in a DATA step: 84.43/81.73.

The lookup approach with sorted files is very fast: 7.20/6.35.

BUT: sorting the 10 million numbers took 47.57/32.32, and sorting the 1.2 million intervals added a few seconds more. Still not enough to tip the scales in this case, but we're no longer orders of magnitude apart.

Of course, this is a pretty unusual case, with more intervals than I could ever imagine using for any purpose. I should cut the number of intervals way down and test the effect on performance.

The intervals here are also somewhat perverse, being as narrow as possible. A mix of interval widths might be interesting too.

=================================================

To choose a method of finding range-based IDs to a mess of numbers, you need to consider some or all of the following:

--How many records are there in each file (numbers and ranges)?

--Are your files already sorted?

--How much other information is (and needs to be) on each record? Sorting long records can increase sort time by a great deal.

--How often will you be doing this? Creating a lookup table or a format may take time and resources, but you can save and reapply the result.

--How much memory do you have? Huge formats will take *very* long to create once you bump up against your system's memory limit. Of course, you can create the format on a system (even one on another platform--we're talking SAS here) that has lots of memory, then access the format on whatever machine(s) need to use it.

--Are you going to need the ID for data step programming, or can it be applied by format in a PROC? If you use the lookup method, you may be adding a data step, which can cost a bit in I/O if each record is big.

--How much coding do you wish to do? (I.e., how much coding, and of what level of difficulty, do you want/need/expect/trust others to do?) Applying the format is a matter of one statement that should have little or no effect on the structure of a data step, if it's needed there. Doing the lookup requires some knowledge of the fine points of the data step; and if that data step is to accomplish other things too, there may be some real thinking involved.

=================================================

Here's the code for the sorted lookup:

*** At top, FLAG and ID are set to null. XFLAG is not.; data formatted (keep=number id);

*** neither are FROM, TO, and XID; retain from to xid;

*** Get the next number; set mysas.stuff ;

*** Possible in-range match? Check it out; if not (xflag) then do until ( flag ) ;

*** Below or within next range; if number <= to then do;

*** Stop loop; flag = 1;

*** Within range--assign an ID; if number >= from then id=xid; end;

*** No more ranges to pull in, and too high for last range; else if last then do;

*** Shut down the loop; flag=1;

*** Don't even try the loop any more; xflag + 1;

*** Leave ID null; end;

*** Too high for last range, but more ranges to check; else set mysas.ranges (rename=(id=xid)) end=last ;

*** Try again; end ;

run ; ============================================================

Comments welcome!

Roger


Back to: Top of message | Previous page | Main SAS-L page