Date: Fri, 7 Feb 2003 19:00:52 GMT
Reply-To: julierog@ix.netcom.com
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Roger Lustig <trovato@BELLATLANTIC.NET>
Subject: Re: Merging datasets ...
Content-Type: text/plain; charset=us-ascii; format=flowed
Roger Lustig wrote:
> Next: a pointer lookup with both files sorted.
>
Here it is--minus the pointer. I modified Ian's code using the
assumption that both input files were sorted. (My code is below.)
I reran the format-style solution with 10 million numbers and 1.26
million intervals. Timing results:
--Prepare CNTLIN file and make format: 47.46/16.49 (total/CPU secs.)
--Apply format in a DATA step: 84.43/81.73.
The lookup approach with sorted files is very fast: 7.20/6.35.
BUT: sorting the 10 million numbers took 47.57/32.32, and sorting the
1.2 million intervals added a few seconds more. Still not enough to tip
the scales in this case, but we're no longer orders of magnitude apart.
Of course, this is a pretty unusual case, with more intervals than I
could ever imagine using for any purpose. I should cut the number of
intervals way down and test the effect on performance.
The intervals here are also somewhat perverse, being as narrow as
possible. A mix of interval widths might be interesting too.
=================================================
To choose a method of finding range-based IDs to a mess of numbers, you
need to consider some or all of the following:
--How many records are there in each file (numbers and ranges)?
--Are your files already sorted?
--How much other information is (and needs to be) on each record?
Sorting long records can increase sort time by a great deal.
--How often will you be doing this? Creating a lookup table or a format
may take time and resources, but you can save and reapply the result.
--How much memory do you have? Huge formats will take *very* long to
create once you bump up against your system's memory limit. Of course,
you can create the format on a system (even one on another
platform--we're talking SAS here) that has lots of memory, then access
the format on whatever machine(s) need to use it.
--Are you going to need the ID for data step programming, or can it be
applied by format in a PROC? If you use the lookup method, you may be
adding a data step, which can cost a bit in I/O if each record is big.
--How much coding do you wish to do? (I.e., how much coding, and of
what level of difficulty, do you want/need/expect/trust others to do?)
Applying the format is a matter of one statement that should have little
or no effect on the structure of a data step, if it's needed there.
Doing the lookup requires some knowledge of the fine points of the data
step; and if that data step is to accomplish other things too, there may
be some real thinking involved.
=================================================
Here's the code for the sorted lookup:
*** At top, FLAG and ID are set to null. XFLAG is not.;
data formatted (keep=number id);
*** neither are FROM, TO, and XID;
retain from to xid;
*** Get the next number;
set mysas.stuff ;
*** Possible in-range match? Check it out;
if not (xflag) then do until ( flag ) ;
*** Below or within next range;
if number <= to then do;
*** Stop loop;
flag = 1;
*** Within range--assign an ID;
if number >= from then id=xid;
end;
*** No more ranges to pull in, and too high for last range;
else if last then do;
*** Shut down the loop;
flag=1;
*** Don't even try the loop any more;
xflag + 1;
*** Leave ID null;
end;
*** Too high for last range, but more ranges to check;
else set mysas.ranges (rename=(id=xid)) end=last ;
*** Try again;
end ;
run ;
============================================================
Comments welcome!
Roger