| Date: | Tue, 28 Feb 2006 15:02:21 -0500 |
| Reply-To: | "Dorfman, Paul" <Paul.Dorfman@FCSO.COM> |
| Sender: | "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU> |
| From: | "Dorfman, Paul" <Paul.Dorfman@FCSO.COM> |
| Subject: | How NOT to use the hash objects? |
| Content-Type: | text/plain; charset=iso-8859-1 |
I have spotted a rather intriguing title "Perform a fuzzy merge within a range
using DATA step component objects" amongst SAS support code samples (sample 89)
and, naturally, could not resist to take a look (probably I should have). The
sample is intended to demonstrate how to use DSCI objects to tackle the task
given below (note that I have tidied it up a bit by aligning fields,
uniform indentation, etc. and added the first and last records to the data
set ONE, which I will explain later):
data one ;
input lastname: $15. typeofcar: $15. mileage ;
datalines ;
Baker Chevy 2000
Jones Toyota 7435
Smith Toyota 13001
Jones2 Ford 3433
Smith2 Toyota 15032
Shepherd Nissan 4300
Shepherd2 Honda 5582
Williams Ford 10532
Zack BMW 20000
run ;
data two ;
input startrange endrange typeofservice & $35. ;
datalines ;
3000 5000 oil change
5001 6000 overdue oil change
6001 8000 oil change and tire rotation
8001 9000 overdue oil change
9001 11000 oil change
11001 12000 overdue oil change
12001 14000 oil change and tire rotation
14001 14999 overdue oil change
15000 15999 15000 mile check
run ;
So, what the "fuzzy merge" is supposed to do with these data sets?
Grab MILEAGE from ONE and find its range in TWO, then write the
corresponding TYPEOFSERVICE to data set OUT. Let's leave the
question of just how appropriate the term "fuzzy" may be in this
context aside because of... its fuzziness and concentrate on the
task at hand. The SAS tool specifically designed to search ranges
is SAS [in]formats, so the first thought would be to make TWO into
a file suitable for CNTLIN=, create the format, and go. SQL heads will
undoubtedly (and deservedly so) find a SQL join with BETWEEN most
elegant and concise. However, I was particularly curious how a hash
table might be applied to the problem, for dealing with ranges is the
Achilles heel of any search algorithm specifically tuned to find/reject
exact matches. Here is the hash approach offered in the Sample 89
(also with added spaces/indents, but otherwise intact):
data out (keep = lastname typeofcar mileage typeofservice) ;
length startrange endrange 8 typeofservice $35 ;
if _n_ = 1 then do ;
declare hash h (dataset:'two', hashexp:4);
h.definekey ('startrange');
h.definedata ('startrange','endrange','typeofservice') ;
declare hiter hiter('h') ;
h.definedone () ;
call missing (startrange, endrange, typeofservice) ;
end ;
set one ;
rc = hiter.first() ;
do while (rc = 0) ;
if startrange le mileage le endrange then leave ;
rc1 = hiter.next() ;
end ;
run ;
What is the grand strategery here? Load TWO into a hash table
keyed by *something* (above, it is [unique] STARTRANGE, although
_N_ would be a better candidate). Then grab MILEAGE from ONE and
deposit the entries exactly equivalent to the records from TWO
from the hash object into PDV one at a time until MILEAGE falls
between STARTRANGE and ENDRANGE. In other words, do a *sequential*
search through the hash table, each act of which additionally
involves the moving of a whole data record into the PDV. Such modus
operandi erases the whole advantage of using a hash table, whose
purpose is to eliminate the need of searching sequentially.
So, should the entire hash approach be scrapped?
Before considering the question, let us move a couple of sloppier
pieces out of the way. Is there anything else wrong with the proposed
code? Well, yeah, and anyone who would sumbit the DATA step as above
against the input as above will find that pretty soon - or rather never,
depending on user intervention. As I stated above, the original input
data ONE contained neither BAKER nor ZACK whose MILEAGE falls out of any
range in TWO. Any of these records will cause the do-while loop to iterate
infinitely because neither of the conditions (rc=0) or
(startrange le mileage le endrange) will be satisfied. I can only wonder
what made the developer to code RC1 instead of RC inside the loop. And then
fixing this glitch, even though it will stop the loop, will still fail to
make TYPEOFSERVICE missing when MILEAGE is not in any range, and retain its
previous value, so a conditional CALL MISSING is due.
Let us now see if the hash can indeed be a useful tool for this type of
task. If the range were character, I would say "no", use another technique.
Here, though, the range is limited and integer, so the "paintbrush" method
can be employed:
data out2 (keep = lastname typeofcar mileage typeofservice) ;
if _n_ = 1 then do ;
declare hash h () ;
h.definekey ('mileage' ) ;
h.definedata ('typeofservice') ;
h.definedone () ;
do until (z) ;
set two end = z ;
do mileage = startrange to endrange ;
h.replace() ;
end ;
end ;
end ;
set one ;
if h.find() ne 0 then call missing (typeofservice) ;
run ;
Sure it causes the table to house data for each value in the range,
but now searching for it is done in but a single shot. In the future,
when SAS adds the h.range(k1,k2) method to the roster, paintbrushing
will be thus rendered unnecessary. OK, I just made this one up: I do
not know if the method will be added or not, but from the data structure
standpoint it should not be terribly difficult to do because underneath
the hood lurking are the AVL trees (if a table can be internally ordered,
as is already the case, finding if a key falls in a range is a simple
tree operation).
Finally, the same as above can of course be done using the lowly array:
data out3 (keep = lastname typeofcar mileage typeofservice) ;
array r [99999] $35 _temporary_ ;
if _n_ = 1 then do _n_ = 1 by 1 until (z) ;
set two end = z ;
do mileage = startrange to endrange ;
r[mileage] = typeofservice ;
end ;
end ;
set one ;
typeofservice = r[mileage] ;
run ;
The advantage of the hash table is that its size does not have to be computed or guesses beforehand, but [this] array solution will run appreciably faster.
Like Mark uses to say, hope it was helpful <g>.
Kind regards
------------
Paul Dorfman
Jax, FL
------------
First Coast Service Options, Inc., and its affiliates are not responsible for errors or omissions in the transmission of this e-mail message. Any personal comments made in this e-mail do not reflect the views of First Coast Service Options, Inc., or its affiliates. The information contained in this document may be confidential and is intended solely for the use of the individual or entity to whom it is addressed. This document may also contain material that is privileged or protected from disclosure under applicable law. If you are not the intended recipient or the individual responsible for delivery to the intended recipient, please (1) be advised that any use, dissemination, forwarding, or copying of this document IS STRICTLY PROHIBITED; and (2) notify sender immediately by telephone and destroy the document. Thank you.
|