LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (February 2006, week 4)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:   Tue, 28 Feb 2006 15:02:21 -0500
Reply-To:   "Dorfman, Paul" <Paul.Dorfman@FCSO.COM>
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   "Dorfman, Paul" <Paul.Dorfman@FCSO.COM>
Subject:   How NOT to use the hash objects?
Content-Type:   text/plain; charset=iso-8859-1

I have spotted a rather intriguing title "Perform a fuzzy merge within a range using DATA step component objects" amongst SAS support code samples (sample 89) and, naturally, could not resist to take a look (probably I should have). The sample is intended to demonstrate how to use DSCI objects to tackle the task given below (note that I have tidied it up a bit by aligning fields, uniform indentation, etc. and added the first and last records to the data set ONE, which I will explain later):

data one ; input lastname: $15. typeofcar: $15. mileage ; datalines ; Baker Chevy 2000 Jones Toyota 7435 Smith Toyota 13001 Jones2 Ford 3433 Smith2 Toyota 15032 Shepherd Nissan 4300 Shepherd2 Honda 5582 Williams Ford 10532 Zack BMW 20000 run ;

data two ; input startrange endrange typeofservice & $35. ; datalines ; 3000 5000 oil change 5001 6000 overdue oil change 6001 8000 oil change and tire rotation 8001 9000 overdue oil change 9001 11000 oil change 11001 12000 overdue oil change 12001 14000 oil change and tire rotation 14001 14999 overdue oil change 15000 15999 15000 mile check run ;

So, what the "fuzzy merge" is supposed to do with these data sets? Grab MILEAGE from ONE and find its range in TWO, then write the corresponding TYPEOFSERVICE to data set OUT. Let's leave the question of just how appropriate the term "fuzzy" may be in this context aside because of... its fuzziness and concentrate on the task at hand. The SAS tool specifically designed to search ranges is SAS [in]formats, so the first thought would be to make TWO into a file suitable for CNTLIN=, create the format, and go. SQL heads will undoubtedly (and deservedly so) find a SQL join with BETWEEN most elegant and concise. However, I was particularly curious how a hash table might be applied to the problem, for dealing with ranges is the Achilles heel of any search algorithm specifically tuned to find/reject exact matches. Here is the hash approach offered in the Sample 89 (also with added spaces/indents, but otherwise intact):

data out (keep = lastname typeofcar mileage typeofservice) ; length startrange endrange 8 typeofservice $35 ; if _n_ = 1 then do ; declare hash h (dataset:'two', hashexp:4); h.definekey ('startrange'); h.definedata ('startrange','endrange','typeofservice') ; declare hiter hiter('h') ; h.definedone () ; call missing (startrange, endrange, typeofservice) ; end ; set one ; rc = hiter.first() ; do while (rc = 0) ; if startrange le mileage le endrange then leave ; rc1 = hiter.next() ; end ; run ;

What is the grand strategery here? Load TWO into a hash table keyed by *something* (above, it is [unique] STARTRANGE, although _N_ would be a better candidate). Then grab MILEAGE from ONE and deposit the entries exactly equivalent to the records from TWO from the hash object into PDV one at a time until MILEAGE falls between STARTRANGE and ENDRANGE. In other words, do a *sequential* search through the hash table, each act of which additionally involves the moving of a whole data record into the PDV. Such modus operandi erases the whole advantage of using a hash table, whose purpose is to eliminate the need of searching sequentially. So, should the entire hash approach be scrapped?

Before considering the question, let us move a couple of sloppier pieces out of the way. Is there anything else wrong with the proposed code? Well, yeah, and anyone who would sumbit the DATA step as above against the input as above will find that pretty soon - or rather never, depending on user intervention. As I stated above, the original input data ONE contained neither BAKER nor ZACK whose MILEAGE falls out of any range in TWO. Any of these records will cause the do-while loop to iterate infinitely because neither of the conditions (rc=0) or (startrange le mileage le endrange) will be satisfied. I can only wonder what made the developer to code RC1 instead of RC inside the loop. And then fixing this glitch, even though it will stop the loop, will still fail to make TYPEOFSERVICE missing when MILEAGE is not in any range, and retain its previous value, so a conditional CALL MISSING is due.

Let us now see if the hash can indeed be a useful tool for this type of task. If the range were character, I would say "no", use another technique. Here, though, the range is limited and integer, so the "paintbrush" method can be employed:

data out2 (keep = lastname typeofcar mileage typeofservice) ; if _n_ = 1 then do ; declare hash h () ; h.definekey ('mileage' ) ; h.definedata ('typeofservice') ; h.definedone () ; do until (z) ; set two end = z ; do mileage = startrange to endrange ; h.replace() ; end ; end ; end ; set one ; if h.find() ne 0 then call missing (typeofservice) ; run ;

Sure it causes the table to house data for each value in the range, but now searching for it is done in but a single shot. In the future, when SAS adds the h.range(k1,k2) method to the roster, paintbrushing will be thus rendered unnecessary. OK, I just made this one up: I do not know if the method will be added or not, but from the data structure standpoint it should not be terribly difficult to do because underneath the hood lurking are the AVL trees (if a table can be internally ordered, as is already the case, finding if a key falls in a range is a simple tree operation).

Finally, the same as above can of course be done using the lowly array:

data out3 (keep = lastname typeofcar mileage typeofservice) ; array r [99999] $35 _temporary_ ; if _n_ = 1 then do _n_ = 1 by 1 until (z) ; set two end = z ; do mileage = startrange to endrange ; r[mileage] = typeofservice ; end ; end ; set one ; typeofservice = r[mileage] ; run ;

The advantage of the hash table is that its size does not have to be computed or guesses beforehand, but [this] array solution will run appreciably faster.

Like Mark uses to say, hope it was helpful <g>.

Kind regards ------------ Paul Dorfman Jax, FL ------------

First Coast Service Options, Inc., and its affiliates are not responsible for errors or omissions in the transmission of this e-mail message. Any personal comments made in this e-mail do not reflect the views of First Coast Service Options, Inc., or its affiliates. The information contained in this document may be confidential and is intended solely for the use of the individual or entity to whom it is addressed. This document may also contain material that is privileged or protected from disclosure under applicable law. If you are not the intended recipient or the individual responsible for delivery to the intended recipient, please (1) be advised that any use, dissemination, forwarding, or copying of this document IS STRICTLY PROHIBITED; and (2) notify sender immediately by telephone and destroy the document. Thank you.


Back to: Top of message | Previous page | Main SAS-L page