LISTSERV at the University of Georgia
Date:         Mon, 10 May 2010 14:49:50 -0400
Reply-To:     Chang Chung <chang_y_chung@HOTMAIL.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Chang Chung <chang_y_chung@HOTMAIL.COM>
Subject:      Re: Code Challenge: matching names
Comments: To: Ron Fehd <rjf2@CDC.GOV>

On Mon, 10 May 2010 11:33:44 -0400, Fehd, Ronald J. (CDC/OSELS/NCPHI) <rjf2@CDC.GOV> wrote:
...
>Abstract: Fuzzy matching technique: ...
>http://www.gasug.org/papers/index.htm

Hi,

A quick review of the code in the presentation file (http://www.gasug.org/papers/GASUG_2010_April.pptx) shows that a big chunk of the data step code is repeated twice. If these parts are refactored into a user-written function, the data step becomes much simpler, and the function makes explicit what the author means by "fuzzy matching," as shown below.

I made a small change in the algorithm, though. The original author compared the number of matched characters (x) against the length of one of the names compared, i.e.: x > round(length(ofac_first)*0.9). I changed it so that the calculation is based on the length of the shorter name.
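To see what difference this makes, here is a small sketch comparing the two cut-offs on one pair of names. The names and the variable other_first are made up for illustration; only ofac_first and the 0.9 factor come from the presentation.

```sas
data _null_;
  ofac_first  = "christopherson";  /* 14 chars */
  other_first = "christopher";     /* 11 chars, hypothetical second name */

  /* count position-by-position matches over the shorter name */
  minLen = min(length(ofac_first), length(other_first));
  x = 0;
  do i = 1 to minLen;
    x + (substr(ofac_first, i, 1) = substr(other_first, i, 1));
  end;

  cutOrig = round(length(ofac_first) * 0.9);  /* round(12.6) = 13 */
  cutMin  = round(minLen * 0.9);              /* round(9.9)  = 10 */

  matchOrig = (x > cutOrig);  /* 11 > 13, so 0 */
  matchMin  = (x > cutMin);   /* 11 > 10, so 1 */
  put x= cutOrig= cutMin= matchOrig= matchMin=;
run;
```

When the cut-off is taken from the longer name, the shorter name can never reach it, no matter how well it matches.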

It also seems that the author considers it a match when a name is completely embedded in the other (by using an index function); and the first character is special in that it has to match in order for a pair of names to be considered a match. These two conditions are not implemented in my function, but can be easily added.
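A rough sketch of how these two checks could look in a data step follows. The names are made up, and since I am not sure how the original combines the checks with the threshold test, the sketch only computes the two flags.

```sas
data _null_;
  name1 = "ann";     /* hypothetical test values */
  name2 = "joanne";

  /* embedded: one name occurs entirely inside the other.
     trim() drops trailing blanks, which index() would otherwise
     treat as part of the string being searched for. */
  embedded = (index(trim(name1), trim(name2)) > 0)
          or (index(trim(name2), trim(name1)) > 0);

  /* first characters must agree */
  sameFirst = (substr(name1, 1, 1) = substr(name2, 1, 1));

  put embedded= sameFirst=;
run;
```

Here "ann" is found inside "joanne", but the first characters differ, so the two conditions disagree for this pair.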

Finally, I would recommend reviewing built-in functions like soundex and spedis, which may turn out to be better tools for "fuzzy matching" than this custom-made one.
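For a first look at these two, something like the following may do. The names are made up, and any cut-off on the spedis value would have to be chosen by inspecting real data.

```sas
data _null_;
  length name1 name2 $20;
  name1 = "Jonathan";   /* hypothetical test values */
  name2 = "Jonathon";

  /* soundex: phonetic code; equal codes suggest similar-sounding names */
  sameSound = (soundex(name1) = soundex(name2));

  /* spedis: spelling distance from name1 to name2; 0 means identical.
     It is asymmetric, so the order of the arguments matters. */
  dist = spedis(name1, name2);

  put sameSound= dist=;
run;
```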

The code below ran on SAS 9.2 (TS1M0) on W32_VSPRO. Hope this helps a bit.

Cheers, Chang

proc fcmp outlib=work.myFcmp.util;
  function nameMatch(name1 $, name2 $, threshold);
    * returns 1 if names match, 0 otherwise.
    * a match means the proportion of matched
    * non-blank chars over the threshold,
    * where 0 < threshold < 1.0.
    * returns 0 also when names are matched trivially,
    * ie. length(name1) = length(name2) = 0;
    var len1 len2 minLen i nMatch;
    len1 = length(name1);
    len2 = length(name2);
    if len1=0 or len2=0 then return (0);
    nMatch = 0;
    minLen = min(len1, len2);
    do i = 1 to minLen;
      nMatch + ( substr(name1,i,1)=substr(name2,i,1) );
    end;
    return (nMatch > round(minLen * threshold));
  endsub;
quit;

/* check */
%let cmplib = %sysfunc(getoption(cmplib));
options cmplib = (work.myFcmp &cmplib);

data _null_;
  ans1 = nameMatch("abc", "abc", 0.8);
  ans2 = nameMatch("----+----0", "----+----1", 0.9);
  ans3 = nameMatch("----+----0", "----+---11", 0.9);
  put ans1= ans2= ans3=;
run;
/* on log
ans1=1 ans2=1 ans3=0
*/

options cmplib = &cmplib;

