```
Date:         Mon, 10 May 2010 14:49:50 -0400
Reply-To:     Chang Chung
Sender:       "SAS(r) Discussion"
From:         Chang Chung
Subject:      Re: Code Challenge: matching names
Comments:     To: Ron Fehd

On Mon, 10 May 2010 11:33:44 -0400, Fehd, Ronald J. (CDC/OSELS/NCPHI) wrote:
...
>Abstract: Fuzzy matching technique:
...
>http://www.gasug.org/papers/index.htm

Hi,

A quick review of the code in the presentation file
(http://www.gasug.org/papers/GASUG_2010_April.pptx) shows that a big
chunk of the data step code is repeated twice. If these parts are
refactored into a user-written function, then the data step can be
much simpler. And the function can highlight what the author means by
"fuzzy matching," as shown below.

I made a small change in the algorithm, though. The original author
compared the number of matched characters (x) against the length of
one of the names compared, i.e.:

  x > round(length(ofac_first)*0.9)

I changed it so that the calculation is based on the length of the
shorter name.

It also seems that the author considers it a match when a name is
completely embedded in the other (by using an index function); and the
first character is special in that it has to match in order for a pair
of names to be considered a match. These two conditions are not
implemented in my function, but can be easily added.

Finally, I would recommend reviewing built-in functions like soundex
and spedis, which may turn out to be better tools for "fuzzy
matching" than this custom-made one.

Below ran on 9.2 (TS1M0) on W32_VSPRO. Hope this helps a bit.

Cheers,
Chang

proc fcmp outlib=work.myFcmp.util;
  function nameMatch(name1 $, name2 $, threshold);
    * returns 1 if names match, 0 otherwise.
    * a match means the proportion of matched
    * non-blank chars over the threshold,
    * where 0 < threshold < 1.0.
    * returns 0 also when names are matched trivially,
    * ie. length(name1) = length(name2) = 0;
    var len1 len2 minLen i nMatch;
    len1 = length(name1);
    len2 = length(name2);
    if len1=0 or len2=0 then return (0);
    nMatch = 0;
    minLen = min(len1, len2);
    do i = 1 to minLen;
      nMatch + ( substr(name1,i,1)=substr(name2,i,1) );
    end;
    return (nMatch > round(minLen * threshold));
  endsub;
quit;

/* check */
%let cmplib = %sysfunc(getoption(cmplib));
options cmplib = (work.myFcmp &cmplib);

data _null_;
  ans1 = nameMatch("abc", "abc", 0.8);
  ans2 = nameMatch("----+----0", "----+----1", 0.9);
  ans3 = nameMatch("----+----0", "----+---11", 0.9);
  put ans1= ans2= ans3=;
run;
/* on log
ans1=1 ans2=1 ans3=0
*/

options cmplib = &cmplib;
```
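
The message names two conditions that nameMatch leaves out (a name completely embedded in the other, and a required first-character match) and points to the built-in soundex and spedis functions. The data step below is a minimal sketch of those checks, not the presentation author's code. The sample names and the variable names firstOk, embedded, sameSoundex and dist are made up for illustration, and since the message does not say how these conditions should combine with the threshold test, the flags are only printed.

```sas
/* Sketch only: sample names and variable names are assumptions. */
data _null_;
  length name1 name2 $ 20;
  input name1 $ name2 $;

  /* assumed rule: the first characters must agree */
  firstOk = (substr(name1, 1, 1) = substr(name2, 1, 1));

  /* assumed rule: one name is wholly contained in the other;    */
  /* strip() removes padding so index() compares the names only  */
  embedded = (index(strip(name1), strip(name2)) > 0
           or index(strip(name2), strip(name1)) > 0);

  /* built-in alternatives mentioned in the message */
  sameSoundex = (soundex(name1) = soundex(name2));
  dist = spedis(name1, name2);  /* spelling distance; 0 means identical */

  put name1= name2= firstOk= embedded= sameSoundex= dist=;
  datalines;
JOHN JOHNSON
SMITH SMYTH
;
run;
```

Note that spedis is asymmetric in its two arguments, so a real comparison might evaluate it in both directions and pick a cutoff by inspecting known matches.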
