Date: Mon, 10 May 2010 14:49:50 -0400
Reply-To: Chang Chung <chang_y_chung@HOTMAIL.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Chang Chung <chang_y_chung@HOTMAIL.COM>
Subject: Re: Code Challenge: matching names
On Mon, 10 May 2010 11:33:44 -0400, Fehd, Ronald J. (CDC/OSELS/NCPHI)
<rjf2@CDC.GOV> wrote:
...
>Abstract: Fuzzy matching technique:
...
>http://www.gasug.org/papers/index.htm
Hi,
A quick review of the code in the presentation file
(http://www.gasug.org/papers/GASUG_2010_April.pptx) shows that
a big chunk of the data step code appears twice.
If these parts are refactored into a user-written function,
the data step becomes much simpler.
The function also makes explicit what the author means by "fuzzy matching,"
as shown below.
I made a small change in the algorithm, though.
The original author compared the number of matched characters (x)
against the length of one of the names compared, i.e.:
x > round(length(ofac_first)*0.9)
I changed it so that the calculation is based on the length of the
shorter name.
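To see why the choice of length matters, here is a minimal sketch with two
made-up names. The variable cust_first and both values are assumptions for
illustration, not taken from the presentation. It only computes the two
cutoffs that the matched-character count would be compared against:

```sas
data _null_;
length ofac_first cust_first $ 20;
/* hypothetical names, for illustration only */
ofac_first = "ALEXANDER";
cust_first = "ALEX";
/* cutoff based on the length of one particular name */
cutoffOneName = round(length(ofac_first) * 0.9);
/* cutoff based on the length of the shorter name */
cutoffShortest = round(min(length(ofac_first), length(cust_first)) * 0.9);
put cutoffOneName= cutoffShortest=;
run;
/* on log
cutoffOneName=8 cutoffShortest=4
*/
```

With only four positions available for comparison, the count of matched
characters can never exceed the first cutoff, however well the names agree.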
It also seems that the author considers it a match when a name is
completely embedded in the other (by using an index function); and the
first character is special in that it has to match in order for a pair of
names to be considered a match. These two conditions are not implemented
in my function, but can be easily added.
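As a rough sketch of what those two extra conditions could look like, the
data step below computes them as 0/1 flags. The names name1, name2, embedded
and sameFirst, and the sample values, are my own assumptions; how the
presentation combines these checks with the character count is not
reproduced here.

```sas
data _null_;
length name1 name2 $ 20;
/* hypothetical names, for illustration only */
name1 = "MARY ANN";
name2 = "MARY";
/* 1 when either name, with trailing blanks trimmed, occurs inside the other */
embedded = (index(trim(name1), trim(name2)) > 0)
or (index(trim(name2), trim(name1)) > 0);
/* 1 when the first characters agree */
sameFirst = (substr(name1, 1, 1) = substr(name2, 1, 1));
put embedded= sameFirst=;
run;
/* on log
embedded=1 sameFirst=1
*/
```

A blank name may need separate handling, since trim returns a single blank
for it and that blank can be found inside a name with an embedded space.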
Finally, I would recommend reviewing built-in functions like soundex and
spedis, which may turn out to be better tools for "fuzzy matching" than
this custom-made one.
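For anyone who wants to try them, a minimal sketch of both built-ins on a
pair of made-up names follows. The variable names and values are assumptions
for illustration, and what spedis distance counts as "close enough" is
something you would have to tune on your own data.

```sas
data _null_;
length name1 name2 $ 20;
/* hypothetical names, for illustration only */
name1 = "SMITH";
name2 = "SMYTHE";
/* 1 when both names encode to the same soundex value */
sameSoundex = (soundex(name1) = soundex(name2));
/* spelling distance of name1 (query) from name2 (keyword); 0 means identical */
dist = spedis(name1, name2);
put sameSoundex= dist=;
run;
```

Note that spedis is not symmetric: swapping the two arguments can give a
different distance.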
The code below ran on 9.2 (TS1M0) on W32_VSPRO. Hope this helps a bit.
Cheers,
Chang
proc fcmp outlib=work.myFcmp.util;
function nameMatch(name1 $, name2 $, threshold);
* returns 1 if names match, 0 otherwise.
* a match means the proportion of matched
* non-blank chars over the threshold,
* where 0 < threshold < 1.0.
* returns 0 also when names are matched trivially,
* i.e. length(name1) = length(name2) = 0;
var len1 len2 minLen i nMatch;
* lengthn returns 0 for a blank string, whereas length returns 1;
len1 = lengthn(name1);
len2 = lengthn(name2);
if len1=0 or len2=0 then return (0);
nMatch = 0;
minLen = min(len1, len2);
do i = 1 to minLen;
nMatch + ( substr(name1,i,1)=substr(name2,i,1) );
end;
return (nMatch >= round(minLen * threshold));
endsub;
quit;
/* check */
%let cmplib = %sysfunc(getoption(cmplib));
options cmplib = (work.myFcmp &cmplib);
data _null_;
ans1 = nameMatch("abc", "abc", 0.8);
ans2 = nameMatch("----+----0", "----+----1", 0.9);
ans3 = nameMatch("----+----0", "----+---11", 0.9);
put ans1= ans2= ans3=;
run;
/* on log
ans1=1 ans2=1 ans3=0
*/
options cmplib = &cmplib;