Date: Tue, 17 Jun 2003 10:32:47 -0500
Reply-To: Rodney Sparapani <rsparapa@MCW.EDU>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Rodney Sparapani <rsparapa@MCW.EDU>
Organization: Medical College of Wisconsin, Milwaukee
Subject: Re: Actively seeking algorithm to compare the "likeness" of two character
Susie Li wrote:
>I appreciate Charles' suggestions. I'm somewhat aware of the existence of
>the tricks and rules mentioned for cleaning or standardizing data. For my
>set of data, not huge, I ended up manually checking the results.
>
>I thought I read in this list that there may be some categorical clustering
>algorithm to roughly group similar strings into pattern groups, something
>like,
>
>proc prinqual n=1 data=test out=testout;
> transform opscore(charVar);
>
>Does anyone know this?
>
I don't know about that. But it is fairly easy to search for matches
among misspellings or similar alternative spellings. I believe that I
posted some code to SAS-L that does it, using a cosine-like method based
on the likelihood. It took me only a couple of hours to develop the
algorithm, code it and tweak it. I'm attaching that code. You might
improve on it by using more informative probabilities, e.g. z and x are
less frequent than a and e. Here's an excerpt of the output (lastn is the
candidate name, search is the search string, max is the score). In each
case, the desired match has the highest score.
lastn       search      max
sauer       sawer       13.0716
baewer      sawer       10.0096
lawyer      sawer       10.0096
sander      sawer       10.0096
sanger      sawer       10.0096
schapira    shapira     19.8231
schapira    shapira     19.8231
shapiro     shapira     19.5878
shapiro     shapira     19.5878
sheppard    shapira     13.3854
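As a rough check of how these scores arise, the first one can be reproduced by hand. This is only a sketch, assuming the constants used in the attached pmatch.sas: a=log(26) is credited for each matching letter and b=a-log(25) for each mismatching one. For sauer vs. sawer the lengths are equal, so there is no length penalty, and there are 4 matches and 1 mismatch (u vs. w).

```sas
/* Reproduce the score for lastn=sauer, search=sawer */
data _null_;
    a = log(26);          /* credit per matching letter    */
    b = a - log(25);      /* credit per mismatching letter */
    score = 4*a + 1*b;    /* 4 matches, 1 mismatch         */
    put score= 8.4;       /* writes score=13.0716 to the log */
run;
```

A mismatch still earns a small credit (about 0.04), so a candidate is separated from the rest mainly by its number of matching letters and by the length penalty.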
--
Rodney Sparapani              Medical College of Wisconsin
Sr. Biostatistician           Patient Care & Outcomes Research
rsparapa@mcw.edu              http://www.mcw.edu/pcor
Was 'Name That Tune' rigged?  WWLD -- What Would Lombardi Do
Attachment: pmatch.sas
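/* Read the aliases; an alias is assumed to look like */
/* first.last or first.middle.last                    */
data names;
    infile "../mcw.txt" firstobs=3;
    length dummy $ 5 alias firstn middlen lastn $ 30 comment $ 50;
    input dummy $ alias comment &;
    /* Split the alias at the first period */
    i=index(alias, '.');
    firstn=substr(alias, 1, i-1);
    lastn=substr(alias, i+1);
    /* A second period means there is a middle name as well */
    i=index(lastn, '.');
    if i then do;
        middlen=substr(lastn, 1, i-1);
        lastn=substr(lastn, i+1);
        /* Any remaining periods in the last name become blanks */
        if index(lastn, '.') then lastn=translate(lastn, ' ', '.');
    end;
run;
proc sort data=names;
    by lastn firstn;
run;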
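%macro main(arg, obs);
/* One-row data set holding the search string */
data search;
    search="&arg";
    output;
run;
/* Score every last name in NAMES against the search string */
data search;
    retain a b;
    if _n_=1 then do;
        set search point=_n_;  /* read the search string once; it is retained */
        a=log(26);             /* credit for a matching letter */
        b=a-log(25);           /* small credit for a mismatching letter */
    end;
    set names;
    m=length(search);
    n=length(lastn);
    max=0;
    k=-abs(n-m)*a;             /* penalty for the difference in length */
    /* Equal lengths: compare letter by letter */
    if m=n then do;
        do j=1 to m;
            l=(substr(search, j, 1)=substr(lastn, j, 1));
            k=k+l*a+(1-l)*b;
        end;
        if k>max then max=k;
    end;
    /* Search string is shorter: slide it along the last name.     */
    /* Note that k is not reset between offsets, so the credits    */
    /* accumulate over all alignments.                             */
    else if m<n then do i=0 to n-m;
        do j=1 to m;
            l=(substr(search, j, 1)=substr(lastn, j+i, 1));
            k=k+l*a+(1-l)*b;
        end;
        if k>max then max=k;
    end;
    /* Search string is longer: slide the last name along it */
    else if m>n then do i=0 to m-n;
        do j=1 to n;
            l=(substr(search, j+i, 1)=substr(lastn, j, 1));
            k=k+l*a+(1-l)*b;
        end;
        if k>max then max=k;
    end;
run;
/* Best scores first */
proc sort data=search;
    by descending max lastn;
run;
/* Print the top &obs candidates */
proc print data=search(obs=&obs);
    id lastn;
    var search max;
run;
%mend main;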
%main(sawer, 5);
%main(weirshke, 5);
%main(shapira, 5);
%main(sparpani, 5);
%main(zang, 5);
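The refinement suggested in the message (rarer letters such as z and x should count for more than a and e) might be sketched as follows. This is not part of pmatch.sas and is only one possible reading of the idea: each matching letter is credited with -log of its relative frequency, estimated from the names themselves. The data sets demo, letters, freqs and scored, the variable best and the weight array w are all hypothetical names; the eight last names are simply those shown in the output excerpt. The sketch assumes lowercase a-z in an ASCII session, restarts the score at each alignment (unlike the attached code), and applies no length penalty.

```sas
/* Hypothetical demo data: last names from the output excerpt */
data demo;
    length lastn $ 30;
    input lastn $;
    datalines;
sauer
baewer
lawyer
sander
sanger
schapira
shapiro
sheppard
;
run;

/* One row per letter occurrence */
data letters;
    set demo;
    length letter $ 1;
    do j=1 to length(lastn);
        letter=substr(lastn, j, 1);
        output;
    end;
    keep letter;
run;

/* Relative frequency of each letter; PERCENT is on a 0-100 scale */
proc freq data=letters noprint;
    tables letter / out=freqs;
run;

data scored;
    array w[26] _temporary_;          /* w[1] is for a, w[26] is for z */
    if _n_=1 then do until (done);
        set freqs end=done;
        w[rank(letter)-rank('a')+1] = -log(percent/100);
    end;
    set demo;
    length search $ 30;
    search="sawer";                   /* example search string */
    m=length(search);
    n=length(lastn);
    best=0;
    do i=0 to abs(n-m);               /* slide the shorter string along the longer */
        k=0;                          /* restart the score for each alignment */
        do j=1 to min(m, n);
            if m<=n then do;
                c1=substr(search, j, 1);
                c2=substr(lastn, j+i, 1);
            end;
            else do;
                c1=substr(search, j+i, 1);
                c2=substr(lastn, j, 1);
            end;
            /* a matched letter always occurs in demo, so its weight exists */
            if c1=c2 then k=k+w[rank(c1)-rank('a')+1];
        end;
        if k>best then best=k;
    end;
    keep lastn search best;
run;

proc sort data=scored;
    by descending best lastn;
run;
proc print data=scored;
    id lastn;
    var search best;
run;
```

With real data the frequencies would more sensibly come from the full names data set, or from a published letter-frequency table, than from eight names.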