LISTSERV at the University of Georgia
Date:         Tue, 17 Jun 2003 10:32:47 -0500
Reply-To:     Rodney Sparapani <rsparapa@MCW.EDU>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Rodney Sparapani <rsparapa@MCW.EDU>
Organization: Medical College of Wisconsin, Milwaukee
Subject:      Re: Actively seeking algorithm to compare the "likeness" of two
              character

Susie Li wrote:

> I appreciate Charles' suggestions. I'm somewhat aware of the existence of
> the tricks and rules mentioned for cleaning or standardizing data. For my
> set of data, not huge, I ended up manually checking the results.
>
> I thought I read in this list that there may be some categorical clustering
> algorithm to roughly group similar strings into pattern groups, something
> like,
>
> proc prinqual n=1 data=test out=testout;
>   transform opscore(charVar);
>
> Does anyone know this?

I don't know about that. But it is relatively trivial to search for matches
based on misspellings or perhaps alternative spellings which are similar. I
believe that I posted some code to SAS-L to do it using a cosine-like method
based on the likelihood. It only took me a couple of hours to develop the
algorithm, code it and tweak it. I'm attaching that code. Here's an excerpt of
the output. In each case, the desired match has the highest score. You might
improve on this by using more informative probabilities, e.g. z and x are less
frequent than a and e.

lastn       search      max

sauer       sawer       13.0716
baewer      sawer       10.0096
lawyer      sawer       10.0096
sander      sawer       10.0096
sanger      sawer       10.0096

schapira    shapira     19.8231
schapira    shapira     19.8231
shapiro     shapira     19.5878
shapiro     shapira     19.5878
sheppard    shapira     13.3854
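The suggestion about more informative probabilities could be sketched roughly as follows. This is only one possible reading, written in Python rather than SAS so it can be run standalone: a match on a rare letter counts for more than a match on a common one, with each letter weighted by -log of its relative frequency among the candidate names. The function names letter_weights and weighted_matches are hypothetical, and the frequencies are estimated from the name list itself rather than from any external table.

```python
import math
from collections import Counter


def letter_weights(names):
    """Return -log(relative frequency) for each letter seen in names.

    Assumption: frequencies are estimated from the candidate names
    themselves, so rare letters receive larger weights.
    """
    counts = Counter(ch for name in names for ch in name)
    total = sum(counts.values())
    return {ch: -math.log(c / total) for ch, c in counts.items()}


def weighted_matches(search, name, weights):
    """Sum the weights of letters that agree position by position.

    Only the overlapping positions are compared; letters missing from
    the weight table contribute nothing.
    """
    return sum(weights.get(a, 0.0) for a, b in zip(search, name) if a == b)


# Last names taken from the output excerpt above.
names = ["sauer", "baewer", "lawyer", "sander", "sanger",
         "schapira", "shapiro", "sheppard"]
weights = letter_weights(names)
score = weighted_matches("sawer", "sauer", weights)
```

With this name list, y occurs once and a occurs nine times, so a match on y would carry more weight than a match on a. Handling names of different lengths would still need the sliding comparison used in the attached code.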

--
Rodney Sparapani              Medical College of Wisconsin
Sr. Biostatistician           Patient Care & Outcomes Research
rsparapa@mcw.edu              http://www.mcw.edu/pcor
Was 'Name That Tune' rigged?  WWLD -- What Would Lombardi Do

Attachment: pmatch.sas

data names;
    infile "../mcw.txt" firstobs=3;
    length dummy $ 5 alias firstn middlen lastn $ 30 comment $ 50;
    input dummy $ alias comment &;

    i=index(alias, '.');

    firstn=substr(alias, 1, i-1);
    lastn=substr(alias, i+1);

    i=index(lastn, '.');

    if i then do;
        middlen=substr(lastn, 1, i-1);
        lastn=substr(lastn, i+1);

        if index(lastn, '.') then lastn=translate(lastn, ' ', '.');
    end;
run;

proc sort data=names; by lastn firstn; run;

%macro main(arg, obs);

/* one-row data set holding the search string */
data search;
    search="&arg";
    output;
run;

data search;
    retain a b;
    if _n_=1 then do;
        set search point=_n_;   /* load the search string once */

        a=log(26);              /* added for each matching letter */
        b=a-log(25);            /* added for each mismatch: log(26/25) */
    end;

    set names;

    m=length(search);
    n=length(lastn);

    max=0;
    k=-abs(n-m)*a;              /* penalty for the difference in length */

    /* Slide the shorter string along the longer one. Note that k is   */
    /* not reset between offsets i, so it accumulates across them; the */
    /* scores in the posted output reflect this.                       */
    if m=n then do;
        do j=1 to m;
            l=(substr(search, j, 1)=substr(lastn, j, 1));
            k=k+l*a+(1-l)*b;
        end;
        if k>max then max=k;
    end;
    else if m<n then do i=0 to n-m;
        do j=1 to m;
            l=(substr(search, j, 1)=substr(lastn, j+i, 1));
            k=k+l*a+(1-l)*b;
        end;
        if k>max then max=k;
    end;
    else if m>n then do i=0 to m-n;
        do j=1 to n;
            l=(substr(search, j+i, 1)=substr(lastn, j, 1));
            k=k+l*a+(1-l)*b;
        end;
        if k>max then max=k;
    end;
run;

proc sort data=search;
    by descending max lastn;
run;

proc print data=search(obs=&obs);
    id lastn;
    var search max;
run;

%mend main;

%main(sawer, 5);
%main(weirshke, 5);
%main(shapira, 5);
%main(sparpani, 5);
%main(zang, 5);
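For readers who want to check the scores without SAS, the scoring loop in pmatch.sas can be re-expressed as a short Python sketch. The function name pmatch_score is an assumption for illustration and is not part of the posted code. The sketch deliberately mirrors the SAS data step as written, including the fact that the running total k is carried over from one offset to the next instead of being reset.

```python
import math

A = math.log(26)        # added for each matching letter
B = A - math.log(25)    # added for each mismatching letter: log(26/25)


def pmatch_score(search: str, lastn: str) -> float:
    """Mirror the scoring loop of pmatch.sas for one pair of names."""
    m, n = len(search), len(lastn)
    # Slide the shorter string along the longer one.
    short, long_ = (search, lastn) if m <= n else (lastn, search)
    k = -abs(n - m) * A  # penalty for the difference in length
    best = 0.0
    for i in range(len(long_) - len(short) + 1):
        for j, ch in enumerate(short):
            k += A if ch == long_[i + j] else B
        # As in the SAS code, k is not reset between offsets.
        best = max(best, k)
    return best


# Pairs taken from the output excerpt in the message.
for name in ("sauer", "baewer", "schapira", "shapiro", "sheppard"):
    target = "sawer" if name in ("sauer", "baewer") else "shapira"
    print(f"{name:10s} {target:10s} {pmatch_score(target, name):.4f}")
```

Resetting k to the length penalty at the start of each offset would score each alignment on its own, but the resulting numbers would then differ from the posted output for names of unequal length.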
