Date: Sun, 9 Sep 2007 22:25:17 -0700
Reply-To: David L Cassell <davidlcassell@MSN.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: David L Cassell <davidlcassell@MSN.COM>
Subject: Re: fuzzy match of two name variables
In-Reply-To: <CA8F89971ADA9F47A6C915BA2397844207B41F3A@MAILBE2.westat.com>
Content-Type: text/plain; format=flowed
HERMANS1@WESTAT.COM sagely replied:
>-----Original Message-----
>From: owner-sas-l@listserv.uga.edu [mailto:owner-sas-l@listserv.uga.edu]
>On Behalf Of lisiqi77@yahoo.com
>Sent: Friday, September 07, 2007 10:15 PM
>To: sas-l@uga.edu
>Subject: fuzzy match of two name variables
>
>
>Hi, All,
>
>I have a dataset (with obs around 20,000) which contains two variables,
>Var1 and Var2. Both of them are either persons' names or entities'
>names. What I want to do is to find the cases where Var1=Var2.
>
>The problems are:
>
>1) names are the only identifer I have;
>2) both variables could contain spelling errors (e.g., Fidelity vs.
>Fiedelity) or variations of one name (e.g., Fidelity management vs.
>Fidelity MGMT Inc.).
>
>I've stanardized both variables by turning them into upcases, deleting
>special characters, removing special suffix (such as INC), and deleting
>multiple blanks, etc.
>
>I am wondering if functions such as SOUNDEX, SPEDIS, or COMPGED will
>help here. Or something else in the fuzzy match category? (I
>understand that probably no matter which method I use, I still have to
>mannual check the matched result.)
>
>Examples of Var1 look like the following:
>
>A. Alfred Taubman
>A.I.M. Overseas Ltd
>ABBOTT LABS STOCK RETIREMENT TRUST
>ABDULLAH TAHA BAKHSH
>ABELE; JOHN E.
>ABRAMSON; LEONARD
>ACKERMAN; JOEL
>ACKERMANS & VAN HAAREN GROUP
>ACORN FUND A SERIES OF THE ACORN INVESTM
>ACORN FUND-A SERIES OF THE ACORN INVESTM
>ACTINIUM HOLDING CORP
>ADAMS; MARY C.
>...
>
>
>Thanks very much for your comments!
>
>Tracy
>Tracy:
>SOUNDEX, SPEDIS, and COMPGED work fairly well for fuzzy matching
>problems of around 20,000 sets of identifiers. The SAS-L Archives
>contain many descriptions of how each of these function/operators work.
>Please write back if you have questions.
>S
>
I personally prefer COMPGED, because you can fudge the matching
penalties with the CALL COMPCOST() routine first.
I find SOUNDEX() to be unfortunately simplistic in cases like this.
If the only thing you have to worry about is 'smith' vs. 'smyth'
and 'steve' vs. 'stevie', then the soundex algorithm will do. But
simple changes, like 'knuth' to 'nuth' will completely confuse
soundex, and comparing two soundex strings is really hard unless
they are identical.
HTCT,
David
--
David L. Cassell
mathematical statistician
Design Pathways
3115 NW Norwood Pl.
Corvallis OR 97330
_________________________________________________________________
Test your celebrity IQ. Play Red Carpet Reveal and earn great prizes!
http://club.live.com/red_carpet_reveal.aspx?icid=redcarpet_hotmailtextlink2
|