SOUNDEX, SPEDIS, and COMPGED work fairly well for fuzzy-matching
problems of around 20,000 sets of identifiers. The SAS-L Archives
contain many descriptions of how each of these functions works.
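For anyone who wants to see the idea behind edit-distance scoring, here
is a minimal sketch in Python, not SAS. It computes plain Levenshtein
distance, which is NOT what SPEDIS or COMPGED return (both apply their
own per-operation costs), so the numbers will differ from SAS output.
The names edit_distance and is_fuzzy_match and the 0.2 threshold are
illustrative assumptions, not anything taken from SAS.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance: unit cost per insert, delete, substitute.

    This is a stand-in for illustration only; SPEDIS and COMPGED use
    different cost schemes and will give different scores.
    """
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def is_fuzzy_match(a: str, b: str, max_ratio: float = 0.2) -> bool:
    """Flag a pair when the distance is small relative to the longer name.

    max_ratio is a hypothetical threshold; tune it against hand-checked pairs.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return True
    return edit_distance(a, b) / longest <= max_ratio


if __name__ == "__main__":
    pairs = [
        ("FIDELITY", "FIEDELITY"),
        ("FIDELITY MANAGEMENT", "FIDELITY MGMT"),
    ]
    for left, right in pairs:
        print(left, "|", right, "|", edit_distance(left, right),
              "|", is_fuzzy_match(left, right))
```

With this sketch the misspelling pair is one edit apart and is flagged,
while FIDELITY MANAGEMENT vs. FIDELITY MGMT is six edits apart and is
not. Abbreviations like that usually need a lookup table or a looser
threshold rather than a distance score alone.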
Please write back if you have questions.
From: firstname.lastname@example.org [mailto:email@example.com]
On Behalf Of firstname.lastname@example.org
Sent: Friday, September 07, 2007 10:15 PM
Subject: fuzzy match of two name variables
I have a dataset (around 20,000 observations) which contains two
variables, Var1 and Var2. Both of them are either persons' names or
entities' names. What I want to do is to find the cases where Var1 and
Var2 refer to the same name.
The problems are:
1) names are the only identifier I have;
2) both variables could contain spelling errors (e.g., Fidelity vs.
Fiedelity) or variations of one name (e.g., Fidelity management vs.
Fidelity MGMT Inc.).
I've standardized both variables by converting them to uppercase,
deleting special characters, removing special suffixes (such as INC),
and deleting multiple blanks, etc.
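The clean-up steps just listed can be sketched as follows. This is
Python for illustration only, not the SAS code actually used. The
function name standardize and the SUFFIXES set are hypothetical (the
post names only INC), and the sketch assumes special characters are
replaced by a blank rather than deleted outright, so that FUND-A and
FUND A end up identical.

```python
import re

# Hypothetical suffix list; the original post names only INC.
SUFFIXES = {"INC"}


def standardize(name: str) -> str:
    """Uppercase, blank out special characters, drop trailing suffixes,
    and collapse repeated blanks."""
    name = name.upper()
    # Assumption: replace anything that is not A-Z, 0-9, or a blank
    # with a blank instead of deleting it.
    name = re.sub(r"[^A-Z0-9 ]", " ", name)
    words = name.split()  # splitting on whitespace also collapses blanks
    while words and words[-1] in SUFFIXES:
        words.pop()       # remove trailing suffix words such as INC
    return " ".join(words)


if __name__ == "__main__":
    for raw in ["Fidelity MGMT Inc.",
                "ACORN FUND-A SERIES OF THE ACORN INVESTM",
                "ABELE; JOHN E."]:
        print(repr(raw), "->", repr(standardize(raw)))
```

After this step the two ACORN FUND entries in the sample data compare
equal exactly, so only the remaining mismatches need a fuzzy comparison.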
I am wondering if functions such as SOUNDEX, SPEDIS, or COMPGED will
help here. Or something else in the fuzzy match category? (I
understand that no matter which method I use, I will probably still
have to manually check the matched results.)
Examples of Var1 look like the following:
A. Alfred Taubman
A.I.M. Overseas Ltd
ABBOTT LABS STOCK RETIREMENT TRUST
ABDULLAH TAHA BAKHSH
ABELE; JOHN E.
ACKERMANS & VAN HAAREN GROUP
ACORN FUND A SERIES OF THE ACORN INVESTM
ACORN FUND-A SERIES OF THE ACORN INVESTM
ACTINIUM HOLDING CORP
ADAMS; MARY C.
Thanks very much for your comments!