Date: Mon, 16 Jun 2003 10:30:13 -0400
Reply-To: Charles Patridge <charles_s_patridge@PRODIGY.NET>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Charles Patridge <charles_s_patridge@PRODIGY.NET>
Subject: Re: Actively seeking algorithm to compare the "likeness" of two
character strings
Dear Susie Li,
First let me say there are a number of commercial products on the market,
including SAS Data Quality/Data Cleansing (Proc Match), which do an
excellent job of matching addresses, including the "likeness" of addresses
such as your illustration.
Secondly, may I suggest you look at what I wrote in a paper for SAS Online
Observation (Fuzzy Matching) - you can see this on my site as well as on
SI's web site.
Thirdly, you are on the correct path in "being tired with exact matches"
and in trying various statistical methods to do this work for you.
However, may I caution you against assuming that statistical methods
produce 100% correct answers and then using those results in an automated
fashion for other processing.
One approach I took with your problem was to look at many of the common
variations of how numerous systems/databases/industries might try to store
the word "street" and settle on one such generic variation - say "ST".
I then do this for most of the US Postal Abbreviations of Street Addresses
and build a generic standardization of all such street addresses.
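As a rough illustration of that standardization step, here is a minimal
sketch in Python rather than SAS. It is not the code behind the paper
mentioned above. The STANDARD_FORMS table and the standardize() function
are hypothetical names, and the table is deliberately tiny; a real one
would cover most of the US Postal abbreviations.

```python
import re

# Hypothetical lookup table (an assumption for illustration only):
# each common variation maps to one generic form.
STANDARD_FORMS = {
    "STREET": "ST", "STR": "ST", "ST": "ST",
    "AVENUE": "AVE", "AV": "AVE", "AVE": "AVE",
    "NORTH": "N", "N": "N",
    "SOUTH": "S", "S": "S",
}

def standardize(address):
    """Return the address as a list of upper-cased, standardized words."""
    # Upper-case, then turn punctuation such as "." and "-" into spaces.
    cleaned = re.sub(r"[^A-Z0-9 ]", " ", address.upper())
    # Replace each word with its generic form when the table has one.
    return [STANDARD_FORMS.get(word, word) for word in cleaned.split()]

print(standardize("300 North Shore Street"))
print(standardize("N. Shore St. - 300"))
```

Both calls yield the same four words, only in a different order.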
Then finally, since your example implies that 300 North Shore Street could
be coded as 300 N. Shore St. - what would prevent the same address from being
coded as "N. Shore Street - 300" and other such possibilities?
So instead of worrying about the order of the words in a particular
string, I approached the problem by ignoring the order. I standardize both
strings, count how many words there are to find (in this case - 4), and
then determine how many of those words are found in the other string (the
other address). Obviously 4 out of 4 would be a 100% match, but this will
not always be the case, which is something you will need to deal with.
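The order-independent count described above could be sketched as follows,
again in Python rather than SAS and only as an illustration under stated
assumptions. The ABBREV table, words_of() and match_score() are
hypothetical names, and the table holds just enough entries for the
example address.

```python
import re

# Hypothetical, minimal abbreviation table (assumption for illustration).
ABBREV = {"STREET": "ST", "NORTH": "N"}

def words_of(address):
    """Upper-case, strip punctuation, and standardize each word."""
    cleaned = re.sub(r"[^A-Z0-9 ]", " ", address.upper())
    return [ABBREV.get(word, word) for word in cleaned.split()]

def match_score(target, candidate):
    """Return (words found, words sought), ignoring word order."""
    target_words = words_of(target)
    candidate_words = set(words_of(candidate))
    found = sum(1 for word in target_words if word in candidate_words)
    return found, len(target_words)

# Same address, different order: every word is found.
print(match_score("300 North Shore Street", "N. Shore St. - 300"))
# Different street type: one of the four words is missing.
print(match_score("300 North Shore Street", "300 N. Shore Ave"))
```

What counts as a good enough score (say 3 out of 4) is a judgment call
and should be reviewed rather than acted on automatically.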
This is more of a common-sense point of view than a proven statistical
method, but then again my needs and desires were not to find a 100%
perfect methodology, as I presumed there was no such beast.
In any case, you are free to look at some of my sample code along with what
others have provided on the SAS Tips and Techniques section of the
www.sconsig.com website -
specifically you should look at tips tip00128, tip00128a, tip00000, and
then do a search while you are on the main Tips and Techniques page for
such things as "binary, fuzzy, table, lookup, cleanse, standard, match,
hash, etc" - many others on SAS-L have provided some excellent code
and possible solutions to problems such as the one you described.
HTH,
Charles Patridge
Email: Charles_S_Patridge@prodigy.net
Web: http://www.sconsig.com