LISTSERV at the University of Georgia
Date:         Mon, 16 Jun 2003 10:30:13 -0400
Reply-To:     Charles Patridge <charles_s_patridge@PRODIGY.NET>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Charles Patridge <charles_s_patridge@PRODIGY.NET>
Subject:      Re: Actively seeking algorithm to compare the "likeness" of two
              character strings
Comments: To: Susie Li <Susie.Li@US.SANOFI.COM>

Dear Susie Li,

First, let me say there are a number of commercial products on the market, including SAS Data Quality/Data Cleansing (Proc Match), which do an excellent job of matching addresses, including the "likeness" of addresses such as in your illustration.

Secondly, may I suggest you look at what I wrote in a paper for SAS Online Observation (Fuzzy Matching) - you can see this on my site as well as on SI's web site.

Thirdly, you are on the correct path in "being tired with exact matches" and in looking at various statistical methods to do this work for you. However, may I caution you against assuming that statistical methods produce 100% correct answers and then using those results in an automated fashion for other processing.

One approach I took with your problem was to look at many of the common variations of how numerous systems/databases/industries might try to store the word "street" and settle on one such generic variation - say "ST". I then do this for most of the US Postal Abbreviations of Street Addresses and build a generic standardization of all such street addresses.
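As a rough illustration of that standardization step, here is a minimal sketch in Python rather than SAS. It is not the code from my tips; the `STANDARD_FORMS` table and the `standardize` function are hypothetical names, and the table is deliberately tiny where a real one would cover far more variations.

```python
import re

# Hypothetical, deliberately incomplete lookup table: each known variant
# maps to one generic form. A real table would be much larger.
STANDARD_FORMS = {
    "STREET": "ST", "STR": "ST", "ST": "ST",
    "AVENUE": "AVE", "AV": "AVE", "AVE": "AVE",
    "ROAD": "RD", "RD": "RD",
    "NORTH": "N", "N": "N",
    "SOUTH": "S", "S": "S",
    "EAST": "E", "E": "E",
    "WEST": "W", "W": "W",
}

def standardize(address):
    """Upper-case the address, drop punctuation, and replace each word
    by its generic form when the table knows it."""
    words = re.findall(r"[A-Z0-9]+", address.upper())
    return [STANDARD_FORMS.get(word, word) for word in words]

print(standardize("300 North Shore Street"))
print(standardize("300 N. Shore St."))
```

Both calls should give the same list of words, which is the whole point of standardizing before any comparison is attempted.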

Then finally, since your example implies that 300 North Shore Street could be coded as 300 N. Shore St. - what would prevent the same address from being coded as "N. Shore Street - 300" and other such possibilities?

So instead of worrying about the order of the words in a particular string, I approached the problem by ignoring the order. I standardize both strings, count how many words there are to find (in this case, 4), and then determine how many of those words are found in the counterpart (another address). Obviously 4 out of 4 would be a 100% match, but this may not always be the case, which is something you will need to deal with.
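The order-independent count can be sketched roughly as follows, again in Python and again only as an illustration under assumptions: `ABBREVIATIONS`, `words_of`, and `match_score` are hypothetical names, and the two-entry table stands in for a full standardization list.

```python
import re

# Hypothetical two-entry table, just enough for the example address.
ABBREVIATIONS = {"NORTH": "N", "STREET": "ST"}

def words_of(address):
    """Split an address into upper-cased, standardized words."""
    tokens = re.findall(r"[A-Z0-9]+", address.upper())
    return [ABBREVIATIONS.get(token, token) for token in tokens]

def match_score(address, candidate):
    """Return (words found, words to find), ignoring word order."""
    wanted = words_of(address)
    available = words_of(candidate)
    found = 0
    for word in wanted:
        if word in available:
            # Use up the matched word so it cannot be counted twice.
            available.remove(word)
            found += 1
    return found, len(wanted)

print(match_score("300 North Shore Street", "N. Shore Street - 300"))
print(match_score("300 North Shore Street", "300 N. Shore Ave"))
```

The first comparison finds all 4 words despite the different order; the second finds 3 of 4, and what to do with such partial matches is the decision left to you.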

This is more of a common sense point of view than a proven statistical method, but then again my need was not to find a 100% perfect methodology, as I presumed there was no such beast.

In any case, you are free to look at some of my sample code along with what others have provided on the SAS Tips and Techniques section of the www.sconsig.com website -

specifically you should look at tips tip00128, tip00128a, tip00000, and then, while you are on the main Tips and Techniques page, do a search for such things as "binary", "fuzzy", "table", "lookup", "cleanse", "standard", "match", "hash", etc. Many others on SAS-L have provided some excellent code and possible solutions to problems such as the one you described.

HTH,
Charles Patridge
Email: Charles_S_Patridge@prodigy.net
Web: http://www.sconsig.com

