LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (April 2009, week 5)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:   Wed, 29 Apr 2009 22:00:14 -0400
Reply-To:   Sigurd Hermansen <HERMANS1@WESTAT.COM>
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   Sigurd Hermansen <HERMANS1@WESTAT.COM>
Subject:   Re: fuzzy match problem
Comments:   To: Dale McLerran <stringplayer_2@YAHOO.COM>
In-Reply-To:   <409590.18603.qm@web32406.mail.mud.yahoo.com>
Content-Type:   text/plain; charset="windows-1256"

Dale: Good point... We actually use a SAS Macro %SPEDIS() that incorporates symmetric evaluation as well as other minor adjustments. I chose in this situation to present a simpler method. S

-----Original Message----- From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of Dale McLerran Sent: Wednesday, April 29, 2009 6:13 PM To: SAS-L@LISTSERV.UGA.EDU Subject: Re: fuzzy match problem

It should be noted that the SPEDIS function is asymmetric which means that SPEDIS(var1,var2)=SPEDIS(var2,var1) is NOT TRUE for all values of var1, var2. Since the SPEDIS function returns a normalized cost for converting from var2 to var1 and if there is no a priori reason to believe that var1 is the "correct" string, then it may be advisable to compute the costs of going both directions and average the two costs.

Cost = mean( (1 - (length(compress(var1)) * spedis(compress(var1),compress(var2)) / 2400)), (1 - (length(compress(var2)) * spedis(compress(var2),compress(var1)) / 2400)));

Dale

--------------------------------------- Dale McLerran Fred Hutchinson Cancer Research Center mailto: dmclerra@NO_SPAMfhcrc.org Ph: (206) 667-2926 Fax: (206) 667-5977 ---------------------------------------

--- On Wed, 4/29/09, Sigurd Hermansen <HERMANS1@WESTAT.COM> wrote:

> From: Sigurd Hermansen <HERMANS1@WESTAT.COM> > Subject: Re: fuzzy match problem > To: SAS-L@LISTSERV.UGA.EDU > Date: Wednesday, April 29, 2009, 1:39 PM > Fuzzy matching and artificial > intelligence won't necessarily return the required results. All > methods currently in use have some likelihood of returning correct > results and some likelihood of returning an incorrect result. You > likely know that good linkage method has a fairly high likelihood of > the former and a relatively low likelihood of the latter. > > SAS provides several functions for fuzzy matching. I find a modified > version of SPEDIS() a good way to generate a match "score" for the > comparison of values of two variables. > > SPEDIS() computes a total cost of rearranging characters in one string > to match characters in another string. Simple rearrangements have a > small cost, and complex rearrangements have a high cost). This > expression computes a match score (the closer the match, the higher > the score) for the string values of the variables: > > > (1 - (length(compress(var1)) * > > > spedis(compress(var1),compress(var2)) / > > 2400)) > > The 2400 weight in the expression requires a very close > match for a score of 0.95 or higher. Those applying the expression to > pairs of strings select a cut-off that balances the costs of false > matches against costs of not finding correct matches. I wouldn't > expect too much from a first attempt at separating correct from false > matches. S > > -----Original Message----- > From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] > On Behalf Of Terry He > Sent: Wednesday, April 29, 2009 4:12 PM > To: SAS-L@LISTSERV.UGA.EDU > Subject: fuzzy match problem > > > I have two variables. I am trying to match one variable to another. > For example, one list has “10-K WIZARD TECHNOLOGY LLC and the > other has “10K WIZARD TECHNOLOGY LLC. The vlookup function in > excel will not necessarily return the required result in this case. > how could I do it in SAS? Here is some example data: > Var1 Var2 > 101 CALIFORNIA VENTURE @STAKE, INC > 10K WIZARD TECHNOLOGY LLC 10-K WIZARD > TECHNOLOGY LLC > 13D RESEARCH INC 1E LIMITED > 2008 MIECF 29WEST INC. > 2C COMERCIO E IMPORTACAO DE 3 TIER TECHNOLOGY > INC. > 2K ADVISORS LLC 33-6 CONSULTANCY LTD > 3 B CLIM 360 CONSULTING INC. > 3 REASONS LTD. 360 RELOCATIONS LIMITED > 3DADVISORS LLC 3SCOM Y.K. > 3V CAPITAL LIMITED 3T SYSTEMS, INC > 4 TABELIAO DE PROTESTO DE 4CAST LIMITED > 401K COMPANY 5B TECHNOLOGIES CORP > A G EDWARDS INC 6FIGUREJOBS.COM LLC > A V ARKANSAS 7 CITY LEARNING LIMITED > AAA LAUNDROMAT 9-20 RECRUITMENT LTD. > AAA RESEARCHONE FINANCIAL A. EPSTEIN & > SONS > INTERNATIONAL, INC. > ABATEX INDUSTRIA E COMERCIO A. PAPPAJOHN > COMPANY > ABG SUNDAL COLLIER INC A.S.A.INTERNATIONAL > HOLDINGS LIMITED > ABN AMRO HOLDING NV A1 EXPRESS DELIVERY > SERVICE INC. >


Back to: Top of message | Previous page | Main SAS-L page