| Date: | Wed, 29 Apr 2009 22:00:14 -0400 |
| Reply-To: | Sigurd Hermansen <HERMANS1@WESTAT.COM> |
| Sender: | "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU> |
| From: | Sigurd Hermansen <HERMANS1@WESTAT.COM> |
| Subject: | Re: fuzzy match problem |
| In-Reply-To: | <409590.18603.qm@web32406.mail.mud.yahoo.com> |
| Content-Type: | text/plain; charset="windows-1256" |
Dale:
Good point... We actually use a SAS Macro %SPEDIS() that incorporates symmetric evaluation as well as other minor adjustments. I chose in this situation to present a simpler method.
S
-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of Dale McLerran
Sent: Wednesday, April 29, 2009 6:13 PM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: fuzzy match problem
Note that the SPEDIS function is asymmetric: SPEDIS(var1,var2)=SPEDIS(var2,var1) does NOT hold for all values of var1 and var2. SPEDIS returns a normalized cost for converting var2 to var1, so if there is no a priori reason to believe that var1 is the "correct" string, it may be advisable to compute the cost in both directions and average the two results.
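/* Despite its name, Cost holds a match score (higher = closer match): */
/* the mean of the two directional scores.                             */
Cost = mean( (1 - (length(compress(var1)) *
                   spedis(compress(var1),compress(var2)) / 2400)),
             (1 - (length(compress(var2)) *
                   spedis(compress(var2),compress(var1)) / 2400)));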
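As a minimal sketch of how the expression above might be applied, the DATA step below computes both directional scores and their mean for two pairs taken from the example data in the original question. The dataset names (pairs, scored), the helper variables (s1, s2, score12, score21, score, is_match), and the 0.95 cut-off are illustrative assumptions, not part of the original posts.

```sas
/* Illustrative input: two pairs from the example data, '|' delimited. */
data pairs;
   infile datalines dlm='|' truncover;
   length var1 var2 $40;
   input var1 $ var2 $;
   datalines;
10K WIZARD TECHNOLOGY LLC|10-K WIZARD TECHNOLOGY LLC
13D RESEARCH INC|1E LIMITED
;
run;

data scored;
   set pairs;
   length s1 s2 $40;

   /* Remove blanks once so both directions compare the same strings. */
   s1 = compress(var1);
   s2 = compress(var2);

   /* Directional scores: SPEDIS is asymmetric, so compute both. */
   score12 = 1 - length(s1) * spedis(s1, s2) / 2400;
   score21 = 1 - length(s2) * spedis(s2, s1) / 2400;

   /* Symmetric score: mean of the two directions. */
   score = mean(score12, score21);

   /* Assumed cut-off; tune it against known matches and non-matches. */
   is_match = (score >= 0.95);
run;

proc print data=scored noobs;
   var var1 var2 score12 score21 score is_match;
run;
```

Keeping score12 and score21 in the output makes it easy to see how far the two directions disagree for a given pair before settling on a cut-off.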
Dale
---------------------------------------
Dale McLerran
Fred Hutchinson Cancer Research Center
mailto: dmclerra@NO_SPAMfhcrc.org
Ph: (206) 667-2926
Fax: (206) 667-5977
---------------------------------------
--- On Wed, 4/29/09, Sigurd Hermansen <HERMANS1@WESTAT.COM> wrote:
> From: Sigurd Hermansen <HERMANS1@WESTAT.COM>
> Subject: Re: fuzzy match problem
> To: SAS-L@LISTSERV.UGA.EDU
> Date: Wednesday, April 29, 2009, 1:39 PM
> Fuzzy matching and artificial intelligence won't necessarily return
> the required results. All methods currently in use have some
> likelihood of returning correct results and some likelihood of
> returning incorrect results. You likely know that a good linkage
> method has a fairly high likelihood of the former and a relatively
> low likelihood of the latter.
>
> SAS provides several functions for fuzzy matching. I find a modified
> version of SPEDIS() a good way to generate a match "score" for the
> comparison of values of two variables.
>
> SPEDIS() computes a total cost of rearranging characters in one string
> to match characters in another string. Simple rearrangements have a
> small cost, and complex rearrangements have a high cost. This
> expression computes a match score (the closer the match, the higher
> the score) for the string values of the variables:
>
>    (1 - (length(compress(var1)) *
>          spedis(compress(var1),compress(var2)) / 2400))
>
> The 2400 weight in the expression requires a very close
> match for a score of 0.95 or higher. Those applying the expression to
> pairs of strings select a cut-off that balances the costs of false
> matches against costs of not finding correct matches. I wouldn't
> expect too much from a first attempt at separating correct from false
> matches.
> S
>
> -----Original Message-----
> From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU]
> On Behalf Of Terry He
> Sent: Wednesday, April 29, 2009 4:12 PM
> To: SAS-L@LISTSERV.UGA.EDU
> Subject: fuzzy match problem
>
>
> I have two variables. I am trying to match one variable to another.
> For example, one list has "10-K WIZARD TECHNOLOGY LLC" and the
> other has "10K WIZARD TECHNOLOGY LLC". The VLOOKUP function in
> Excel will not necessarily return the required result in this case.
> How could I do it in SAS? Here is some example data:
> Var1                          Var2
> 101 CALIFORNIA VENTURE        @STAKE, INC
> 10K WIZARD TECHNOLOGY LLC     10-K WIZARD TECHNOLOGY LLC
> 13D RESEARCH INC              1E LIMITED
> 2008 MIECF                    29WEST INC.
> 2C COMERCIO E IMPORTACAO DE   3 TIER TECHNOLOGY INC.
> 2K ADVISORS LLC               33-6 CONSULTANCY LTD
> 3 B CLIM                      360 CONSULTING INC.
> 3 REASONS LTD.                360 RELOCATIONS LIMITED
> 3DADVISORS LLC                3SCOM Y.K.
> 3V CAPITAL LIMITED            3T SYSTEMS, INC
> 4 TABELIAO DE PROTESTO DE     4CAST LIMITED
> 401K COMPANY                  5B TECHNOLOGIES CORP
> A G EDWARDS INC               6FIGUREJOBS.COM LLC
> A V ARKANSAS                  7 CITY LEARNING LIMITED
> AAA LAUNDROMAT                9-20 RECRUITMENT LTD.
> AAA RESEARCHONE FINANCIAL     A. EPSTEIN & SONS INTERNATIONAL, INC.
> ABATEX INDUSTRIA E COMERCIO   A. PAPPAJOHN COMPANY
> ABG SUNDAL COLLIER INC        A.S.A.INTERNATIONAL HOLDINGS LIMITED
> ABN AMRO HOLDING NV           A1 EXPRESS DELIVERY SERVICE INC.
>