Date: Sat, 30 Aug 2008 15:34:49 -0400
Reply-To: msz03@albany.edu
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Mike Zdeb <msz03@ALBANY.EDU>
Subject: Re: Sounds Like...
Content-Type: text/plain;charset=iso-8859-1
hi ... in case anyone would like to try a SOUNDEX alternative, NYSIIS coding is another phonetic
based coding scheme
there is no NYSIIS function in SAS, but a NYSIIS macro is available
http://www.albany.edu/~msz03/basug/nysiis.zip
I originally found the macro on the SAMHSA web site
http://csat.samhsa.gov/
it seems to have disappeared from that site, but it's referenced in both of the following
Linking Client Records from Substance Abuse, Mental Health and Medicaid State Agencies
http://csat.samhsa.gov/IDBSE/idb/modules/linking/material/linking.pdf
Record Linking 102
http://csat.samhsa.gov/IDBSE/idb/modules/linking/presentations/RecordLinking102_alt.ppt
here's a SOUNDEX/NYSIIS comparison (remember the original posting compared GREG and GREGORY ...
they are still different with the NYSIIS CODE)
data names;
input name : $15. @@;
soundex=soundex(name);
%nysiis(name,nysiis_code);
datalines;
SMITH SMYTHE O'HENRY OHENRY ZDEB ZDYB SOO SU CLINTON CLINTEN
WASHINGTON WOSHINGTEN TERRI TERRY JACKSON JAKSON SCHMIDT SCHMIT KALE CALE
GREG GREGORY
;
run;
nysiis_
name soundex code
SMITH S53 SNATH
SMYTHE S53 SNAT
O'HENRY O56 O'ANR
OHENRY O56 OANR
ZDEB Z31 ZDAB
ZDYB Z31 ZDAB
SOO S S
SU S S
CLINTON C4535 CLANTAN
CLINTEN C4535 CLANTAN
WASHINGTON W25235 WASANGTAN
WOSHINGTEN W25235 WASANGTAN
TERRI T6 TAR
TERRY T6 TAR
JACKSON J25 JACSAN
JAKSON J25 JACSAN
SCHMIDT S53 SNAD
SCHMIT S53 SNAT
KALE K4 CAL
CALE C4 CAL
GREG G62 GRAG
GREGORY G626 GRAGAR
if you use either SOUNDEX or NYSIIS coding, it's nice to know the coding rules
http://en.wikipedia.org/wiki/Soundex
http://en.wikipedia.org/wiki/NYSIIS
(Wikipedia does actually have them correct)
ps re Paul Choate's posting ... great to have both "DDS" and "Extraction" in the same
signature, "DDS Data Extraction" ... so, is there also a "DDS Data Root Canal"
--
Mike Zdeb
U@Albany School of Public Health
One University Place
Rensselaer, New York 12144-3456
P/518-402-6479 F/630-604-1475
> My guess would be on the second the SQL processor computes the soundex
> scores and then matches, hence a optimized join, while the first matches
> and then computes the soundex scores, hence a Cartesian join.
>
> As expected the second runs much more quickly. Interesting point Mike.
>
> Paul Choate
> DDS Data Extraction
> (916) 654-2160
>
> -----Original Message-----
> From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of
> Mike Zdeb
> Sent: Friday, August 29, 2008 11:33 AM
> To: SAS-L@LISTSERV.UGA.EDU
> Subject: Re: Sounds Like...
>
> hi ... so if the SOUNDS LIKE operator compares the SOUNDEX of variables,
> I don't understand why
> the first use of PROC SQL below produces the NOTE
>
> NOTE: The execution of this query involves performing one or more
> Cartesian product joins that can
> not be optimized.
>
> while the second on does not
>
> can any of you SQL folk explain? thanks
>
>
> data a;
> input name1 : $10. @@;
> datalines;
> SMITH SMYTHE O'HENRY OHENRY ZDEB ZDYB SOO SU CLINTON CLINTEN
> ;
> run;
>
> data b;
> set a (rename=(name1=name2));
> run;
>
> proc sql;
> create table matches as
> select *
> from a,b
> where name1 =* name2;
> quit;
>
> proc sql;
> create table matches as
> select *
> from a,b
> where soundex(name1) = soundex(name2);
> quit;
>
>
> --
> Mike Zdeb
> U@Albany School of Public Health
> One University Place
> Rensselaer, New York 12144-3456
> P/518-402-6479 F/630-604-1475
>
>