Date: Mon, 19 Mar 2007 18:09:39 -0700
Reply-To: David L Cassell <davidlcassell@MSN.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: David L Cassell <davidlcassell@MSN.COM>
Subject: Re: Subsetting data based on similar sounding names
In-Reply-To: <200703161520.l2GAkLHx029218@malibu.cc.uga.edu>
Content-Type: text/plain; format=flowed
gerhard.hellriegel@T-ONLINE.DE replied:
>
>
>On Fri, 16 Mar 2007 10:09:23 -0400, souga soga <souga1234@GMAIL.COM> wrote:
>
> >I apologize for not explaining the task clearly. I need the output to look
> >like this:
> >
> >Anthony Tamar
> >Anthony V Tamar
> >paul V king
> >paul king
> >
> >Essentially the task is to output all the names that sound similar to any
> >other name in the dataset.
> >
> >I am hoping that someone could help me with this.
> >
> >Thanks again.
> >Sa Polo
> >
> >
> >On 3/16/07, Gerhard Hellriegel <gerhard.hellriegel@t-online.de> wrote:
> >>
> >> Sorry, but could you explain what in "paul" sounds like "Anthony"?? Ok,
> >> my English is bad, but that I don't see!
> >> Gerhard
> >>
> >>
> >> On Thu, 15 Mar 2007 16:51:27 -0400, souga soga <souga1234@GMAIL.COM>
> >> wrote:
> >>
> >> >Thanks, but I need only the first 4 observations as they are similar in
> >> >the output set and they do not have to be cleaned.
> >> >
> >> >On 3/15/07, Dominc Mitchell <mitchell.d@videotron.ca> wrote:
> >> >>
> >> >>
> >> >>
> >> >> Hi,
> >> >>
> >> >> That would work with your example. It only uses the first and last name.
> >> >> But if your data set has more complex comparisons (e.g. typos in names),
> >> >> then you would need something more elaborate.
> >> >>
> >> >> Dominic.
> >> >>
> >> >> data x;
> >> >> length name $100;
> >> >> name="Anthony Tamar" ;output;
> >> >> name="Anthony V Tamar" ;output;
> >> >> name="paul V king" ;output;
> >> >> name="paul king" ;output;
> >> >> name="moon park";output;
> >> >> name="thomas li";output;
> >> >> run;
> >> >>
> >> >>
> >> >> data test;
> >> >> set x;
> >> >> name1=prxchange('s/^([a-z]+).*\s([a-z]+)/$1 $2/i',-1,name);
> >> >> proc print;
> >> >> run;
> >> >>
> >> >>
> >> >>
> >> >> -----Original Message-----
> >> >> From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of souga soga
> >> >> Sent: Thursday, March 15, 2007 16:04
> >> >> To: SAS-L@LISTSERV.UGA.EDU
> >> >> Subject: Subsetting data based on similar sounding names
> >> >>
> >> >> I have a dataset which has similar names
> >> >>
> >> >> data x;
> >> >> name="Anthony Tamar" ;output;
> >> >> name="Anthony V Tamar" ;output;
> >> >> name="paul V king" ;output;
> >> >> name="paul king" ;output;
> >> >> name="moon park";output;
> >> >> name="thomas li";output;
> >> >> run;
> >> >>
> >> >> I would like to spit out all names that appear to be the same, i.e.
> >> >> rows 1 through 4.
> >> >>
> >> >> Thanks as always,
> >> >> Sa
> >> >>
> >> >>
> >>
>
>For this example the following works:
>
>data x;
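> /* Give name an explicit length; otherwise the first assignment fixes it at 13 characters. */
> length name $100 s $4;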
> name="Anthony Tamar" ; s=soundex(name); s2=substr(s,1,2); output;
> name="Anthony V Tamar" ; s=soundex(name); s2=substr(s,1,2); output;
> name="paul V king" ; s=soundex(name); s2=substr(s,1,2); output;
> name ="paul king"; s=soundex(name) ; s2=substr(s,1,2); output;
> name="moon park"; s=soundex(name); s2=substr(s,1,2); output;
> name="thomas li"; s=soundex(name); s2=substr(s,1,2); output;
>run;
>
>proc sort;
> by s2;
>run;
>
>data dups;
> set x;
> by s2;
> flag=first.s2;
> if not flag;
>run;
>
>data result;
> merge dups (in=ok)
> x;
> by s2;
> if ok;
>run;
>
>That uses only the first 2 chars from soundex(), which might not be much,
>so some names which do not really sound similar could be selected.
>In that case one could try to make the selection a bit more sophisticated.
>Perhaps the 2 leading chars must match and the next two digits must differ
>only slightly.
>Just play a bit with that, maybe it works for you.
>Gerhard
>
Since I'm busy kvetching about the soundex algorithm, I really don't think
that this is a good approach.
First, remember that SOUNDEX() starts with the first letter of the word.
Then it pitches out all subsequent occurrences of vowels, or H or W or Y
(yes, I know, these three are vowels or vowel-like in at least one English-
related language). Then it lumps the remaining consonants into only 6
classes, two of which are a single letter (R and L have their own classes).
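So the names 'Tacke' and 'Tugumundu' and 'Tesselin' and 'Tazman' and
'Taxation' and 'Taqueria' will all come out as 'T2' in the above scheme,
which keeps only the first two characters of the code (s2). Every one of
them would be flagged as "sounding like" the others. That is probably not
desirable.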
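
To make the collision concrete, here is a minimal sketch. The dataset name
COLLIDE and the DATALINES layout are my own choices for illustration; s and
s2 follow the variable names used in the quoted code. It assumes SOUNDEX()
puts C, G, S, Z, X and Q into the same class, as described, so s2 should
print as 'T2' on every row.

```sas
/* Illustrative only: COLLIDE is a made-up dataset name. */
data collide;
   length name $20 s $4 s2 $2;
   input name $;
   s  = soundex(name);    /* full code, truncated to 4 characters by LENGTH */
   s2 = substr(s, 1, 2);  /* the two-character key used for grouping */
   datalines;
Tacke
Tugumundu
Tesselin
Tazman
Taxation
Taqueria
;
run;

proc print data=collide noobs;
   var name s s2;
run;
```

Grouping BY s2 on this data would treat all six names as one set of
"similar" names, which is the problem with so short a key.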
HTCT,
David
--
David L. Cassell
mathematical statistician
Design Pathways
3115 NW Norwood Pl.
Corvallis OR 97330