Date: Wed, 8 Sep 2010 13:28:12 -0700
Reply-To: "Raffe, Sydelle, SSA" <DRaffe@acgov.org>
Sender: "SPSSX(r) Discussion" <SPSSX-L@LISTSERV.UGA.EDU>
From: "Raffe, Sydelle, SSA" <DRaffe@acgov.org>
Subject: Re: Name Normalization
In-Reply-To: <6080BB245C48A04BA756C983ED335EC175D91B0224@EMSCM012.sagemsmrd01.sa.gov.au>
How about address normalization software?
________________________________
From: SPSSX(r) Discussion [mailto:SPSSX-L@LISTSERV.UGA.EDU] On Behalf Of Barnett, Adrian (DECS)
Sent: Sunday, September 05, 2010 9:43 PM
To: SPSSX-L@LISTSERV.UGA.EDU
Subject: Re: Name Normalization
Hi Kevan
If your file has <10,000 records you can use the free version of LinkageWiz to do the de-duplication for you.
It has a built-in table listing the variations on each name (it refers to these as "nicknames"), which it uses as part of its fuzzy matching routine. It also uses a NYSIIS or a Soundex match, and optionally will use a string similarity measure as well. Exact matches, nickname matches and phonetic matches each get a different weight in computing a match score.
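To make the weighting idea concrete, here is a rough Python sketch of a score that distinguishes exact, nickname and phonetic matches. It is only an illustration of the general technique, not LinkageWiz's actual routine: the weights, the tiny NICKNAMES table and the function names are all made up for the example, and only a plain Soundex is used for the phonetic step.

```python
# Hypothetical nickname table and weights, for illustration only.
NICKNAMES = {
    "bill": "william", "billy": "william", "willy": "william",
    "rob": "robert", "bob": "robert", "bobby": "robert",
    "robby": "robert", "robbie": "robert",
}
WEIGHTS = {"exact": 1.0, "nickname": 0.8, "phonetic": 0.5}

# Standard American Soundex digit groups.
SOUNDEX_CODES = {}
for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                     ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in group:
        SOUNDEX_CODES[ch] = digit


def soundex(name):
    """Return the 4-character Soundex code of a name ('' if no letters)."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # vowels reset prev, so a repeated code is kept
    return (result + "000")[:4]


def match_score(a, b):
    """Score two first names: exact > nickname > phonetic > no match."""
    a, b = a.strip().lower(), b.strip().lower()
    if a == b:
        return WEIGHTS["exact"]
    if NICKNAMES.get(a, a) == NICKNAMES.get(b, b):
        return WEIGHTS["nickname"]
    if soundex(a) and soundex(a) == soundex(b):
        return WEIGHTS["phonetic"]
    return 0.0
```

In a real linkage run a score like this would be one component among several (surname, date of birth, address and so on), with the weights tuned to the data rather than picked by hand.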
You can find out more at http://www.linkagewiz.com/
If you have more than 10,000 cases, I'd recommend looking at FEBRL, which is free. You can find out more about FEBRL here:
http://datamining.anu.edu.au/software/febrl/febrldoc/
What you are trying to do can in principle be done in SPSS (sort of), but it would be very hard to do it well, and would probably take more time than you have.
Adrian Barnett
Project Officer
Educational Measurement and Analysis
Data and Educational Measurement
DECS
ph 82261080
________________________________
From: SPSSX(r) Discussion [mailto:SPSSX-L@LISTSERV.UGA.EDU] On Behalf Of Edwards, Kevan (MDH)
Sent: Friday, 3 September 2010 4:40 AM
To: SPSSX-L@LISTSERV.UGA.EDU
Subject: Name Normalization
Hello all...
Is anyone aware of a process, or a data file, that can be used to normalize first names?
My goal is to be able to de-duplicate a data file that was put together from several sources of data by converting all instances of Bill, Billy, Willy, William to William and all instances of Rob, Bob, Bobby, Robby, Robbie, Robert to Robert.
I envision using "IF" "THEN" syntax structures pointing to a data file with two variables, first the specific instance of the first name and second the normalized (standardized) format of that name.
However, I need to find a data file with common variations and a normalized version of first names, and I haven't been able to find one to assist in automating this process.
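For illustration, the two-variable lookup described above might look something like the following Python sketch. The table holds only the names mentioned in this message; a real variant file would be far larger, and the function names are invented for the example.

```python
# Two-column lookup: name variant -> normalized form.
# Only the variants mentioned in this message; a real table would be larger.
NAME_LOOKUP = {
    "bill": "William", "billy": "William", "willy": "William",
    "william": "William",
    "rob": "Robert", "bob": "Robert", "bobby": "Robert",
    "robby": "Robert", "robbie": "Robert", "robert": "Robert",
}


def normalize_first_name(name):
    """Return the normalized form, or the trimmed name if it is not listed."""
    cleaned = name.strip()
    return NAME_LOOKUP.get(cleaned.lower(), cleaned)


def deduplicate(records):
    """Keep the first record for each (normalized first name, last name)."""
    seen = set()
    unique = []
    for first, last in records:
        key = (normalize_first_name(first), last.strip())
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique
```

For example, deduplicate([("Bill", "Smith"), ("William", "Smith"), ("Rob", "Jones")]) would keep one William Smith and one Robert Jones. Exact lookup like this cannot catch misspellings, so it only covers variants that appear in the table.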
Thanks.
Kevan
----------------------------
Kevan Edwards Ph.D.
Research Scientist III
Health Economics Program, DHP/MDH
651-201-3551