Date: Tue, 2 Jun 2009 09:52:49 -0500
Reply-To: "Peck, Jon" <peck@spss.com>
Sender: "SPSSX(r) Discussion" <SPSSX-L@LISTSERV.UGA.EDU>
From: "Peck, Jon" <peck@spss.com>
Subject: Re: question on 'fuzzy matching'
In-Reply-To: A<H00005df02ae6375.1243951096.masstatepolice.pol.state.ma.us@MHS>
Content-Type: text/plain; charset="US-ASCII"
Various specialized packages deal with name discrepancies and similar noise in records. The FUZZY extension command can match data sources with some fuzz, but the fuzz applies only to numeric variables.
However, two functions available via programmability may help with name mismatches. The soundex and nysiis functions in the extendedTransforms.py module encode names so that the effect of spelling variations is minimized.
The module can be downloaded from SPSS Developer Central (www.spss.com/devcentral).
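To give a rough idea of what a phonetic code does, here is a minimal standalone sketch of the standard American Soundex rules in plain Python. It is not the code from extendedTransforms.py, whose implementation and call signature may differ; the function name soundex_sketch is made up for this illustration.

```python
# Illustrative only: a plain-Python version of standard American Soundex.
# This is NOT the extendedTransforms.py implementation.

# Consonant groups and the digit each group maps to.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex_sketch(name):
    """Return a 4-character Soundex code, or "" if name has no A-Z letters."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")  # the first letter's code also suppresses repeats
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")  # vowels and Y have no code
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code after it counts again
    return (result + "000")[:4]  # pad with zeros, truncate to 4 characters


for a, b in (("Smith", "Smyth"), ("Robert", "Rupert")):
    print(a, soundex_sketch(a), b, soundex_sketch(b))
```

Names that share a code are only candidates for a match. Unrelated names can collide, and a misspelled first letter changes the code, so the result still needs review.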
Converting the date to a regular SPSS date variable might be useful in finding gross errors (the Data Validation option can help here).
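Outside of SPSS, the same kind of gross-error check on an 8-character 'YYYYMMDD' string can be sketched in plain Python as below. The function name parse_dob and the default plausibility window (1880 through today) are assumptions made for this example, not anything SPSS defines.

```python
# Illustrative only: flag DOB strings that cannot be a real, plausible date.
from datetime import date, datetime


def parse_dob(text, earliest=date(1880, 1, 1), latest=None):
    """Return a date for a valid 'YYYYMMDD' string, or None for a gross error."""
    if latest is None:
        latest = date.today()
    text = text.strip()
    if len(text) != 8 or not text.isdigit():
        return None  # wrong length or non-numeric characters
    try:
        parsed = datetime.strptime(text, "%Y%m%d").date()
    except ValueError:
        return None  # impossible calendar date, e.g. month 13 or February 30
    if not (earliest <= parsed <= latest):
        return None  # outside the assumed plausible range
    return parsed


for raw in ("19750614", "19751314", "19750230", "1975061"):
    print(raw, parse_dob(raw))
```

A check like this catches only values that are impossible or implausible. A transposition that yields another valid date (for example, 19750612 entered as 19750621) passes through and has to be caught by comparing records.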
A full solution, though, requires specialized software unless you want to spend a lot of time coding rules.
HTH,
Jon Peck
-----Original Message-----
From: SPSSX(r) Discussion [mailto:SPSSX-L@LISTSERV.UGA.EDU] On Behalf Of Bibel, Daniel
Sent: Tuesday, June 02, 2009 7:58 AM
To: SPSSX-L@LISTSERV.UGA.EDU
Subject: [SPSSX-L] question on 'fuzzy matching'
I have a file with about 50,000 records of individuals. Records may
have been entered by different agencies at different times, and of
course there are misspellings, transpositions of numbers (in date of
birth, ss#, etc.)
I would like to be able to do some sort of 'fuzzy matching' to find
potential duplicate cases where there may be some potential mistakes in
data entry for name and dob. 'Name' is a string field, and 'DOB' is
also an 8 character string field in the format of 'YYYYMMDD'.
Thanks for any help and suggestions.
Daniel Bibel
Massachusetts State Police
Crime Reporting Unit
=====================
To manage your subscription to SPSSX-L, send a message to
LISTSERV@LISTSERV.UGA.EDU (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
=====================