LISTSERV at the University of Georgia
Date:         Tue, 2 Jun 2009 09:52:49 -0500
Reply-To:     "Peck, Jon" <peck@spss.com>
Sender:       "SPSSX(r) Discussion" <SPSSX-L@LISTSERV.UGA.EDU>
From:         "Peck, Jon" <peck@spss.com>
Subject:      Re: question on 'fuzzy matching'
Comments: To: "Bibel, Daniel" <Daniel.Bibel@state.ma.us>
In-Reply-To:  A<H00005df02ae6375.1243951096.masstatepolice.pol.state.ma.us@MHS>
Content-Type: text/plain; charset="US-ASCII"

There are various specialized packages that deal with name discrepancies and similar noise in records. While the FUZZY extension command can match data sources with some fuzz, the fuzz works only for numeric variables.

However, there are two functions available via programmability that may help with name mismatches. The soundex and nysiis functions in the extendedTransforms.py module can code names in such a way that differences due to spelling variation are minimized. The module can be downloaded from SPSS Developer Central (www.spss.com/devcentral).

Converting the date to a regular SPSS date variable might be useful in finding gross errors (the Data Validation option can help here). A full solution, though, requires specialized software unless you want to spend a lot of time coding rules.

HTH,
Jon Peck
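As a rough illustration of the phonetic coding and date checking ideas in this reply, here is a minimal standalone Python sketch. It implements standard American Soundex and a sanity check for 'YYYYMMDD' strings. It is not the code from extendedTransforms.py, and the function names (soundex, parse_dob, candidate_duplicates) are hypothetical, chosen only for this example.

```python
from collections import defaultdict
from datetime import date, datetime

# Letter-to-digit table for American Soundex.
_CODES = {}
for _digit, _letters in enumerate(
        ["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1):
    for _ch in _letters:
        _CODES[_ch] = str(_digit)


def soundex(name):
    """Return the 4-character Soundex code of a name ('' if no letters)."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeat after it is coded
    return (result + "000")[:4]


def parse_dob(text):
    """Parse a 'YYYYMMDD' string; return None for impossible dates."""
    text = text.strip()
    if len(text) != 8 or not text.isdigit():
        return None
    try:
        return datetime.strptime(text, "%Y%m%d").date()
    except ValueError:
        return None


def candidate_duplicates(names):
    """Group names by Soundex code and keep groups with several members."""
    groups = defaultdict(list)
    for n in names:
        groups[soundex(n)].append(n)
    return {code: members for code, members in groups.items()
            if len(members) > 1}
```

For example, soundex("Smith") and soundex("Smyth") both give "S530", so the two spellings land in the same candidate group, while parse_dob("19800231") returns None because February 31 does not exist. Names that share a code are only candidates for review, not confirmed duplicates.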

-----Original Message-----
From: SPSSX(r) Discussion [mailto:SPSSX-L@LISTSERV.UGA.EDU] On Behalf Of Bibel, Daniel
Sent: Tuesday, June 02, 2009 7:58 AM
To: SPSSX-L@LISTSERV.UGA.EDU
Subject: [SPSSX-L] question on 'fuzzy matching'

I have a file with about 50,000 records of individuals. Records may have been entered by different agencies at different times, and of course there are misspellings and transpositions of numbers (in date of birth, ss#, etc.).

I would like to be able to do some sort of 'fuzzy matching' to find potential duplicate cases where there may be mistakes in data entry for name and dob. 'Name' is a string field, and 'DOB' is an 8-character string field in the format 'YYYYMMDD'.

Thanks for any help and suggestions.

Daniel Bibel
Massachusetts State Police
Crime Reporting Unit

===================== To manage your subscription to SPSSX-L, send a message to LISTSERV@LISTSERV.UGA.EDU (not to SPSSX-L), with no body text except the command. To leave the list, send the command SIGNOFF SPSSX-L For a list of commands to manage subscriptions, send the command INFO REFCARD
