LISTSERV at the University of Georgia
Date:         Tue, 8 Jul 2008 12:42:34 -0700
Reply-To:     Dennis Deck <DDeck@rmccorp.com>
Sender:       "SPSSX(r) Discussion" <SPSSX-L@LISTSERV.UGA.EDU>
From:         Dennis Deck <DDeck@rmccorp.com>
Subject:      Re: fuzzy data matching
Content-Type: text/plain; charset="iso-8859-1"

Note that the Link Plus package offered by CDC that Adrian Barnett mentioned is a) free, b) readily available for download, c) easy to use, d) flexible, and e) does an excellent job. Odds are good that this will fit your needs. And it will handle the problem of matching names.

A colleague compared it to a package he developed in SAS, which incorporates both deterministic and probabilistic methods, and found that Link Plus held up very well. See: Campbell, Deck, Krupski (2008) Record linkage software in the public domain: a comparison of Link Plus, The Link King, and a 'basic' deterministic algorithm. Health Informatics Journal, Vol. 14, No. 1, 5-15. He has made the Link King package available (http://www.the-link-king.com/) free of charge as well, but note that it requires SAS. (I considered writing an SPSS version but quickly realized it would require links to database software to work and would likely be slow.) One advantage of Link King is some added attention to linking names in the deterministic routines.
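For readers unfamiliar with the two approaches, here is a minimal sketch of the difference. It is a generic illustration, not the algorithm used by Link Plus or The Link King. The field names and the m/u probabilities are hypothetical, and the scoring is the standard textbook sum of log agreement weights.

```python
import math

# Hypothetical linkage fields shared by both files.
FIELDS = ("last_name", "first_name", "dob")

# Hypothetical (m, u) probabilities per field:
#   m = P(field agrees | records are a true match)
#   u = P(field agrees | records are not a match)
M_U = {
    "last_name": (0.95, 0.01),
    "first_name": (0.90, 0.05),
    "dob": (0.98, 0.003),
}

def deterministic_match(a, b):
    """Deterministic rule: link only if every field agrees exactly."""
    return all(a[f] == b[f] for f in FIELDS)

def probabilistic_score(a, b):
    """Sum of log2 agreement/disagreement weights across fields."""
    score = 0.0
    for f in FIELDS:
        m, u = M_U[f]
        if a[f] == b[f]:
            score += math.log2(m / u)              # agreement adds weight
        else:
            score += math.log2((1 - m) / (1 - u))  # disagreement subtracts
    return score

rec_a = {"last_name": "Smith", "first_name": "Ann", "dob": "1970-02-14"}
rec_b = {"last_name": "Smith", "first_name": "Anne", "dob": "1970-02-14"}

print(deterministic_match(rec_a, rec_b))
print(round(probabilistic_score(rec_a, rec_b), 2))
```

With these made-up values, the spelling difference in the first name defeats the exact-match rule, while the probabilistic score stays positive because the other two fields agree.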

We have just started using Link Plus for a particular application and were impressed by its flexibility. We have found it easy to exchange data with SPSS (we recommend writing the files to be linked in comma- or tab-separated value format, with variable names in the first row). My only complaint is that the package would benefit from a separate manual that goes into more depth on the linkage options (there is decent online help but no manual).
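As an illustration of the file layout described above (delimited text with variable names in the first row), here is a minimal Python sketch. The records, field names, and the helper `write_for_linkage` are hypothetical. In practice the file would more likely be exported directly from SPSS.

```python
import csv
import io

def write_for_linkage(records, fieldnames, handle, delimiter=","):
    """Write records as delimited text with variable names in the first row."""
    writer = csv.DictWriter(
        handle, fieldnames=fieldnames, delimiter=delimiter, lineterminator="\n"
    )
    writer.writeheader()       # first row: variable names
    writer.writerows(records)  # one row per record

# Hypothetical example records.
records = [
    {"id": 1, "last_name": "Smith", "first_name": "Ann", "dob": "1970-02-14"},
    {"id": 2, "last_name": "O'Neil", "first_name": "Pat", "dob": "1965-11-03"},
]

buffer = io.StringIO()  # stands in for a file opened with open(path, "w", newline="")
write_for_linkage(records, ["id", "last_name", "first_name", "dob"], buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing `delimiter="\t"` gives the tab-separated variant.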

Any linkage effort will only be as good as the variables the files have in common, and you list only 4. Regardless of the package selected, you will need to judge where to draw the line: any probabilistic software package will provide a continuum of matches from very good to questionable and will represent this with a score. You will need to decide where to set the cutoff for your particular data set, balancing the trade-off between increasing the number of matches and the risk of incorrect matches. It seems a low cutoff will work in your situation, as one file is a subset of the other.
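To make the cutoff decision concrete, here is a small sketch that applies several cutoffs to the same list of scored candidate pairs. The pair IDs and scores are invented for illustration, and the scale does not correspond to the output of any particular package.

```python
def split_by_cutoff(scored_pairs, cutoff):
    """Split (id_a, id_b, score) tuples into accepted and rejected lists."""
    accepted = [p for p in scored_pairs if p[2] >= cutoff]
    rejected = [p for p in scored_pairs if p[2] < cutoff]
    return accepted, rejected

# Hypothetical candidate pairs with linkage scores, best first.
scored_pairs = [
    ("A1", "B7", 24.1),  # very good match
    ("A2", "B3", 15.8),
    ("A3", "B9", 9.2),
    ("A4", "B2", 3.5),   # questionable match
]

# A lower cutoff accepts more pairs, including more doubtful ones.
for cutoff in (20.0, 10.0, 5.0):
    accepted, rejected = split_by_cutoff(scored_pairs, cutoff)
    print(f"cutoff {cutoff}: {len(accepted)} accepted, {len(rejected)} rejected")
```

Reviewing by hand a sample of pairs just above and just below a candidate cutoff is one way to judge whether the line is in the right place for a given data set.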

Dennis Deck, PhD
RMC Research Corporation
111 SW Columbia Street, Suite 1200
Portland, Oregon 97201-5843
voice: 503-223-8248 x715
voice: 800-788-1887 x715
fax: 503-223-8248
ddeck@rmccorp.com
