Date: Tue, 8 Jul 2008 12:42:34 -0700
Reply-To: Dennis Deck <DDeck@rmccorp.com>
Sender: "SPSSX(r) Discussion" <SPSSX-L@LISTSERV.UGA.EDU>
From: Dennis Deck <DDeck@rmccorp.com>
Subject: Re: fuzzy data matching
Note that the Link Plus package offered by CDC, which Adrian Barnett mentioned, is a) free, b) readily available for download, c) easy to use, d) flexible, and e) excellent at what it does. Odds are good that it will fit your needs, and it will handle the problem of matching names.
A colleague compared it to a package he developed in SAS, which incorporates both deterministic and probabilistic methods, and found that Link Plus held up very well. See: Campbell, Deck, Krupski (2008) Record linkage software in the public domain: a comparison of Link Plus, The Link King, and a 'basic' deterministic algorithm. Health Informatics Journal, Vol. 14, No. 1, 5-15. He has made The Link King package available free of charge as well (http://www.the-link-king.com/), but note that it requires SAS. (I considered writing an SPSS version but quickly realized it would require links to database software to work and would likely be slow.) One advantage of The Link King is some added attention to linking names in the deterministic routines.
We have just started using Link Plus for a particular application and were impressed by its flexibility. We have found it easy to exchange data with SPSS (I recommend writing the files to be linked in comma- or tab-separated value format, with variable names in the first row). My only complaint is that the package would benefit from a separate manual that goes into a bit more depth on the linkage options (there is decent online help but no manual).
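To illustrate the file layout recommended above, here is a minimal Python sketch. Python is used purely for illustration, and the variable names and records are hypothetical, not taken from anyone's actual data. It writes a delimited file with the variable names in the first row; pass delimiter="\t" for a tab-separated file.

```python
import csv
import io


def write_delimited(handle, fieldnames, records, delimiter=","):
    """Write records as delimited text with variable names in the first row."""
    writer = csv.DictWriter(
        handle, fieldnames=fieldnames, delimiter=delimiter, lineterminator="\n"
    )
    writer.writeheader()       # first row: variable names
    writer.writerows(records)  # one row per record


# Hypothetical linkage variables and records, for illustration only.
fields = ["id", "last_name", "first_name", "birth_date"]
records = [
    {"id": 1, "last_name": "O'Brien", "first_name": "Pat", "birth_date": "1970-03-14"},
    {"id": 2, "last_name": "Smith, Jr.", "first_name": "Lee", "birth_date": "1982-11-02"},
]

buffer = io.StringIO()  # stands in for a real file opened with newline=""
write_delimited(buffer, fields, records)
print(buffer.getvalue())
```

The csv module quotes any value that contains the delimiter (such as "Smith, Jr." above), which matters for name fields in a comma-separated file.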
Any linkage effort will only be as good as the variables the files have in common, and you list only four. Regardless of the package selected, you will need to judge where to draw the line: any probabilistic software package will provide a continuum of matches from very good to questionable and will represent this with a score. You will need to decide where to set the cutoff for your particular data set, balancing the trade-off between increasing the number of matches and the risk of incorrect matches. It seems likely that a low cutoff will work in your situation, as one file is a subset of the other.
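The cutoff trade-off can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the scores and the true/false flags below are invented, and in practice the flags would come from manually reviewing a sample of scored pairs. It is not how Link Plus or any other package computes or reports its scores.

```python
# Hypothetical reviewed pairs: (linkage score, whether review judged it a true match).
reviewed_pairs = [
    (24.1, True), (21.7, True), (18.3, True), (15.0, True),
    (12.4, False), (11.9, True), (9.2, False), (7.5, True),
    (6.1, False), (3.8, False),
]


def summarize_cutoff(pairs, cutoff):
    """Count accepted links at a cutoff, and how many are wrong or missed."""
    accepted = [is_match for score, is_match in pairs if score >= cutoff]
    true_links = sum(accepted)                  # correct links accepted
    false_links = len(accepted) - true_links    # incorrect links accepted
    missed = sum(is_match for _, is_match in pairs) - true_links
    return {
        "cutoff": cutoff,
        "accepted": len(accepted),
        "true_links": true_links,
        "false_links": false_links,
        "missed": missed,
    }


# Lower cutoffs accept more pairs but let in more incorrect links.
for cutoff in (5, 10, 15, 20):
    print(summarize_cutoff(reviewed_pairs, cutoff))
```

With these made-up numbers, a cutoff of 5 accepts every true match but also three incorrect ones, while a cutoff of 15 accepts no incorrect links but misses two true matches.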
Dennis Deck, PhD
RMC Research Corporation
111 SW Columbia Street, Suite 1200
Portland, Oregon 97201-5843
voice: 503-223-8248 x715
voice: 800-788-1887 x715