Date:         Tue, 5 Mar 2002 15:02:41 -0800
Reply-To:     Sigurd Wilson Hermansen <hermans1@WESTAT.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Sigurd Wilson Hermansen <hermans1@WESTAT.COM>
Organization: http://groups.google.com/
Subject:      Re: Unique Patient ID for merging Healthcare / Rx Data
Content-Type: text/plain; charset=ISO-8859-1

This thread has mixed together two issues. First, name plus birthdate does not guarantee a distinct identifier for each person in any set of records. Neither does name plus birthdate plus the person's sex. If two persons named Mary Jones have the same date of birth, knowing the sex of each probably won't help you distinguish one from the other. Second, names have variations, and those transcribing names and birthdates tend to make lots of errors. Records for the same person may not link because the "face values" of the names in the two records do not match exactly.

Using more independent data items will help distinguish records for the same person from records for different people. Cleaning and standardizing data used to link records will help reduce the risk of failure to link records for the same person. Unfortunately, adding more data items increases the risk that errors in key values will hide true matches, while blurring distinctions among key values (say, by transforming a last name with the SOUNDEX() function) increases the risk of false matches.
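
To make the trade-off concrete, here is a minimal Python sketch of the classic American Soundex coding. It is an illustration only, not SAS code, and its output may differ in detail from what the SAS SOUNDEX() function returns. The function name soundex and the sample surnames are assumptions chosen for the example.

```python
# Letter groups of the classic American Soundex coding.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return a 4-character Soundex code (illustrative sketch)."""
    # Keep only the letters A-Z, in upper case.
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]                 # the first letter is kept as-is
    prev = CODES.get(letters[0], "")    # its code still suppresses a repeat
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code:
            if code != prev:            # collapse runs of the same code
                result += code
            prev = code
        elif ch not in "HW":
            prev = ""                   # a vowel (or Y) separates repeated codes
        # H and W are skipped and do not separate repeated codes
    return (result + "000")[:4]         # pad with zeros, truncate to 4


# A spelling variant of one surname gets the same code (helps linkage) ...
same_person = soundex("Hermansen") == soundex("Hermanson")
# ... but so do two different surnames (risk of a false match).
different_people = soundex("Jones") == soundex("Johns")
```

Both comparisons come out true: the blurred key recovers the spelling variant, but it also makes Jones and Johns indistinguishable, which is the false-match risk described above.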

The Fuzzy Key Linkage paper that I wrote for the last SeUGI meetings in Florence discusses the use of 'alternative keys', including how to mitigate the computational burden of fuzzy methods. Variations on this theme include merge/purge programs and services for mailing lists, SAS Institute's DataFlux products and SAS Proc DQ, and probabilistic record linkage methods.

I have not found any easy and straightforward solutions to your problem. You might start by partitioning your datasets into three sets: unique matches (likely correct), multiple matches (no better than partially correct), and non-matches (suspect at least a few failures to match if the non-match rate exceeds the rate you expected). You can add qualifiers to the multiple matches to distinguish them. Fuzzy key link methods might help you resolve the last set.
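
As a rough sketch of that partitioning step, written in Python rather than SAS purely for illustration: the record layouts, the field names (name, dob, rx_id), the sample rows, and the helper names link_key and partition_matches are all hypothetical. The key here is just an upper-cased, trimmed name plus birthdate, so it shows the bookkeeping only, not a recommended linkage key.

```python
from collections import defaultdict


def link_key(rec):
    # Hypothetical key: trimmed, upper-cased name plus date of birth.
    return (rec["name"].strip().upper(), rec["dob"])


def partition_matches(left, right, key=link_key):
    """Split left records into unique, multiple, and non-matches."""
    left_groups, right_groups = defaultdict(list), defaultdict(list)
    for rec in left:
        left_groups[key(rec)].append(rec)
    for rec in right:
        right_groups[key(rec)].append(rec)

    unique, multiple, unmatched = [], [], []
    for k, lrecs in left_groups.items():
        rrecs = right_groups.get(k, [])
        if not rrecs:
            unmatched.extend(lrecs)              # no candidate on the right
        elif len(lrecs) == 1 and len(rrecs) == 1:
            unique.append((lrecs[0], rrecs[0]))  # one-to-one on this key
        else:
            multiple.append((lrecs, rrecs))      # ambiguous: needs qualifiers
    return unique, multiple, unmatched


# Made-up sample records, for illustration only.
claims = [
    {"name": "Mary Jones", "dob": "1960-01-15"},
    {"name": "John Smith", "dob": "1955-07-04"},
    {"name": "Ann Lee", "dob": "1971-03-09"},
]
rx = [
    {"name": "MARY JONES", "dob": "1960-01-15", "rx_id": 1},
    {"name": "Mary Jones ", "dob": "1960-01-15", "rx_id": 2},
    {"name": "John Smith", "dob": "1955-07-04", "rx_id": 3},
    {"name": "Anne Lee", "dob": "1971-03-09", "rx_id": 4},
]

unique, multiple, unmatched = partition_matches(claims, rx)
```

With these rows, John Smith links one-to-one, Mary Jones lands in the multiple-match set because two prescription records share her key, and Ann Lee falls into the non-matches because of a spelling variant (Anne), which is the kind of case fuzzy keys are meant to recover. Note that this sketch reports non-matches for the left-hand dataset only.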

Sig

"Dave Meyer" <dmeyer@hoaghospital.org> wrote in message
news:<6034be04e5e875dac0937c7002f39bdd.54534@mygate.mailgate.org>...
> Ron,
> I haven't heard of the Soundex algorithm...but I would like to check it
> out - any suggestions on where to start learning about it?
> TNX man,
> Dave

