Date: Tue, 5 Mar 2002 15:02:41 -0800
Reply-To: Sigurd Wilson Hermansen <hermans1@WESTAT.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Sigurd Wilson Hermansen <hermans1@WESTAT.COM>
Subject: Re: Unique Patient ID for merging Healthcare / Rx Data
This discussion has mixed together two separate issues. First,
name plus birthdate do not guarantee a distinct identifier for each
person in any set of records. Neither does name plus birthdate plus
the person's sex. If two persons named Mary Jones have the same date
of birth, knowing the sex of each probably won't help you distinguish
one from the other. Second, names vary in spelling and form, and those
transcribing names and birthdates tend to make lots of errors. Records
for the same person may not link because the "face values" of the
names in the two records do not match exactly.
Using more independent data items will help distinguish records for
the same person from records for different people. Cleaning and
standardizing data used to link records will help reduce the risk of
failure to link records for the same person. Unfortunately, adding
more data items increases the risk that errors in key values will
hide true matches, while blurring distinctions among key values (say,
by transforming a last name with the SOUNDEX() function) increases the
risk of false matches.
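
To make that trade-off concrete, here is a minimal sketch of the classic
four-character American Soundex code. It is written in Python as a stand-in,
not in SAS, and the SAS SOUNDEX() function may differ in details such as the
length of the code it returns, so treat it as an illustration of the idea
only. The function name soundex and the sample names are assumptions made up
for this example.

```python
# Classic American Soundex (letter + 3 digits). Illustrative sketch only;
# not a description of what the SAS SOUNDEX() function returns.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return the 4-character Soundex code for a name ('' if no letters)."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    out = [first]
    prev = CODES.get(first, "")      # the first letter's code also suppresses repeats
    for c in letters[1:]:
        code = CODES.get(c, "")      # vowels, H, W and Y carry no code
        if code and code != prev:
            out.append(code)
        if c not in "HW":            # H and W do not separate equal codes
            prev = code
    return ("".join(out) + "000")[:4]


# A transposition typo still links to the intended name ...
print(soundex("Jones"), soundex("Jonse"))   # J520 J520
# ... but a different surname collapses to the same key as well.
print(soundex("James"))                     # J520
```

The first pair shows the benefit (a keying error no longer hides a true
match); the last line shows the cost (Jones and James become
indistinguishable on this key alone).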
The Fuzzy Key Linkage paper that I wrote for the last SeUGI meetings
in Florence discusses the use of 'alternative keys', including how to
mitigate the computational burden of fuzzy methods. Variations on this
theme include merge/purge programs and services for mailing lists,
SAS Institute's DataFlux products and SAS Proc DQ, and probabilistic record
linkage.
I have not found any easy and straightforward solutions to your
problem. You might start by partitioning your datasets into a set of
unique matches (likely correct), multiple matches (no better than
partially correct), and non-matches (suspect at least a few failures to
match if the non-match rate exceeds the rate you expect). You can add
qualifiers to the multiple matches to distinguish among them. Fuzzy key
linkage methods might help you resolve the last set.
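
The partitioning step might be sketched as follows. This is Python used as a
stand-in for a SAS data step or Proc SQL, under stated assumptions: the
function partition_links, the field names, and the sample records are all
hypothetical, and the key is an exact match on name plus birthdate. It counts
candidates from the left file's side only, so a "unique" match here can still
share its partner with another left record.

```python
from collections import defaultdict


def partition_links(left, right, key):
    """Split `left` records by how many `right` records share their key."""
    index = defaultdict(list)
    for rec in right:
        index[key(rec)].append(rec)

    unique, multiple, unmatched = [], [], []
    for rec in left:
        hits = index.get(key(rec), [])
        if len(hits) == 1:
            unique.append((rec, hits[0]))      # likely correct
        elif hits:
            multiple.append((rec, hits))       # needs extra qualifiers
        else:
            unmatched.append(rec)              # candidates for fuzzy keys
    return unique, multiple, unmatched


# Hypothetical sample data, invented for illustration.
patients = [
    {"name": "MARY JONES", "dob": "1960-01-02"},
    {"name": "JOHN SMITH", "dob": "1955-07-30"},
    {"name": "ANN LEE", "dob": "1971-11-09"},
]
members = [
    {"id": 1, "name": "MARY JONES", "dob": "1960-01-02"},
    {"id": 2, "name": "MARY JONES", "dob": "1960-01-02"},
    {"id": 3, "name": "JOHN SMITH", "dob": "1955-07-30"},
    {"id": 4, "name": "ANNE LEE", "dob": "1971-11-09"},
]


def name_dob(rec):
    return (rec["name"], rec["dob"])


unique, multiple, unmatched = partition_links(patients, members, name_dob)
print(len(unique), len(multiple), len(unmatched))   # 1 1 1
```

In this toy data John Smith links uniquely, Mary Jones has two candidates,
and Ann Lee fails to link because the other file spells the name ANNE; that
last record is the kind a fuzzy key might recover.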
"Dave Meyer" <firstname.lastname@example.org> wrote in message news:<email@example.com>...
> I haven't heard of the Soundex algorithm...but I would like to check it
> out - any suggestions on where to start learning about it?
> TNX man,