=========================================================================
Date: Mon, 17 Jul 2006 14:28:03 -0400
Reply-To: Richard Ristow <wrristow@mindspring.com>
Sender: "SPSSX(r) Discussion" <SPSSX-L@LISTSERV.UGA.EDU>
From: Richard Ristow <wrristow@mindspring.com>
Subject: Re: FW: Identifying cases that almost match
In-Reply-To: <AEA6D84B49CB764DA4CE5DC54FD7E07F01340B77@qwizmail.previsor.com>
Content-Type: text/plain; charset=us-ascii; format=flowed
At 06:23 PM 7/16/2006, Snider-Lotz, Tom wrote:
>I'm trying to identify cases that may belong to the same individuals,
>even though their name might be entered slightly differently in the
>different records (e.g., Ben Jones and Benjamin Jones). It just
>occurred to me that I can easily solve my problem by using the
>Duplicate Cases utility to find duplicates for the variable
>ShortWholeName that I've created via the syntax.
>
>String ShortWholeName (a30).
>Compute ShortWholeName = Concat (RTRIM(Lname), ", ",
>SUBSTR(Fname,1,3)).
That's more or less how you do it: create a key that's broader - more
permissive about matching - than is the one you're having trouble with.
There's no magic. You risk false matches, though you're using a pretty
strict key that won't get many. "Robert" will match "Robin", and "Samuel"
will match "Samantha". But requiring a strict match on the last name will
eliminate most of those. (Worst likely case is siblings in families
that like to use similar names for
You also risk false negatives, continuing to miss true matches. In your
case, I'd worry more about that: "William" won't match "Bill",
"Elizabeth" won't match "Betty", and any variation in spelling of the
last name will spoil the match. (You may also find ambiguity about what
name is the first. I'm "Walter Richard Ristow." You know me as "Richard
Ristow", but occasional lists have me as "Walter.")
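One way to reduce the nickname misses, sketched under assumptions: map a
few known nicknames to a canonical first name before building the key.
FnameStd and StdKey are hypothetical names, the nickname list is only
illustrative, and the sketch assumes Fname fits in 30 characters. It does
nothing for spelling variation in the last name.

```spss
* Sketch only. FnameStd and StdKey are hypothetical names.
* The nickname list is illustrative, not complete.
STRING FnameStd (A30).
COMPUTE FnameStd = UPCASE(LTRIM(Fname)).
* Map nicknames to one canonical form before truncating.
IF (ANY(FnameStd, 'BILL', 'BILLY', 'WILL')) FnameStd = 'WILLIAM'.
IF (ANY(FnameStd, 'BETTY', 'BETH', 'LIZ')) FnameStd = 'ELIZABETH'.
* Build the permissive key from the standardized first name.
STRING StdKey (A30).
COMPUTE StdKey = CONCAT(RTRIM(UPCASE(Lname)), ", ", SUBSTR(FnameStd,1,3)).
EXECUTE.
```

With this, "Bill" and "William" both contribute "WIL" to the key, so the
duplicate check can run on StdKey instead of ShortWholeName. Every mapping
you add is also a new chance of a false match, so the list is worth
reviewing by hand.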
Strategy depends on how big your file is, how much work it's worth
investing, and how many keys you have; for example, you can look for
people who match on address but not on name, if you have address.
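The address idea might look something like this, as a rough sketch.
Address, AddrKey and NameDiffers are hypothetical names (I don't know
what your file has); ShortWholeName is the key you already built.

```spss
* Sketch only. Address, AddrKey and NameDiffers are hypothetical names.
STRING AddrKey (A60).
COMPUTE AddrKey = UPCASE(LTRIM(Address)).
SORT CASES BY AddrKey ShortWholeName.
* Flag a case whose address equals the previous case's but whose name key differs.
COMPUTE NameDiffers = 0.
IF (AddrKey NE ' ' AND AddrKey = LAG(AddrKey)
    AND ShortWholeName NE LAG(ShortWholeName)) NameDiffers = 1.
EXECUTE.
```

This flags only the later record of each adjacent pair, and it will
flag different people in the same household, so treat the flagged
cases as candidates for review rather than as matches.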
That can be a long story, though, since you then need criteria for
evaluating the quality - likelihood of being correct - of matches that
meet various combinations of criteria. I did one of these, in SAS, some