|
And no, this is not a Sci-Fi question ;-).
Am asking on behalf of a colleague who is doing a project involving matching
based on Social Security numbers. The client wants to match two files using
the Social Security number and wants to detect as possible matches cases
where position by position 7 of the nine digits match. I am presuming that
is because they want to account for the possibility of transposed digits
(but don't know that for sure).
I've come up with what I call a brute force solution that uses the sum
function in the where clause to count position by position how many
characters match. Based on the sample below, it seems to work.
But I am wondering if any of the RegEx gurus on the list can comment on
whether this can be done with RegEx.
TIA. My sample data and quick and dirty brute force code follows.
data one;
input ssn $9.;
datalines;
123456789
987654321
121212121
343434343
;
data two;
input ssn $9.;
datalines;
123456789
987645321
121212121
232323232
;
proc sql;
create table matches as
select a.ssn as ssn_one,
b.ssn as ssn_two,
case
when (a.ssn ne b.ssn) then 'Not Exact'
else 'Exact'
end as MatchResults
from one a, two b
where sum(substr(a.ssn,1,1)=substr(b.ssn,1,1),
substr(a.ssn,2,1)=substr(b.ssn,2,1),
substr(a.ssn,3,1)=substr(b.ssn,3,1),
substr(a.ssn,4,1)=substr(b.ssn,4,1),
substr(a.ssn,5,1)=substr(b.ssn,5,1),
substr(a.ssn,6,1)=substr(b.ssn,6,1),
substr(a.ssn,7,1)=substr(b.ssn,7,1),
substr(a.ssn,8,1)=substr(b.ssn,8,1),
substr(a.ssn,9,1)=substr(b.ssn,9,1)
) ge 7;
quit;
Don Henderson
Don.Henderson@hcsbi.com
http://www.hcsbi.com
(301) 570-5530 (office)
(301) 980-9027 (cell)
|