LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (January 2009, week 4)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:   Thu, 22 Jan 2009 15:55:11 -0500
Reply-To:   Mike Rhoads <RHOADSM1@WESTAT.COM>
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   Mike Rhoads <RHOADSM1@WESTAT.COM>
Subject:   Re: Regex - Seven of Nine Matching
Comments:   To: Don Henderson <donaldjhenderson@HOTMAIL.COM>
In-Reply-To:   <000301c97ccb$eaa188f0$0601a8c0@zencos.com>
Content-Type:   text/plain; charset="us-ascii"

Don,

One other possibility to think about is the COMPARE function, which simply compares two strings and returns the position of the first non-matching character (or 0 for an exact match).

So, my approach would be:

1. Call COMPARE once to try to find the first difference. 2. If another difference is possible (there was a difference prior to the last character in the string), call it a second time for the remaining (rightmost) portion of the string. 3. Similar to 2, call it a 3rd time if necessary.

If you find a 3rd difference, you don't have a match by your 7-of-9 rule. Otherwise you do.

Note that you need to use ABS with COMPARE, since it may return either a positive or negative result (depending on which of the two strings has the "greater" value).

Good luck!

Mike Rhoads RhoadsM1@Westat.com

-----Original Message----- From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of Don Henderson Sent: Thursday, January 22, 2009 2:59 PM To: SAS-L@LISTSERV.UGA.EDU Subject: Re: Regex - Seven of Nine Matching

Sig, Muthia and Howard,

Thanks for the alternative suggestions.

However, the requirement is 7 of the 9 positions match. And I believe that this is mandated by some regulatory rule. So alternative matching algorithms, even if they might be "better" in an abstract sense are not an option. It was just speculation on my part as to the reason (transposition) for 7 of the 9 match; it was not part of the requirement.

Given the specifics of the requirements, I am curious as to the performance issues and so I will compare the RegEx, complev and brute force method. The online doc says that COMPLEV is faster than COMPGED which is faster than SPEDIS for a comparison that COMPLEV can do.

Thanks again for the suggestions however.

Regards, donh

-----Original Message----- From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of Sigurd Hermansen Sent: Thursday, January 22, 2009 1:31 PM To: SAS-L@LISTSERV.UGA.EDU Subject: Re: Regex - Seven of Nine Matching

Don: I've generally used the SPEDIS() function (or expressions involving SPEDIS()) for comparing two SSN during the past decade, and I find SPEDIS() a better measure of matching than any of the "number of matching digits" measures. It also works very fast. It does a "cost of rearrangement" calculation that assigns low costs to simple transpositions and exchanges of characters. The expression I use adjusts for lengths of strings (for cases where a SSN has fewer than nine digits, such as when a person records only the last four digits) and imposes symmetry of comparison (because SPEDIS() can yield different results for comparisons of the reverses of two SSN). As a rule a SPEDIS() score of <= 0.05 indicates either a recording error in one or both of two SSN or SSN for two siblings (since persons applying at the same time for SSN may receive consecutive numbers). S

-----Original Message----- From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of Don Henderson Sent: Thursday, January 22, 2009 12:36 PM To: SAS-L@LISTSERV.UGA.EDU Subject: Re: Regex - Seven of Nine Matching

Karma and Chang,

Thanks for these suggestions. The idea of a distance function (e.g., COMPLEV) had not occurred to me.

I don't have access to the real data my colleague will be using (he is working in a locked down environment as it is sensitive financial data), but I will create some volume test data and run some experiments to compare the performance of the three approaches and will report back.

Thanks again! Donh

-----Original Message----- From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of karma Sent: Thursday, January 22, 2009 12:10 PM To: SAS-L@LISTSERV.UGA.EDU Subject: Re: Regex - Seven of Nine Matching

" don't think that this type of problem can be solved using regexp "

Proven wrong in zero time :(

2009/1/22 Chang Chung <chang_y_chung@hotmail.com>: > On Thu, 22 Jan 2009 10:22:02 -0500, Don Henderson > <donaldjhenderson@HOTMAIL.COM> wrote: > >>And no, this is not a Sci-Fi question ;-). >> >>Am asking on behalf of a colleague who is doing a project involving > matching >>based on Social Security numbers. The client wants to match two files > using >>the Social Security number and wants to detect as possible matches >>cases where position by position 7 of the nine digits match. I am >>presuming that is because they want to account for the possibility of >>transposed digits (but don't know that for sure). >> >>I've come up with what I call a brute force solution that uses the >>sum function in the where clause to count position by position how >>many characters match. Based on the sample below, it seems to work. >> >>But I am wondering if any of the RegEx gurus on the list can comment >>on whether this can be done with RegEx. >> >>TIA. My sample data and quick and dirty brute force code follows. >> >>data one; >> input ssn $9.; >>datalines; >>123456789 >>987654321 >>121212121 >>343434343 >>; >>data two; >> input ssn $9.; >>datalines; >>123456789 >>987645321 >>121212121 >>232323232 >>; >>proc sql; >> create table matches as >> select a.ssn as ssn_one, >> b.ssn as ssn_two, >> case >> when (a.ssn ne b.ssn) then 'Not Exact' >> else 'Exact' >> end as MatchResults >> from one a, two b >> where sum(substr(a.ssn,1,1)=substr(b.ssn,1,1), >> substr(a.ssn,2,1)=substr(b.ssn,2,1), >> substr(a.ssn,3,1)=substr(b.ssn,3,1), >> substr(a.ssn,4,1)=substr(b.ssn,4,1), >> substr(a.ssn,5,1)=substr(b.ssn,5,1), >> substr(a.ssn,6,1)=substr(b.ssn,6,1), >> substr(a.ssn,7,1)=substr(b.ssn,7,1), >> substr(a.ssn,8,1)=substr(b.ssn,8,1), >> substr(a.ssn,9,1)=substr(b.ssn,9,1) >> ) ge 7; >>quit; > > hi, don, > here is one way. see matches2 below. sas 9.1.3. hth. > cheers, > chang > > > data one; > input ssn $9.; > datalines; > 123456789 > 987654321 > 121212121 > 343434343 > ; > data two; > input ssn $9.; > datalines; > 123456789 > 987645321 > 121212121 > 232323232 > ; > proc sql; > create table matches as > select a.ssn as ssn_one, > b.ssn as ssn_two, > case > when (a.ssn ne b.ssn) then 'Not Exact' > else 'Exact' > end as MatchResults > from one a, two b > where sum(substr(a.ssn,1,1)=substr(b.ssn,1,1), > substr(a.ssn,2,1)=substr(b.ssn,2,1), > substr(a.ssn,3,1)=substr(b.ssn,3,1), > substr(a.ssn,4,1)=substr(b.ssn,4,1), > substr(a.ssn,5,1)=substr(b.ssn,5,1), > substr(a.ssn,6,1)=substr(b.ssn,6,1), > substr(a.ssn,7,1)=substr(b.ssn,7,1), > substr(a.ssn,8,1)=substr(b.ssn,8,1), > substr(a.ssn,9,1)=substr(b.ssn,9,1) > ) ge 7; > > /* using prx */ > create table matches2 as > select a.ssn as ssn_one > , b.ssn as ssn_two > , ifc(a.ssn=b.ssn,'Exact','Not Exact') as MatchResults > length=9 > from one a, two b > where countc(prxchange("s/(.)(?=(.{8}\1))/M/" > , 9, catt(a.ssn,b.ssn)),"M")>=7; quit; > > proc compare base=matches compare=matches2; run; > /* on lst in part: > NOTE: No unequal values were found. All values compared are exactly > equal. */ >


Back to: Top of message | Previous page | Main SAS-L page