LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (January 2009, week 4)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Thu, 22 Jan 2009 21:39:03 -0500
Reply-To:     Sigurd Hermansen <HERMANS1@WESTAT.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Sigurd Hermansen <HERMANS1@WESTAT.COM>
Subject:      Re: Regex - Seven of Nine Matching
Comments: To: Don Henderson <donaldjhenderson@HOTMAIL.COM>
In-Reply-To:  <000001c97cfd$daacfd30$0401a8c0@8300xpserver>
Content-Type: text/plain; charset="us-ascii"

Don: Generally fuzzy matching of SSN or other identifiers turns out to be a Cartesian product problem; if one has an ID that can be used to join two datasets, one doesn't need a SSN to link records for the same person in two datasets.

A 7 of 9 digit SSN match rule won't identify matching records perfectly, but may do well enough. Expansion of the number of rows won't make much difference so long as the datasets used in comparisons remain very narrow. 1-M matching will only expand the number of potential matches. Watch out for M-M matching, though, because the yield of a join will explode. Many linkage problems have duplicates in data and inadvertent M-M matches.

I've worked with Paul D. on similar problems for the last ten years. I'd look to him for technical expertise. Any neophyte in the field of fuzzy or probabilistic record linkage has a steep learning curve to climb. The literature goes back more than thirty years. S

-----Original Message----- From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of Don Henderson Sent: Thursday, January 22, 2009 8:57 PM To: SAS-L@LISTSERV.UGA.EDU Subject: Re: Regex - Seven of Nine Matching

Sig, Paul, Art,

Thanks again for the insights.

And permit me to acknowledge that I probably should have asked a few more questions before posting this question as a favor to my colleague.

First, let me answer Paul's question that yes performance is the number one criteria. My brute force approach using a cartesian product was simply a first cut at testing the viability of the idea. But there is nothing about the problem or the client that requires SQL; so a data step solution is an option. And I had thought about an approach using SET with the POINT as Paul showed in his example but figured I would defer that until later.

Now to provide some additional details about the data and what is being looked for. I can't say too much about them other than to say:

- the first file is expected to contain upwards of 3M distinct SSNs - the second one could contain anywhere from 3M to 5M rows - it is expected that most rows could/should have an exact match, but there could be SSNs in the first file that are not in the second, and vice versa. - the expected matching is that there is a (0 or 1)-M relationship between the two files where M could be 0, 1, 2, etc. - what is still unknown (my colleague needs to get clarification from the client) is whether once an SSN in the second file matches exactly to an SSN in the first file is it excluded from the 7/9 matching

Clarification of the latter point is a key item before further investigation as it will dramatically impact the number of potential matches. While I don't want to even think about the real combinations and permutations, if we had all possible 9 digit numbers (clearly not the case, but makes this rough approximation easier to think about), then by just looking at as single position, we could have (I think) 81 possible matches (9 other potential values for each of the nine positions) for each row in the first file. Factor in a second position and I think we can all see where this is going.

As of now I think I am going to assume that if an exact match is found for an SSN in the second file that it can be excluded from further analysis. The business logic behind the 1-M matching suggests that this might be a reasonable assumption.

So again, thx for all the great ideas. As I said earlier, once I get a better handle on the scope of the data and refined requirements for how exact matches are handled, I will evaluate the alternatives suggested and report back.

Thanks again, -donh

> -----Original Message----- > From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of > Sigurd Hermansen > Sent: Thursday, January 22, 2009 6:34 PM > To: SAS-L@LISTSERV.UGA.EDU > Subject: Re: Regex - Seven of Nine Matching > > The crush of actual work has kept me from coming out to play on the 'L > as much as I would like. (Don't you hate that!) Paul's solution has > both inspired me to take a few minutes off to think about the problem > from a different angle and come up with a solution. > > As a first thought, I would conclude that a solution in the c or c++ > language would work far faster than a SAS solution simply because c > treats strings as character arrays and iterating through a series of > character arrays should be an order faster than substringing strings. > Paul's char() method helps reduce that advantage. > > I next wondered where I would hide if I were a more efficient database > programming solution. I immediately thought of an efficient database > operation (a join) and a way to implement one (transposing character > strings). > > This SQL solution requires a couple of scans to set up implicit array > indexes. It also minimizes the data values being carried from data > sources to the solution. It won't perform O(n), but it will scale up > to at least tens of thousands of SSN. For really large-scale problems, > one of those hyperactive hash object solutions could approach O(n). > I'll leave that to Paul, Greg, Chang, or one of the other uberhashers. > > data one; > input ssn $9.; > datalines; > 123456789 > 987654321 > 121212121 > 343434343 > ; > data two; > input ssn $9.; > datalines; > 123456789 > 987645321 > 121212121 > 232323232 > ; > data one1; > do until(EOF); > i1+1; > set one end=EOF; > do d1=1 to 9; > v1=char(ssn,d1); > output; > end; > end; > run; > data two2; > do until(EOF); > i2+1; > set two end=EOF; > do d2=1 to 9; > v2=char(ssn,d2); > output; > end; > end; > run; > proc sql; > create table matches as > select i1,i2,sum(m) as count > from (select i1,i2,d1,d2,(v1=v2) as m > from one1 as t1,two2 as t2 > where t1.d1=t2.d2 > ) > group by i1,i2 having sum(m)>=7 > ; > quit; > > -----Original Message----- > From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of > Paul Dorfman > Sent: Thursday, January 22, 2009 5:19 PM > To: SAS-L@LISTSERV.UGA.EDU > Subject: Re: Regex - Seven of Nine Matching > > > Don, > > If it is performance you are primarily after (which would be logical > to seek trying to process a Cartesian product > explosion) and you still want to remain confined to SQL, I would > suggest you change from substr(x,y,1) to char(x,y). Since char() > returns a single byte, that avoids gobs of the idle blank-to-blank > comparisons. > > This simple change results in saving almost half the time in my tests > (with 9.1.3 under AIX), which I have done on your own sample data with > one record added to each file, then exploded 1000 times, so that 5000 > "ssn" numbers are compared to 5000. For example, you SQL step with > SUBSTR() ran 130 seconds, while the same step using CHAR() ran 77 > seconds. > > If you/your friend are not confined to SQL, I would rather use the > DATA step for the simple reason that its logic (absent from SQL) > allows to short-circuit the process of comparing a ssn1*ssn2 pair any > time the number of > *non-matching* digits exceeds 2, and even before that never consider a > comparison if the strings are equal. For example, > > data matches (drop = _:) ; > set one (rename = (ssn = ssn1)) ; > do p = 1 to n ; > set two (rename = (ssn = ssn2)) point = p nobs = n ; > _ne_cnt = 0 ; > if ssn1 ne ssn2 then do ; > ExactMatch = "N" ; > do _cnt = 1 to 9 while (_ne_cnt <= 2) ; > if char (ssn1, _cnt) ne char (ssn2, _cnt) then _ne_cnt ++ > 1 ; > end ; > if _cnt <= 9 then continue ; > end ; > else ExactMatch = "Y" ; > if _ne_cnt <= 2 then output ; > end ; > run ; > > cuts the SQL processing time down to 47 seconds. Eschewing the > re-reading of the second data set by placing it into an array > beforehand, > > data matches3 (drop = _:) ; > array s [99999] $ 9 _temporary_ ; > if _n_ = 1 then do _n_ = 1 to n ; > set two nobs = n; > s [_n_] = ssn ; > end ; > set one (rename = (ssn = ssn1)) ; > do _n_ = 1 to n ; > ssn2 = s[_n_] ; > _ne_cnt = 0 ; > if ssn1 ne ssn2 then do ; > ExactMatch = "N" ; > do _cnt = 1 to 9 while (_ne_cnt <= 2) ; > if char (ssn1, _cnt) ne char (ssn2, _cnt) then _ne_cnt ++ > 1 ; > end ; > if _cnt <= 9 then continue ; > end ; > else ExactMatch = "Y" ; > if _ne_cnt <= 2 then output ; > end ; > run ; > > reduces the run time further down to 32 seconds, meaning that a rather > simple DATA step logic allows increasing performance four-fold from > the original point. > > I rather like Karma's COMPLEV() approach for its elegance, and it > greatly simplifies the code, but unfortunately it is quite > computationally intensive in addition to the obvious fact that it has > to consider all the character in both strings being compared > unconditionally. I have not tested regexen, but am pretty much sure > they would fair no better for the same reason. Methinks any method > failing to cut on the number of byte-to-byte comparisons is not going > to do better than your "brute force". The only reason all the > "improvements" suggested above do work is that they, > including char() vs substr(), simply let SAS avoid extra > work. For the same reason, Mike Rhoads' suggestions are very > likely to improve run times as well. Given the simplistic > nature of any Lazy Susan logic, it is effectively a brute > force, only applied in a different direction. > > The performance difference between 30 and 130 seconds is still not a > big deal for one willing to wait an extra minute, but that may very > well change if instead of 5000*5000 SSNs to compare, each file were > mere 10 times longer. That would increase the processing time 100 > fold, and we would be talking about the run time from 4 hours to 1. > > Kind regards > ------------ > Paul Dorfman > Jax, FL > ------------ > > > On Thu, 22 Jan 2009 14:59:21 -0500, Don Henderson > <donaldjhenderson@HOTMAIL.COM> wrote: > > >Sig, Muthia and Howard, > > > >Thanks for the alternative suggestions. > > > >However, the requirement is 7 of the 9 positions match. And > I believe > >that this is mandated by some regulatory rule. So > alternative matching > >algorithms, even if they might be "better" in an abstract > sense are not > >an option. It was just speculation on my part as to the reason > (transposition) > >for 7 of the 9 match; it was not part of the requirement. > > > >Given the specifics of the requirements, I am curious as to the > performance > >issues and so I will compare the RegEx, complev and brute > force method. > The > >online doc says that COMPLEV is faster than COMPGED which is faster > >than SPEDIS for a comparison that COMPLEV can do. > > > >Thanks again for the suggestions however. > > > >Regards, > >donh > > > >-----Original Message----- > >From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of > Sigurd > >Hermansen > >Sent: Thursday, January 22, 2009 1:31 PM > >To: SAS-L@LISTSERV.UGA.EDU > >Subject: Re: Regex - Seven of Nine Matching > > > >Don: > >I've generally used the SPEDIS() function (or expressions involving > >SPEDIS()) for comparing two SSN during the past decade, and I find > SPEDIS() > >a better measure of matching than any of the "number of matching > >digits" measures. It also works very fast. It does a "cost of > >rearrangement" calculation that assigns low costs to simple > >transpositions and exchanges > of > >characters. The expression I use adjusts for lengths of strings (for > >cases where a SSN has fewer than nine digits, such as when a person > >records only the last four digits) and imposes symmetry of > comparison > >(because SPEDIS() can yield different results for comparisons of the > >reverses of two SSN). > As > >a rule a SPEDIS() score of <= 0.05 indicates either a > recording error > >in > one > >or both of two SSN or SSN for two siblings (since persons > applying at > >the same time for SSN may receive consecutive numbers). S > > > >-----Original Message----- > >From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of > >Don Henderson > >Sent: Thursday, January 22, 2009 12:36 PM > >To: SAS-L@LISTSERV.UGA.EDU > >Subject: Re: Regex - Seven of Nine Matching > > > > > >Karma and Chang, > > > >Thanks for these suggestions. The idea of a distance function (e.g., > >COMPLEV) had not occurred to me. > > > >I don't have access to the real data my colleague will be > using (he is > >working in a locked down environment as it is sensitive financial > >data), > but > >I will create some volume test data and run some experiments > to compare > the > >performance of the three approaches and will report back. > > > >Thanks again! > >Donh > > > >-----Original Message----- > >From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of > >karma > >Sent: Thursday, January 22, 2009 12:10 PM > >To: SAS-L@LISTSERV.UGA.EDU > >Subject: Re: Regex - Seven of Nine Matching > > > >" don't think that this type of problem can be solved using regexp " > > > >Proven wrong in zero time :( > > > >2009/1/22 Chang Chung <chang_y_chung@hotmail.com>: > >> On Thu, 22 Jan 2009 10:22:02 -0500, Don Henderson > >> <donaldjhenderson@HOTMAIL.COM> wrote: > >> > >>>And no, this is not a Sci-Fi question ;-). > >>> > >>>Am asking on behalf of a colleague who is doing a project involving > >> matching > >>>based on Social Security numbers. The client wants to > match two files > >> using > >>>the Social Security number and wants to detect as possible matches > >>>cases where position by position 7 of the nine digits match. I am > >>>presuming that is because they want to account for the > possibility of > >>>transposed digits (but don't know that for sure). > >>> > >>>I've come up with what I call a brute force solution that > uses the > >>>sum function in the where clause to count position by position how > >>>many characters match. Based on the sample below, it seems to work. > >>> > >>>But I am wondering if any of the RegEx gurus on the list > can comment > >>>on whether this can be done with RegEx. > >>> > >>>TIA. My sample data and quick and dirty brute force code follows. > >>> > >>>data one; > >>> input ssn $9.; > >>>datalines; > >>>123456789 > >>>987654321 > >>>121212121 > >>>343434343 > >>>; > >>>data two; > >>> input ssn $9.; > >>>datalines; > >>>123456789 > >>>987645321 > >>>121212121 > >>>232323232 > >>>; > >>>proc sql; > >>> create table matches as > >>> select a.ssn as ssn_one, > >>> b.ssn as ssn_two, > >>> case > >>> when (a.ssn ne b.ssn) then 'Not Exact' > >>> else 'Exact' > >>> end as MatchResults > >>> from one a, two b > >>> where sum(substr(a.ssn,1,1)=substr(b.ssn,1,1), > >>> substr(a.ssn,2,1)=substr(b.ssn,2,1), > >>> substr(a.ssn,3,1)=substr(b.ssn,3,1), > >>> substr(a.ssn,4,1)=substr(b.ssn,4,1), > >>> substr(a.ssn,5,1)=substr(b.ssn,5,1), > >>> substr(a.ssn,6,1)=substr(b.ssn,6,1), > >>> substr(a.ssn,7,1)=substr(b.ssn,7,1), > >>> substr(a.ssn,8,1)=substr(b.ssn,8,1), > >>> substr(a.ssn,9,1)=substr(b.ssn,9,1) > >>> ) ge 7; > >>>quit; > >> > >> hi, don, > >> here is one way. see matches2 below. sas 9.1.3. hth. cheers, chang > >> > >> > >> data one; > >> input ssn $9.; > >> datalines; > >> 123456789 > >> 987654321 > >> 121212121 > >> 343434343 > >> ; > >> data two; > >> input ssn $9.; > >> datalines; > >> 123456789 > >> 987645321 > >> 121212121 > >> 232323232 > >> ; > >> proc sql; > >> create table matches as > >> select a.ssn as ssn_one, > >> b.ssn as ssn_two, > >> case > >> when (a.ssn ne b.ssn) then 'Not Exact' > >> else 'Exact' > >> end as MatchResults > >> from one a, two b > >> where sum(substr(a.ssn,1,1)=substr(b.ssn,1,1), > >> substr(a.ssn,2,1)=substr(b.ssn,2,1), > >> substr(a.ssn,3,1)=substr(b.ssn,3,1), > >> substr(a.ssn,4,1)=substr(b.ssn,4,1), > >> substr(a.ssn,5,1)=substr(b.ssn,5,1), > >> substr(a.ssn,6,1)=substr(b.ssn,6,1), > >> substr(a.ssn,7,1)=substr(b.ssn,7,1), > >> substr(a.ssn,8,1)=substr(b.ssn,8,1), > >> substr(a.ssn,9,1)=substr(b.ssn,9,1) > >> ) ge 7; > >> > >> /* using prx */ > >> create table matches2 as > >> select a.ssn as ssn_one > >> , b.ssn as ssn_two > >> , ifc(a.ssn=b.ssn,'Exact','Not Exact') as MatchResults > >> length=9 > >> from one a, two b > >> where countc(prxchange("s/(.)(?=(.{8}\1))/M/" > >> , 9, catt(a.ssn,b.ssn)),"M")>=7; quit; > >> > >> proc compare base=matches compare=matches2; run; > >> /* on lst in part: > >> NOTE: No unequal values were found. All values compared > are exactly > >> equal. */ > >> >


Back to: Top of message | Previous page | Main SAS-L page