LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (December 2000, week 3)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Tue, 19 Dec 2000 16:31:38 -0500
Reply-To:     "Bross, Dean S" <dean.bross@MAIL.VA.GOV>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         "Bross, Dean S" <dean.bross@MAIL.VA.GOV>
Subject:      Re: How to efficiently look up and classify permutations?
Comments: To: "Vyverman, Koen" <koen.vyverman@FID-INTL.COM>
Content-Type: text/plain; charset="iso-8859-1"

I can suggest one way to go about this problem.

Go through the data set creating a new key which is derived by sorting the characters in WORD into order by their ASCII representation. This will have to be done with data step programming, breaking WORD into an array of numbers containing the ASCII values and writing out the statements to sort this array. Then rebuild the result into a new character string which is added to the data set.

Then you can simply sort by this new character string. All combinations of WORD which are permutations of each other will sort together. You can use FIRST. and LAST. to put the results together in any way you'd like.

Dean Bross Department of Veterans Affairs dean.bross@mail.va.gov

-----Original Message----- From: Vyverman, Koen [mailto:koen.vyverman@FID-INTL.COM] Sent: Monday, December 18, 2000 12:53 PM To: SAS-L@LISTSERV.UGA.EDU Subject: Q: How to efficiently look up and classify permutations?

LS,

I'm currently faced with a quite daunting problem involving permut- ations. It appears daunting to me because _if_ there exists a clever solution, I fail to see it; and if not, it all becomes a matter of efficient coding, which is -- I admit -- not really my forte ;-) So, here's an appeal to the collective intellect of the SAS-L efficiency gurus:

I have a number of SAS data sets containing 1 character variable WORD of length 51. The number of observations ranges from a few thousands up to about 90000. Within each data set, the values of WORD are unique and case-sensitive. They may contain up to 51 characters drawn from the following set of 56: the hyphen [-], the single straight quote ['], the forward and backward slashes [/\], the lowercase alphabet [a..z], and the uppercase alphabet [A..Z]. Blanks do not occur, and a given character may appear multiple times in the same WORD value. Data sets are unsorted.

I wish to enhance each data set with a number of variables describing for each WORD value whether there exist other WORDs within the same data set that are identical to the current WORD, modulo a number of transpositions, i.e. swaps of two characters. The bit of code below will generate a sample data set WORK.WORDLIST to illustrate my purpose. The data are fake, but carefully chosen to highlight the issues in- volved:

data wordlist; informat word $51. swaps01-swaps03 5. found01-found03 $500. ; input word= ; cards; word=moo swaps01=1 found01=omo word=omo swaps01=1 found01=moo word=roo word=Roo word=zoo swaps01=1 found01=ooz word=ooz swaps01=1 found01=zoo word=doc swaps01=2 found01=cod dco word=cod swaps01=1 found01=doc swaps02=1 found02=dco word=dco swaps01=1 found01=doc swaps02=1 found02=cod word=proc swaps01=2 found01=porc crop swaps02=1 found02=pocr word=porc swaps01=2 found01=proc pocr swaps03=1 found03=crop word=crop swaps01=1 found01=proc swaps03=2 found03=pocr porc word=pocr swaps01=1 found01=porc swaps02=1 found02=proc swaps03=1 found03=crop word=ya'-f swaps02=1 found02='ya-f word='ya-f swaps02=1 found02=ya'-f word=will-o'-the-wisp swaps01=1 found01=will-o-'the-wisp word=will-o-'the-wisp swaps01=1 found01=will-o'-the-wisp ; run;

data wordlist(drop=i); set wordlist; array numz(i) _numeric_; do over numz; if numz=. then numz=0; end; run;

How to read all this? Consider e.g. the first two obs. They form a sub- group having length 3 and consisting of the same characters in the same quantities: one 'm' and two times 'o'. Hence they can be mapped onto one another by means of a permutation which in its turn consists of a number of successively applied character swaps. This is detailed in the SWAPSnn and FOUNDnn variables. For 'moo', SWAPS01=1 tells us that one word was found that can be transformed into 'moo' by just one swapping operation. FOUND01 tells us what it is: 'omo'.

If more than one word needs to go into a FOUNDnn variable, they should be separated by a space. The SWAPSnn variables are not essential, as one could simply count space-delimited words in the corresponding FOUNDnn values, but it would be nice to have them actually.

Obviously, 'omo' can also be transformed into 'moo' by applying more than 1 transposition. This however is of no interest: the SWAPSnn and FOUNDnn variables should precisely partition the elements of the sub- group the current WORD is in (without listing the WORD itself, that is). As hinted at in the above, what I call 'sub-group' is a subset of the word-list containing those words having exactly the same character content.

Thus, e.g. the sub-group consisting of 'proc', 'porc', 'crop', and 'pocr' has its SWAPSnn variables adding up to 3 for each of its con- stituent words. In general, each member of a sub-group with n members should have the sum of its SWAPS variables equalling n-1.

Note that due to the case distinction, 'roo' and 'Roo' form distinct sub-groups ...

The idea is to start with the data set WORDLIST having only the WORD variable, and to add only as many SWAPSnn and FOUNDnn variables to it as is strictly necessary. Note that, since the maximal word-length is 51, we might potentially need to add as many as 50 SWAPSnn and 50 FOUNDnn variables. As can be seen from the desired output WORDLIST, although the maximum word-length there is 16 (last 2 obs), only SWAPS01- SWAPS03 and FOUND01-FOUND03 were added, since nothing was found that takes more than three swapping operations anyway ...

BTW, the last two obs should discourage any solutions that scale with the factorial of the word-length :-) (just try to imagine 50!)

A word on the context of all this: it's basically part of a data scrub- bing exercise. I have vast quantities of manually entered data here, with substantial numbers of typos lurking inside. Professional typists will most of the time hit the right keys, but not always in the correct order. I'm trying to build a tool that will allow a worker to easily spot potential mistakes by showing what permutations of any given word appear in the wordlist.

The environment is SAS 8.1 on WinNT. Plenty of disk space, plenty of RAM, but not an infinity of time :-) Anyone willing to give this one a try?

Kind Regards, Koen Vyverman.


Back to: Top of message | Previous page | Main SAS-L page