Date: Tue, 19 Dec 2000 16:31:38 -0500
Reply-To: "Bross, Dean S" <dean.bross@MAIL.VA.GOV>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "Bross, Dean S" <dean.bross@MAIL.VA.GOV>
Subject: Re: How to efficiently look up and classify permutations?
Content-Type: text/plain; charset="iso-8859-1"
I can suggest one way to go about this problem.
Go through the data set creating a new key which is
derived by sorting the characters in WORD into order by
their ASCII representation. This will have to be done
with data step programming, breaking WORD into an array
of numbers containing the ASCII values and writing out
the statements to sort this array. Then rebuild the
result into a new character string which is added to the
data set.
Then you can simply sort by this new character string. All
combinations of WORD which are permutations of each other
will sort together. You can use FIRST. and LAST. to
put the results together in any way you'd like.
Dean Bross
Department of Veterans Affairs
dean.bross@mail.va.gov
-----Original Message-----
From: Vyverman, Koen [mailto:koen.vyverman@FID-INTL.COM]
Sent: Monday, December 18, 2000 12:53 PM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Q: How to efficiently look up and classify permutations?
LS,
I'm currently faced with a quite daunting problem involving permut-
ations. It appears daunting to me because _if_ there exists a clever
solution, I fail to see it; and if not, it all becomes a matter of
efficient coding, which is -- I admit -- not really my forte ;-)
So, here's an appeal to the collective intellect of the SAS-L
efficiency gurus:
I have a number of SAS data sets containing 1 character variable WORD
of length 51. The number of observations ranges from a few thousands
up to about 90000. Within each data set, the values of WORD are unique
and case-sensitive. They may contain up to 51 characters drawn from
the following set of 56: the hyphen [-], the single straight quote ['],
the forward and backward slashes [/\], the lowercase alphabet [a..z],
and the uppercase alphabet [A..Z]. Blanks do not occur, and a given
character may appear multiple times in the same WORD value. Data sets
are unsorted.
I wish to enhance each data set with a number of variables describing
for each WORD value whether there exist other WORDs within the same
data set that are identical to the current WORD, modulo a number of
transpositions, i.e. swaps of two characters. The bit of code below
will generate a sample data set WORK.WORDLIST to illustrate my purpose.
The data are fake, but carefully chosen to highlight the issues in-
volved:
data wordlist;
informat
word $51.
swaps01-swaps03 5.
found01-found03 $500.
;
input
word=
;
cards;
word=moo swaps01=1 found01=omo
word=omo swaps01=1 found01=moo
word=roo
word=Roo
word=zoo swaps01=1 found01=ooz
word=ooz swaps01=1 found01=zoo
word=doc swaps01=2 found01=cod dco
word=cod swaps01=1 found01=doc swaps02=1 found02=dco
word=dco swaps01=1 found01=doc swaps02=1 found02=cod
word=proc swaps01=2 found01=porc crop swaps02=1 found02=pocr
word=porc swaps01=2 found01=proc pocr swaps03=1 found03=crop
word=crop swaps01=1 found01=proc swaps03=2 found03=pocr porc
word=pocr swaps01=1 found01=porc swaps02=1 found02=proc swaps03=1
found03=crop
word=ya'-f swaps02=1 found02='ya-f
word='ya-f swaps02=1 found02=ya'-f
word=will-o'-the-wisp swaps01=1 found01=will-o-'the-wisp
word=will-o-'the-wisp swaps01=1 found01=will-o'-the-wisp
;
run;
data wordlist(drop=i);
set wordlist;
array numz(i) _numeric_;
do over numz;
if numz=. then numz=0;
end;
run;
How to read all this? Consider e.g. the first two obs. They form a sub-
group having length 3 and consisting of the same characters in the same
quantities: one 'm' and two times 'o'. Hence they can be mapped onto one
another by means of a permutation which in its turn consists of a number
of successively applied character swaps. This is detailed in the SWAPSnn
and FOUNDnn variables. For 'moo', SWAPS01=1 tells us that one word was
found that can be transformed into 'moo' by just one swapping operation.
FOUND01 tells us what it is: 'omo'.
If more than one word needs to go into a FOUNDnn variable, they should
be separated by a space. The SWAPSnn variables are not essential, as
one could simply count space-delimited words in the corresponding
FOUNDnn values, but it would be nice to have them actually.
Obviously, 'omo' can also be transformed into 'moo' by applying more
than 1 transposition. This however is of no interest: the SWAPSnn and
FOUNDnn variables should precisely partition the elements of the sub-
group the current WORD is in (without listing the WORD itself, that is).
As hinted at in the above, what I call 'sub-group' is a subset of the
word-list containing those words having exactly the same character
content.
Thus, e.g. the sub-group consisting of 'proc', 'porc', 'crop', and
'pocr' has its SWAPSnn variables adding up to 3 for each of its con-
stituent words. In general, each member of a sub-group with n members
should have the sum of its SWAPS variables equalling n-1.
Note that due to the case distinction, 'roo' and 'Roo' form distinct
sub-groups ...
The idea is to start with the data set WORDLIST having only the WORD
variable, and to add only as many SWAPSnn and FOUNDnn variables to it as
is strictly necessary. Note that, since the maximal word-length is
51, we might potentially need to add as many as 50 SWAPSnn and 50
FOUNDnn variables. As can be seen from the desired output WORDLIST,
although the maximum word-length there is 16 (last 2 obs), only SWAPS01-
SWAPS03 and FOUND01-FOUND03 were added, since nothing was found that
takes more than three swapping operations anyway ...
BTW, the last two obs should discourage any solutions that scale with
the factorial of the word-length :-) (just try to imagine 50!)
A word on the context of all this: it's basically part of a data scrub-
bing exercise. I have vast quantities of manually entered data here,
with substantial numbers of typos lurking inside. Professional typists
will most of the time hit the right keys, but not always in the correct
order. I'm trying to build a tool that will allow a worker to easily
spot potential mistakes by showing what permutations of any given word
appear in the wordlist.
The environment is SAS 8.1 on WinNT. Plenty of disk space, plenty of
RAM, but not an infinity of time :-) Anyone willing to give this one
a try?
Kind Regards,
Koen Vyverman.