Date: Wed, 29 Dec 1999 21:01:53 -0500
Reply-To: "Dorfman, Paul" <pdorfma@CITICORP.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "Dorfman, Paul" <pdorfma@CITICORP.COM>
Subject: Reality Check (was: Subselect of sas dataset)
Content-Type: text/plain; charset="iso-8859-1"
To reply privately, please write to <sashole@earthlink.net>
-----------------------------------------------------------
Subject: Reality Check (was: Subselect of sas dataset)
Respondent: Paul Dorfman
Summary: Life is short
Dirk Cosijn <dirk.cosijn@village.uunet.be> asked:
-----------------
I have a sas dataset with 250 obs. Each obs has two variables (key and
amount). There's 1 combination of 78 obs with a total amount of 250.000 I
lost a second dataset which i have used to make the combination of 78 obs.
Now I'm trying to find the combination of 78 obs (total amount=250.000)
again without my second dataset. Any idea how i can make different
combinations of 78 obs until i find a total amount of 250.000
---------------
Dirk has received a number of responses, among them, a proposal to use
POINT= option for making the combinations to be tested. Robert Virgile
responded, in part, with the following:
---------------
...As a preliminary step, convert the observations to variables:
proc transpose data=in out=all250 prefix=v;
var amount;
This gives you v1 through v250 to work with, where you are searching for all
combinations that sum to a given total. (I'm assuming you don't have any
missing values here.) As others have noted, runtime may be outrageous. But
since you are working on one observation instead of
reading many with point= on a set statement, this approach should be
workable. (If you try it, please post the results including the operating
system and runtime. You might want to test this with 30 instead of 250
variables.)...There are 250 do loops, and 250 matching end statements. Macro
language should generate them fairly easily:
data combos;
set all250;
do w1=., v1;
do w2=., v2;
do w3=., v3;
...
do w249=., v249;
do w250=., v250;
if n(of w1-w250)=78 and sum(of w1-w250)=250 then output;
end;
end;
end;
...
end;
end;
------------------
Bob emphasises that run-time may turn out to be outrageous. However, he
still thinks that if disk access is replaced by memory access, the sledge
hammer approach (i.e.: create all possible combinations of 78 elements out
of 250 and test each of them) should be workable. Whether the latter is true
or not, depends on the amount of 'rage' the outrageous run-time contains.
Before commencing on programming or testing anything, let us first do a
reality check.
The number of (78,250) combinations is well-known: C = 250!/78!/(250-78)! =
1.3374243E66. Imagine a Y3K-compatible computer running at 1 PHz (1E15 Hz)
and SAS compiler producing such a machine code that the inner-loop
instruction testing for the equality to 250 takes mere 1 millionth of a
cycle. Let us further pretend that we are lucky to have encountered the
right combination having only sifted through first 1 billionth of all
available combinations. Then the time necessary to arrive at it will be
about
1E66 * 1E-15 * 1E-6 * 1E-9 = 1E36 seconds.
Given that the time elapsed since the Big Bang is estimated at 1E19 seconds,
the sledge hammer approach as applied to the numbers M=78 and N=250 should
be considered doubtful at best rather than workable even under the fantastic
conditions assumed above, and run-time on any machine regardless of
platform, architecture, and number of processors will certainly be
outrageous, no matter what units are used to measure the 'rage', for the
program will run as long as _any_ computer is on. In the real world, solving
problems of this nature requires much narrower scope, a lot more
information, and a sizable portion of good heuristics; in other words,
clever algorithms not relying on raw power. For instance, if we knew that
the amounts are limited-range, non-zero natural numbers, a real-world
workable scheme could probably be found after a healthy workout through the
"Seminumerical Algorithms" by D.Knuth.
Of course, with 30 observations Bob offered for testing, the solution is
practically workable, as the maximum number of the combinations to be tested
reached at M=15 is 'only' C=155,117,520. However, I cannot agree with the
proposed method of creating/testing the combinations from the standpoint of
efficiency. Instead of creating only the combinations with the given number
of items M the program creates all combinations with the number of elements
from 1 to M. It increases the number of times the inner loop instruction is
executed from C = N!/M!/(M-N)! to C=2**N, i.e. M! times. Thus, it renders C
independent from M, which is apparently illogical: In the case M=1 (1-item
combinations) or M=N (N-item combination) only N and 1 combinations should
be tested, respectively, yet we have to test all 2**N. Second, the program
therefore resorts to the superfluous test n(of w1-w30)=M intended to
eliminate the combinations with the number of items less than M. Finally,
the heart of the inner loop (invariably executed 2**N times) is burdened by
SAS functions accepting variable lists, which, as a matter of rule, run
extremely slowly. Eliminating all this overhead may make the program execute
a bit faster. How much faster? As life is too short to test with M=15 and
N=30, I have only tested a program based on Bob's idea (implemented by me)
and the version representing the way I think combinations should be composed
and evaluated, with M=7 and N=25. I have made the test data so contrived
that the right combination is encountered last, and so both programs run
under the worst case input scenario:
44 %let n = 25 ;
45 %let m = 7 ;
46 %let s = %eval(19+20+21+22+23+24+25);
47
48 data a; do amt=1 to &n; output; end; run;
NOTE: The data set WORK.A has 25 observations and 1 variables.
NOTE: The DATA statement used 0.00 CPU seconds and 4990K.
49
50 %macro allcomb(m,n,s);
51 %local i;
52 drop w1-w&n;
53 %do i=1 %to &n; do w&i=., v&i; %end;
54 if n(of w1-w&n)=&m and sum(of w1-w&n)=&s then output;
55 %do i=1 %to &n; end; %end;
56 %mend allcomb;
57
58 proc transpose data=a out=t prefix=v; var amt; run;
NOTE: The data set WORK.T has 1 observations and 26 variables.
NOTE: The PROCEDURE TRANSPOSE used 0.00 CPU seconds and 4898K.
59
60 data b;
61 set t;
62 %allcomb (&m,&n,&s);
63 run;
NOTE: The data set WORK.B has 1 observations and 26 variables.
NOTE: The DATA statement used 151.72 CPU seconds and 4990K.
64
65 %macro mncomb(m,n,s);
66 %local i;
67 drop x0-x&m;
68 x0 = 0;
69 %do i=1 %to &m;
70 do x&i=1+x%eval(&i-1) to %eval(&n-&m+&i);
71 %end;
72 if 0
73 %do i=1 %to &m; +v(x&i) %end;
74 = &s then output;
75 %do i=1 %to &m; end; %end;
76 %mend mncomb;
77
78 data comb;
79 array v(&n);
80 do i=1 to &n;
81 set a;
82 v(i) = amt;
83 end;
84 %mncomb(&m,&n,&s);
85 stop;
86 run;
NOTE: The data set WORK.COMB has 1 observations and 27 variables.
NOTE: The DATA statement used 0.89 CPU seconds and 4990K.
Note that with M=7 and N=30, the macro %MNCOMB manages to evaluate the
corresponding 2035800 combinations in 3.66 CPU seconds , but the job running
%ALLCOMB got cancelled after exceeding the class limit of 600 CPU seconds.
Oh, yeah, operating system and platform: OS/390 on S/390 G5 R56 Enterprise
Server.
Happy New Year, everyone!
Kind regards,
=============================
Paul M. Dorfman
CitiCorp Universal Card
Jacksonville, FL
=============================