LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (December 1999, week 5)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Wed, 29 Dec 1999 21:01:53 -0500
Reply-To:     "Dorfman, Paul" <pdorfma@CITICORP.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         "Dorfman, Paul" <pdorfma@CITICORP.COM>
Subject:      Reality Check (was: Subselect of sas dataset)
Content-Type: text/plain; charset="iso-8859-1"

To reply privately, please write to <sashole@earthlink.net> -----------------------------------------------------------

Subject: Reality Check (was: Subselect of sas dataset) Respondent: Paul Dorfman Summary: Life is short

Dirk Cosijn <dirk.cosijn@village.uunet.be> asked:

----------------- I have a sas dataset with 250 obs. Each obs has two variables (key and amount). There's 1 combination of 78 obs with a total amount of 250.000 I lost a second dataset which i have used to make the combination of 78 obs. Now I'm trying to find the combination of 78 obs (total amount=250.000) again without my second dataset. Any idea how i can make different combinations of 78 obs until i find a total amount of 250.000 ---------------

Dirk has received a number of responses, among them, a proposal to use POINT= option for making the combinations to be tested. Robert Virgile responded, in part, with the following:

--------------- ...As a preliminary step, convert the observations to variables:

proc transpose data=in out=all250 prefix=v; var amount;

This gives you v1 through v250 to work with, where you are searching for all combinations that sum to a given total. (I'm assuming you don't have any missing values here.) As others have noted, runtime may be outrageous. But since you are working on one observation instead of reading many with point= on a set statement, this approach should be workable. (If you try it, please post the results including the operating system and runtime. You might want to test this with 30 instead of 250 variables.)...There are 250 do loops, and 250 matching end statements. Macro language should generate them fairly easily:

data combos; set all250; do w1=., v1; do w2=., v2; do w3=., v3; ... do w249=., v249; do w250=., v250; if n(of w1-w250)=78 and sum(of w1-w250)=250 then output; end; end; end; ... end; end; ------------------

Bob emphasises that run-time may turn out to be outrageous. However, he still thinks that if disk access is replaced by memory access, the sledge hammer approach (i.e.: create all possible combinations of 78 elements out of 250 and test each of them) should be workable. Whether the latter is true or not, depends on the amount of 'rage' the outrageous run-time contains. Before commencing on programming or testing anything, let us first do a reality check.

The number of (78,250) combinations is well-known: C = 250!/78!/(250-78)! = 1.3374243E66. Imagine a Y3K-compatible computer running at 1 PHz (1E15 Hz) and SAS compiler producing such a machine code that the inner-loop instruction testing for the equality to 250 takes mere 1 millionth of a cycle. Let us further pretend that we are lucky to have encountered the right combination having only sifted through first 1 billionth of all available combinations. Then the time necessary to arrive at it will be about

1E66 * 1E-15 * 1E-6 * 1E-9 = 1E36 seconds.

Given that the time elapsed since the Big Bang is estimated at 1E19 seconds, the sledge hammer approach as applied to the numbers M=78 and N=250 should be considered doubtful at best rather than workable even under the fantastic conditions assumed above, and run-time on any machine regardless of platform, architecture, and number of processors will certainly be outrageous, no matter what units are used to measure the 'rage', for the program will run as long as _any_ computer is on. In the real world, solving problems of this nature requires much narrower scope, a lot more information, and a sizable portion of good heuristics; in other words, clever algorithms not relying on raw power. For instance, if we knew that the amounts are limited-range, non-zero natural numbers, a real-world workable scheme could probably be found after a healthy workout through the "Seminumerical Algorithms" by D.Knuth.

Of course, with 30 observations Bob offered for testing, the solution is practically workable, as the maximum number of the combinations to be tested reached at M=15 is 'only' C=155,117,520. However, I cannot agree with the proposed method of creating/testing the combinations from the standpoint of efficiency. Instead of creating only the combinations with the given number of items M the program creates all combinations with the number of elements from 1 to M. It increases the number of times the inner loop instruction is executed from C = N!/M!/(M-N)! to C=2**N, i.e. M! times. Thus, it renders C independent from M, which is apparently illogical: In the case M=1 (1-item combinations) or M=N (N-item combination) only N and 1 combinations should be tested, respectively, yet we have to test all 2**N. Second, the program therefore resorts to the superfluous test n(of w1-w30)=M intended to eliminate the combinations with the number of items less than M. Finally, the heart of the inner loop (invariably executed 2**N times) is burdened by SAS functions accepting variable lists, which, as a matter of rule, run extremely slowly. Eliminating all this overhead may make the program execute a bit faster. How much faster? As life is too short to test with M=15 and N=30, I have only tested a program based on Bob's idea (implemented by me) and the version representing the way I think combinations should be composed and evaluated, with M=7 and N=25. I have made the test data so contrived that the right combination is encountered last, and so both programs run under the worst case input scenario:

44 %let n = 25 ; 45 %let m = 7 ; 46 %let s = %eval(19+20+21+22+23+24+25); 47 48 data a; do amt=1 to &n; output; end; run;

NOTE: The data set WORK.A has 25 observations and 1 variables. NOTE: The DATA statement used 0.00 CPU seconds and 4990K. 49 50 %macro allcomb(m,n,s); 51 %local i; 52 drop w1-w&n; 53 %do i=1 %to &n; do w&i=., v&i; %end; 54 if n(of w1-w&n)=&m and sum(of w1-w&n)=&s then output; 55 %do i=1 %to &n; end; %end; 56 %mend allcomb; 57 58 proc transpose data=a out=t prefix=v; var amt; run;

NOTE: The data set WORK.T has 1 observations and 26 variables. NOTE: The PROCEDURE TRANSPOSE used 0.00 CPU seconds and 4898K. 59 60 data b; 61 set t; 62 %allcomb (&m,&n,&s); 63 run;

NOTE: The data set WORK.B has 1 observations and 26 variables. NOTE: The DATA statement used 151.72 CPU seconds and 4990K. 64 65 %macro mncomb(m,n,s); 66 %local i;

67 drop x0-x&m; 68 x0 = 0; 69 %do i=1 %to &m; 70 do x&i=1+x%eval(&i-1) to %eval(&n-&m+&i); 71 %end; 72 if 0 73 %do i=1 %to &m; +v(x&i) %end; 74 = &s then output; 75 %do i=1 %to &m; end; %end; 76 %mend mncomb; 77 78 data comb; 79 array v(&n); 80 do i=1 to &n; 81 set a; 82 v(i) = amt; 83 end; 84 %mncomb(&m,&n,&s); 85 stop; 86 run; NOTE: The data set WORK.COMB has 1 observations and 27 variables. NOTE: The DATA statement used 0.89 CPU seconds and 4990K.

Note that with M=7 and N=30, the macro %MNCOMB manages to evaluate the corresponding 2035800 combinations in 3.66 CPU seconds , but the job running %ALLCOMB got cancelled after exceeding the class limit of 600 CPU seconds. Oh, yeah, operating system and platform: OS/390 on S/390 G5 R56 Enterprise Server.

Happy New Year, everyone!

Kind regards, ============================= Paul M. Dorfman CitiCorp Universal Card Jacksonville, FL =============================


Back to: Top of message | Previous page | Main SAS-L page