Date: Thu, 3 Jul 2003 16:21:06 -0700
Reply-To: cassell.david@EPAMAIL.EPA.GOV
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "David L. Cassell" <cassell.david@EPAMAIL.EPA.GOV>
Subject: Re: A computation question
Content-type: text/plain; charset=us-ascii
Ken Keung <1800okla@HANMAIL.NET> wrote [in part]:
> OBS C1 C2 C3 HOW_MANY TEMP
> 1 1 0 0 1 4
> 2 0 1 0 1 1
> 3 0 0 1 1 3
> 4 1 1 0 2 16
> 5 1 0 1 2 13
> 6 0 1 1 2 9
> 7 1 1 1 3 22
>
> . . . .
>
> Now, I want to produce the sas output like this.
>
> OBS C1 C2 C3 HOW_MANY
> 1 4 . . 1
> 2 . 1 . 1
> 3 . . 3 1
> 4 15 12 . 2
> 5 10 . 9 2
> 6 . 6 8 2
> 7 13 9 6 3
>
> Now the challenge is that I have 20 components, NOT 3 as shown above.
> That means there are 1,048,575 observations
> (2 to the power of 20 minus 1) in the dataset.
> After approximately 20 hours (YES! hours) waiting, my computer (1.7GHZ and
> 512MB memory) couldn't produce the output.
Umm, I hate to sound too critical, but you have made this more difficult
than it needs to be.
[1] You did not state your problem clearly enough. I cannot see precisely
    how you plan to get from your first data set to your second. Do you
    actually have all (2**20-1) * K entries in a single data set, and just
    want to do the subtractions?
[2] You did *not* show your code, so we cannot see what is going wrong.
    You didn't, by chance, try to do a Cartesian join to do all the
    matching, thus making your problem increase vastly in size? Trying to
    merge a million records with a million records is going to take a
    *LOT* longer than trying to merge 2**15-1 records with itself.
[3] The size of your problem increases exponentially with the number of
    components, so of course it is taking significantly longer as the
    number of components goes up. Your algorithm rapidly becomes
    crucial... and you didn't show it to us.
[4] Your algorithm does not appear to have unique solutions, since the
    differences are not constant. You appear to have interactions between
    components which your 'difference' table is omitting. The interactions
    may be the most important part of the process. Are you sure you really
    want to do the analysis like this?
[5] There is no way of telling from your description what your process is,
    or what you are trying to achieve, or why you need these differences,
    or what good they will do you when you are looking at over a million
    of them.
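For concreteness, here is one reading of the two quoted tables, offered as
a guess and not as something the original poster confirmed: each
non-missing cell is the TEMP of that row minus the TEMP of the row with
that one component switched off, with the empty combination assumed to
have TEMP 0. That rule reproduces all seven rows of the toy output. The
sketch below is in Python rather than SAS, purely as a language-neutral
illustration; the names marginal_differences and work_estimates are made
up for this example. It finds each reduced combination by a keyed lookup,
so it needs n * 2**(n-1) subtractions, and it also counts the row pairs
that a Cartesian join of the table with itself would generate.

```python
# Toy data from the quoted post: (C1, C2, C3) -> TEMP.
temp = {
    (1, 0, 0): 4,
    (0, 1, 0): 1,
    (0, 0, 1): 3,
    (1, 1, 0): 16,
    (1, 0, 1): 13,
    (0, 1, 1): 9,
    (1, 1, 1): 22,
}


def marginal_differences(temp):
    """Return {combination: tuple of differences, None where the flag is 0}.

    ASSUMPTION (not confirmed by the original poster): the wanted value for
    a component is TEMP(combination) minus TEMP(combination without that
    component), and the empty combination has TEMP 0.
    """
    result = {}
    for combo, value in temp.items():
        row = []
        for i, flag in enumerate(combo):
            if not flag:
                row.append(None)  # component absent: missing value
                continue
            # Same combination with component i switched off.
            reduced = combo[:i] + (0,) + combo[i + 1:]
            base = temp[reduced] if any(reduced) else 0
            row.append(value - base)
        result[combo] = tuple(row)
    return result


def work_estimates(n):
    """Return (rows, keyed subtractions, Cartesian-join row pairs) for n components."""
    rows = 2 ** n - 1             # every non-empty combination
    lookups = n * 2 ** (n - 1)    # one subtraction per (row, component present)
    pairs = rows * rows           # every row matched against every row
    return rows, lookups, pairs


if __name__ == "__main__":
    for combo, diffs in marginal_differences(temp).items():
        print(combo, diffs)
    print(work_estimates(20))
```

Under this reading, the difference for C1 on the toy data is 4, 15, 10 or
13 depending on which other components are present, which is the
non-constancy raised in [4]. For 20 components the keyed approach needs
about 10.5 million subtractions, while a full Cartesian join would form
about 1.1 trillion row pairs.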
Please write back to the list (not to me personally) with some answers to
the above questions, and try to help us out, so that we may try to help
you out.
HTH,
David
--
David Cassell, CSC
Cassell.David@epa.gov
Senior computing specialist
mathematical statistician