Date: Tue, 7 May 2002 16:27:15 -0400
Reply-To: Asheber Sewalem <sewalem@CDN.CA>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Asheber Sewalem <sewalem@CDN.CA>
Subject: Re: Hashing!
Content-Type: text/plain; charset="iso-8859-1"
Dear Paul and all,
Thanks for your useful comments and suggestions.
Actually the number of variables that I have are not many (at most 50)
and the number of records could reach up to 3 millions. In my case
although I like the idea of array, currently I am using the transpose
option and the program is still running. I will also try the array
option as well.
BTW, I want to send the output to a file. In my previous example the
(var2) is now expanded into five variable. So I want to send it
together with the other variables in a file.
>id _1 _2 _3 _4 _5 ---> the expanded (var2) variable
>
>122 1 2 3 4 .
>123 1 2 3 . .
>124 1 2 3 4 5
>
Thanks in advance.
Asheber
-----Original Message-----
From: Dorfman, Paul [mailto:Paul.Dorfman@BCBSFL.COM]
Sent: Tuesday, May 07, 2002 12:49 PM
To: SAS-L@AKH-WIEN.AC.AT
Subject: Re: Hashing!
Asheber,
It appears that TRANSPOSE can handle it fine, even if the number of
records
is rather large. Let us look at a rather extreme case with 200 distinct
IDs
and oodles of VAR values per ID:
287 data a;
288 do id = 1 to 200 ;
289 do var = 1 to ceil(ranuni(1)*32766) ;
290 output ;
291 end ;
292 end ;
293 run ;
NOTE: The data set WORK.A has 3130565 observations and 2 variables.
NOTE: DATA statement used:
real time 11.29 seconds
user cpu time 0.45 seconds
system cpu time 0.68 seconds
Memory 69k
294
295 proc transpose out = b (drop = _name_) ;
296 var var ;
297 id var ;
298 by id ;
299 run ;
NOTE: There were 3130565 observations read from the data set WORK.A.
NOTE: The data set WORK.B has 200 observations and 32006 variables.
NOTE: PROCEDURE TRANSPOSE used:
real time 1:00.73
user cpu time 46.90 seconds
system cpu time 1.21 seconds
Memory 19944k
On the other hand, if the picture were reversed and I had a whole slew
of
IDs, with a few distinct VAR values per ID, TRANSPOSE would not fare too
shabbily, either:
14 data a;
15 do id = 1 to 600000 ;
16 do var = 1 to ceil(ranuni(1)*9) ;
17 output ;
18 end ;
19 end ;
20 run ;
NOTE: The data set WORK.A has 2997205 observations and 2 variables.
NOTE: DATA statement used:
real time 11.37 seconds
user cpu time 0.73 seconds
system cpu time 0.59 seconds
Memory 69k
21 proc transpose out = b (drop=_name_) ;
22 var var ;
23 id var ;
24 by id ;
25 run ;
NOTE: There were 2997205 observations read from the data set WORK.A.
NOTE: The data set WORK.B has 600000 observations and 10 variables.
NOTE: PROCEDURE TRANSPOSE used:
real time 52.23 seconds
user cpu time 41.06 seconds
system cpu time 0.76 seconds
Memory 95k
Either way, it takes only about a minute on a pretty wussy single-chip
machine I am using to write this and running gobs of other applications.
You
are right that in relative terms, you can save a lot of run time by
using an
array instead of TRANSPOSE, at the expense of a per-pass in order to
size
the array; for example :
72 proc sql noprint ;
73 select max (var) into :size from a ;
74 quit ;
NOTE: PROCEDURE SQL used:
real time 2.15 seconds
user cpu time 1.70 seconds
system cpu time 0.24 seconds
Memory 56k
75
76 data b ( drop = var) ;
77 array _ (1 : &size) ;
78 do until (last.id) ;
79 set a ;
80 by id ;
81 _ (var) = var ;
82 end ;
83 run ;
NOTE: There were 2997205 observations read from the data set WORK.A.
NOTE: The data set WORK.B has 600000 observations and 10 variables.
NOTE: DATA statement used:
real time 12.81 seconds
user cpu time 2.23 seconds
system cpu time 0.64 seconds
Memory 98k
This method may be way preferable if the data are really huge and the
talk
is not about the difference between 12 and 60 seconds, but between 30
minutes and 2.5 hours, say. I usually prefer arrays to TRANSPOSE because
it
is much faster for me to write a Data step code than to recall the
correct
syntax for the proc, plus I feel more in control. You are the one
knowing
your data and overall situation, so should decide between different
approaches balancing all the aspects of what we call efficiency at
large:
programming time, execution time, coding parsimony, maintanence issues,
information delivery timelines, etc.
Kind regards,
================
Paul M. Dorfman
Jacksonville, FL
================
> -----Original Message-----
> From: Asheber Sewalem [mailto:sewalem@CDN.CA]
> Sent: Tuesday, May 07, 2002 8:01 AM
> To: SAS-L@LISTSERV.UGA.EDU
> Subject: Hashing!
>
>
> Hi all,
>
> I have the following data set and program.
>
> data a;
> input id var1;
> cards;
> 122 1
> 122 2
> 122 3
> 122 5
> 123 1
> 123 3
> 123 5
> 124 1
> 124 2
> 124 2
> 124 3
> 124 4
>
> data b;set;
> by id;
> if first.id then var2=0;
> var2+1;
> proc transpose data=b out=b (drop=_name_);
> var var2;
> id var2;
> by id;
> proc print;run;
>
>
> The output looks like the following which is fine.
>
> id _1 _2 _3 _4 _5
>
> 122 1 2 3 4 .
> 123 1 2 3 . .
> 124 1 2 3 4 5
>
>
> The question is that this may not work for huge data set(millions of
> records). Is there another way of handling this kind of situation?
> (hashing or in the form of array)
>
> Thanks in advance.
>
> Asheber
>
Blue Cross Blue Shield of Florida, Inc., and its subsidiary and
affiliate companies are not responsible for errors or omissions in this
e-mail message. Any personal comments made in this e-mail do not reflect
the views of Blue Cross Blue Shield of Florida, Inc.