LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (May 2002, week 1)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Tue, 7 May 2002 16:27:15 -0400
Reply-To:     Asheber Sewalem <sewalem@CDN.CA>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
Comments:     To: "Dorfman, Paul" <Paul.Dorfman@BCBSFL.COM>
From:         Asheber Sewalem <sewalem@CDN.CA>
Subject:      Re: Hashing!
Comments: To: SAS-L@LISTSERV.VT.EDU
Content-Type: text/plain; charset="iso-8859-1"

Dear Paul and all,

Thanks for your useful comments and suggestions.

Actually the number of variables that I have are not many (at most 50) and the number of records could reach up to 3 millions. In my case although I like the idea of array, currently I am using the transpose option and the program is still running. I will also try the array option as well.

BTW, I want to send the output to a file. In my previous example the (var2) is now expanded into five variable. So I want to send it together with the other variables in a file.

>id _1 _2 _3 _4 _5 ---> the expanded (var2) variable > >122 1 2 3 4 . >123 1 2 3 . . >124 1 2 3 4 5 >

Thanks in advance.

Asheber

-----Original Message----- From: Dorfman, Paul [mailto:Paul.Dorfman@BCBSFL.COM] Sent: Tuesday, May 07, 2002 12:49 PM To: SAS-L@AKH-WIEN.AC.AT Subject: Re: Hashing!

Asheber,

It appears that TRANSPOSE can handle it fine, even if the number of records is rather large. Let us look at a rather extreme case with 200 distinct IDs and oodles of VAR values per ID:

287 data a; 288 do id = 1 to 200 ; 289 do var = 1 to ceil(ranuni(1)*32766) ; 290 output ; 291 end ; 292 end ; 293 run ;

NOTE: The data set WORK.A has 3130565 observations and 2 variables. NOTE: DATA statement used: real time 11.29 seconds user cpu time 0.45 seconds system cpu time 0.68 seconds Memory 69k 294 295 proc transpose out = b (drop = _name_) ; 296 var var ; 297 id var ; 298 by id ; 299 run ;

NOTE: There were 3130565 observations read from the data set WORK.A. NOTE: The data set WORK.B has 200 observations and 32006 variables. NOTE: PROCEDURE TRANSPOSE used: real time 1:00.73 user cpu time 46.90 seconds system cpu time 1.21 seconds Memory 19944k

On the other hand, if the picture were reversed and I had a whole slew of IDs, with a few distinct VAR values per ID, TRANSPOSE would not fare too shabbily, either:

14 data a; 15 do id = 1 to 600000 ; 16 do var = 1 to ceil(ranuni(1)*9) ; 17 output ; 18 end ; 19 end ; 20 run ;

NOTE: The data set WORK.A has 2997205 observations and 2 variables. NOTE: DATA statement used: real time 11.37 seconds user cpu time 0.73 seconds system cpu time 0.59 seconds Memory 69k

21 proc transpose out = b (drop=_name_) ; 22 var var ; 23 id var ; 24 by id ; 25 run ;

NOTE: There were 2997205 observations read from the data set WORK.A. NOTE: The data set WORK.B has 600000 observations and 10 variables. NOTE: PROCEDURE TRANSPOSE used: real time 52.23 seconds user cpu time 41.06 seconds system cpu time 0.76 seconds Memory 95k

Either way, it takes only about a minute on a pretty wussy single-chip machine I am using to write this and running gobs of other applications. You are right that in relative terms, you can save a lot of run time by using an array instead of TRANSPOSE, at the expense of a per-pass in order to size the array; for example :

72 proc sql noprint ; 73 select max (var) into :size from a ; 74 quit ; NOTE: PROCEDURE SQL used: real time 2.15 seconds user cpu time 1.70 seconds system cpu time 0.24 seconds Memory 56k 75 76 data b ( drop = var) ; 77 array _ (1 : &size) ; 78 do until (last.id) ; 79 set a ; 80 by id ; 81 _ (var) = var ; 82 end ; 83 run ;

NOTE: There were 2997205 observations read from the data set WORK.A. NOTE: The data set WORK.B has 600000 observations and 10 variables. NOTE: DATA statement used: real time 12.81 seconds user cpu time 2.23 seconds system cpu time 0.64 seconds Memory 98k

This method may be way preferable if the data are really huge and the talk is not about the difference between 12 and 60 seconds, but between 30 minutes and 2.5 hours, say. I usually prefer arrays to TRANSPOSE because it is much faster for me to write a Data step code than to recall the correct syntax for the proc, plus I feel more in control. You are the one knowing your data and overall situation, so should decide between different approaches balancing all the aspects of what we call efficiency at large: programming time, execution time, coding parsimony, maintanence issues, information delivery timelines, etc.

Kind regards, ================ Paul M. Dorfman Jacksonville, FL ================

> -----Original Message----- > From: Asheber Sewalem [mailto:sewalem@CDN.CA] > Sent: Tuesday, May 07, 2002 8:01 AM > To: SAS-L@LISTSERV.UGA.EDU > Subject: Hashing! > > > Hi all, > > I have the following data set and program. > > data a; > input id var1; > cards; > 122 1 > 122 2 > 122 3 > 122 5 > 123 1 > 123 3 > 123 5 > 124 1 > 124 2 > 124 2 > 124 3 > 124 4 > > data b;set; > by id; > if first.id then var2=0; > var2+1; > proc transpose data=b out=b (drop=_name_); > var var2; > id var2; > by id; > proc print;run; > > > The output looks like the following which is fine. > > id _1 _2 _3 _4 _5 > > 122 1 2 3 4 . > 123 1 2 3 . . > 124 1 2 3 4 5 > > > The question is that this may not work for huge data set(millions of > records). Is there another way of handling this kind of situation? > (hashing or in the form of array) > > Thanks in advance. > > Asheber >

Blue Cross Blue Shield of Florida, Inc., and its subsidiary and affiliate companies are not responsible for errors or omissions in this e-mail message. Any personal comments made in this e-mail do not reflect the views of Blue Cross Blue Shield of Florida, Inc.


Back to: Top of message | Previous page | Main SAS-L page