LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (August 2011, week 2)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Tue, 9 Aug 2011 21:13:30 -0400
Reply-To:     Nat Wooding <nathani@VERIZON.NET>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Nat Wooding <nathani@VERIZON.NET>
Subject:      Re: Architectural approach for high performance transpose
In-Reply-To:  <201108100018.p79NWK1u004212@willow.cc.uga.edu>
Content-Type: text/plain; charset="US-ASCII"

Scott

Scott

I tried playing with your code a little but could not get the %seplist part to work.

If I run your code, I get a separate column for each value of ID but with only one column filled with a value. The rest are missing. This comes from having the statement

ID id;

Before I go further with this, let me ask what you want your final data set to look like.

It may well be possible to get rid of a lot of the macro code which would greatly simplify your program.

Thanks

Nat Wooding -----Original Message----- From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of Scott Bass Sent: Tuesday, August 09, 2011 8:18 PM To: SAS-L@LISTSERV.UGA.EDU Subject: Architectural approach for high performance transpose

Hi,

Sorry if this has been asked before; I did try searching the list before posting.

I need to transpose a large dataset (50M+ records). I'd like the performance to be as fast as possible. I'd prefer one pass over the data if possible. I'll consider a data step transpose (array, first./last. processing/explicit output statement), some trickery with hashing, etc. We need to come up with a generic architectural approach that would work across a variety of input datasets.

Getting a faster server is not currently an option ;-)

A working, self-contained example should make this clearer. This requires two passes over the data, one for the transpose and one for the merge.

However, when I benchmarked the "real" data with 1.5M records, the bulk of the time was spent in the transpose; the merge was actually quite quick.

If this is in fact the best approach, is there anything to make the merge quicker? I tried using an index but that didn't work:

proc transpose data=source out=transposed (index=(_name_)); by &by; var &var; id id; run;

then adding "idxwhere=yes" during the merge.

Anyway, the example below gives the desired results, but is there a better approach?

Thanks, Scott

===================================

* Note: get %seplist from http://www.devenezia.com/downloads/sas/macros/index.php?m=seplist ;

%macro loop /*--------------------------------------------------------------------- Invoke the nested macro "%code" over a list of space separated list of items. ---------------------------------------------------------------------*/ (__LIST__ /* Space or character separated list of items (REQ) */ ,DLM=%str( ) /* Delimiter character (REQ). Default is a space. */ ,MNAME=code /* Macro name (Optional). Default is "%code" */ );

%local macro parmerr __iter__ word; %let macro = &sysmacroname;

%* the iterator MUST be unique between macro invocations ; %* if not, nested invocations of this macro cause looping problems ; %* unfortunately, SAS macro does not support truly private variable scoping ;

%let __iter__ = 1; %let word = %scan(%superq(__list__),&__iter__,%superq(dlm)); %do %while (%superq(word) ne %str()); %&mname /* do not indent macro call */ %let __iter__ = %eval(&__iter__+1); %let word = %scan(%superq(__list__),&__iter__,%superq(dlm)); %end;

%mend;

data source; length key1 key2 8 id $1 var1 var2 8; do key1=1 to 10; do key2=1 to 6; id=byte(64+key2); var1=int(ranuni(0)*1000); var2=int(ranuni(0)*1000); output; end; end; run;

* set by and var variables for proc transpose ; * these will be used later for post-processing ; %let by=key1 key2; %let var=var1 var2;

* the bulk of the run time is in the transpose ; proc transpose data=source out=transposed; by &by; var &var; id id; run;

* fast ; proc contents data=transposed out=columns (keep=name) noprint; run;

* fast ; proc sql noprint; select name into :columns separated by " " from columns where upcase(name) not in (%seplist(%upcase(&by) _NAME_ _LABEL_,nest=QQ)) ; quit; %put &columns;

* macro to build rename list (parses &columns) ; %macro rename(prefix); &word=&prefix._&word %mend;

* macro to build dataset list (parses &var) ; %macro datasets; transposed (where=(_name_="&word") rename=(%loop(&columns,mname=rename(&word)))) %mend;

* this is actually pretty fast. ; data merged; merge %loop(&var,mname=datasets) ; by &by; drop _name_; run;

and the log:

13824 * the bulk of the run time is in the transpose ; 13825 proc transpose data=source 13826 out=transposed; 13827 by &by; 13828 var &var; 13829 id id; 13830 run;

NOTE: There were 60 observations read from the data set WORK.SOURCE. NOTE: The data set WORK.TRANSPOSED has 120 observations and 9 variables. NOTE: PROCEDURE TRANSPOSE used (Total process time): real time 0.01 seconds cpu time 0.01 seconds

13831 13832 * fast ; 13833 proc contents data=transposed out=columns (keep=name) noprint; 13834 run;

NOTE: The data set WORK.COLUMNS has 9 observations and 1 variables. NOTE: PROCEDURE CONTENTS used (Total process time): real time 0.03 seconds cpu time 0.01 seconds

13835 13836 * fast ; 13837 proc sql noprint; 13838 select name into :columns separated by " " 13839 from columns 13840 where upcase(name) not in (%seplist(%upcase(&by) _NAME_ _LABEL_,nest=QQ)) MPRINT(SEPLIST): "KEY1","KEY2","_NAME_","_LABEL_" 13841 ; 13842 quit; NOTE: PROCEDURE SQL used (Total process time): real time 0.01 seconds cpu time 0.01 seconds

13843 %put &columns; A B C D E F 13844 13845 * macro to build rename list (parses &columns) ; 13846 %macro rename(prefix); 13847 &word=&prefix._&word 13848 %mend; 13849 13850 * macro to build dataset list (parses &var) ; 13851 %macro datasets; 13852 transposed (where=(_name_="&word") rename=(%loop(&columns,mname=rename(&word)))) 13853 %mend; 13854 13855 * this is actually pretty fast. ; 13856 data merged; 13857 merge 13858 %loop(&var,mname=datasets) MPRINT(DATASETS): transposed (where=(_name_="var1") rename MPRINT(DATASETS): =(A=var1_A B=var1_B C=var1_C D=var1_D E=var1_E F=var1_F)) MPRINT(DATASETS): transposed (where=(_name_="var2") rename MPRINT(DATASETS): =(A=var2_A B=var2_B C=var2_C D=var2_D E=var2_E F=var2_F)) 13859 ; 13860 by &by; 13861 drop _name_; 13862 run;

NOTE: There were 60 observations read from the data set WORK.TRANSPOSED. WHERE _name_='var1'; NOTE: There were 60 observations read from the data set WORK.TRANSPOSED. WHERE _name_='var2'; NOTE: The data set WORK.MERGED has 60 observations and 14 variables. NOTE: DATA statement used (Total process time): real time 0.03 seconds cpu time 0.03 seconds


Back to: Top of message | Previous page | Main SAS-L page