Date: Tue, 9 Aug 2011 21:13:30 -0400
Reply-To: Nat Wooding <nathani@VERIZON.NET>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Nat Wooding <nathani@VERIZON.NET>
Subject: Re: Architectural approach for high performance transpose
In-Reply-To: <201108100018.p79NWK1u004212@willow.cc.uga.edu>
Content-Type: text/plain; charset="US-ASCII"
Scott
Scott
I tried playing with your code a little but could not get the %seplist part
to work.
If I run your code, I get a separate column for each value of ID but with
only one column filled with a value. The rest are missing. This comes from
having the statement
ID id;
Before I go further with this, let me ask what you want your final data set
to look like.
It may well be possible to get rid of a lot of the macro code which would
greatly simplify your program.
Thanks
Nat Wooding
-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of Scott
Bass
Sent: Tuesday, August 09, 2011 8:18 PM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Architectural approach for high performance transpose
Hi,
Sorry if this has been asked before; I did try searching the list before
posting.
I need to transpose a large dataset (50M+ records). I'd like the
performance to be as fast as possible. I'd prefer one pass over the data if
possible. I'll consider a data step transpose (array, first./last.
processing/explicit output statement), some trickery with hashing, etc. We
need to come up with a generic architectural approach that would work across
a variety of input datasets.
Getting a faster server is not currently an option ;-)
A working, self-contained example should make this clearer. This requires
two passes over the data, one for the transpose and one for the merge.
However, when I benchmarked the "real" data with 1.5M records, the bulk of
the time was spent in the transpose; the merge was actually quite quick.
If this is in fact the best approach, is there anything to make the merge
quicker? I tried using an index but that didn't work:
proc transpose data=source
out=transposed (index=(_name_));
by &by;
var &var;
id id;
run;
then adding "idxwhere=yes" during the merge.
Anyway, the example below gives the desired results, but is there a better
approach?
Thanks,
Scott
===================================
* Note: get %seplist from
http://www.devenezia.com/downloads/sas/macros/index.php?m=seplist ;
%macro loop
/*---------------------------------------------------------------------
Invoke the nested macro "%code" over a list of space separated
list of items.
---------------------------------------------------------------------*/
(__LIST__ /* Space or character separated list of items (REQ) */
,DLM=%str( ) /* Delimiter character (REQ). Default is a space. */
,MNAME=code /* Macro name (Optional). Default is "%code" */
);
%local macro parmerr __iter__ word;
%let macro = &sysmacroname;
%* the iterator MUST be unique between macro invocations ;
%* if not, nested invocations of this macro cause looping problems ;
%* unfortunately, SAS macro does not support truly private variable scoping
;
%let __iter__ = 1;
%let word = %scan(%superq(__list__),&__iter__,%superq(dlm));
%do %while (%superq(word) ne %str());
%&mname /* do not indent macro call */
%let __iter__ = %eval(&__iter__+1);
%let word = %scan(%superq(__list__),&__iter__,%superq(dlm));
%end;
%mend;
data source;
length key1 key2 8 id $1 var1 var2 8;
do key1=1 to 10;
do key2=1 to 6;
id=byte(64+key2);
var1=int(ranuni(0)*1000);
var2=int(ranuni(0)*1000);
output;
end;
end;
run;
* set by and var variables for proc transpose ;
* these will be used later for post-processing ;
%let by=key1 key2;
%let var=var1 var2;
* the bulk of the run time is in the transpose ;
proc transpose data=source
out=transposed;
by &by;
var &var;
id id;
run;
* fast ;
proc contents data=transposed out=columns (keep=name) noprint;
run;
* fast ;
proc sql noprint;
select name into :columns separated by " "
from columns
where upcase(name) not in (%seplist(%upcase(&by) _NAME_ _LABEL_,nest=QQ))
;
quit;
%put &columns;
* macro to build rename list (parses &columns) ;
%macro rename(prefix);
&word=&prefix._&word
%mend;
* macro to build dataset list (parses &var) ;
%macro datasets;
transposed (where=(_name_="&word")
rename=(%loop(&columns,mname=rename(&word))))
%mend;
* this is actually pretty fast. ;
data merged;
merge
%loop(&var,mname=datasets)
;
by &by;
drop _name_;
run;
and the log:
13824 * the bulk of the run time is in the transpose ;
13825 proc transpose data=source
13826 out=transposed;
13827 by &by;
13828 var &var;
13829 id id;
13830 run;
NOTE: There were 60 observations read from the data set WORK.SOURCE.
NOTE: The data set WORK.TRANSPOSED has 120 observations and 9 variables.
NOTE: PROCEDURE TRANSPOSE used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
13831
13832 * fast ;
13833 proc contents data=transposed out=columns (keep=name) noprint;
13834 run;
NOTE: The data set WORK.COLUMNS has 9 observations and 1 variables.
NOTE: PROCEDURE CONTENTS used (Total process time):
real time 0.03 seconds
cpu time 0.01 seconds
13835
13836 * fast ;
13837 proc sql noprint;
13838 select name into :columns separated by " "
13839 from columns
13840 where upcase(name) not in (%seplist(%upcase(&by) _NAME_
_LABEL_,nest=QQ))
MPRINT(SEPLIST): "KEY1","KEY2","_NAME_","_LABEL_"
13841 ;
13842 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
13843 %put &columns;
A B C D E F
13844
13845 * macro to build rename list (parses &columns) ;
13846 %macro rename(prefix);
13847 &word=&prefix._&word
13848 %mend;
13849
13850 * macro to build dataset list (parses &var) ;
13851 %macro datasets;
13852 transposed (where=(_name_="&word")
rename=(%loop(&columns,mname=rename(&word))))
13853 %mend;
13854
13855 * this is actually pretty fast. ;
13856 data merged;
13857 merge
13858 %loop(&var,mname=datasets)
MPRINT(DATASETS): transposed (where=(_name_="var1") rename
MPRINT(DATASETS): =(A=var1_A B=var1_B C=var1_C D=var1_D E=var1_E F=var1_F))
MPRINT(DATASETS): transposed (where=(_name_="var2") rename
MPRINT(DATASETS): =(A=var2_A B=var2_B C=var2_C D=var2_D E=var2_E F=var2_F))
13859 ;
13860 by &by;
13861 drop _name_;
13862 run;
NOTE: There were 60 observations read from the data set WORK.TRANSPOSED.
WHERE _name_='var1';
NOTE: There were 60 observations read from the data set WORK.TRANSPOSED.
WHERE _name_='var2';
NOTE: The data set WORK.MERGED has 60 observations and 14 variables.
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
cpu time 0.03 seconds