Date: Tue, 8 Nov 2011 19:14:03 -0600
Reply-To: "Data _null_;" <iebupdte@GMAIL.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "Data _null_;" <iebupdte@GMAIL.COM>
Subject: Re: PROC TRANSPOSE multithreading?
In-Reply-To: <CAEZCysuKq3Li+M_6UDRn9gn86PrJxTpLGc4evG+XKYQiYVEF7w@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1
I made a mistake the SET should be MERGE. and add YEAR to BY statement.
*** Create really wide data;
data big;
do until(last.id);
MERGE a b c;
by id YEAR;
%inc FT36F001 / source2;
end;
run;
proc contents
On Tue, Nov 8, 2011 at 12:33 PM, Data _null_; <iebupdte@gmail.com> wrote:
> If I understand correctly the only thing we need from the data is year.
>
> See if you think this can work for you. You will need to have the
> input data sets sorted or indexed by ID.
>
> proc plan seed=921753645;
> factors id=5 of 20 year=7 of 10 random / noprint;
> output out=a year nvals=(2001 to 2010);
> run;
> factors id=4 of 20 year=7 of 10 random / noprint;
> output out=b year nvals=(2001 to 2010);
> run;
> factors id=8 of 20 year=7 of 10 random / noprint;
> output out=c year nvals=(2001 to 2010);
> run;
> quit;
>
>
> data a;
> merge a(in=in1) sashelp.heart;
> if not in1 then stop;
> run;
> proc sort; by id year; run;
> data b;
> merge b(in=in1) sashelp.shoes;
> if not in1 then stop;
> run;
> proc sort; by id year; run;
> data c;
> merge c(in=in1) sashelp.iris;
> if not in1 then stop;
> run;
> proc sort; by id year; run;
>
>
> *** find the range of years;
> data allV / view=allV;
> set a(keep=year) b(keep=year) c(keep=year);
> run;
> proc sort data=allV nodupkey out=years;
> by year;
> run;
> proc print;
> run;
> proc summary data=years;
> var year;
> output out=years2 min=min max=max;
> run;
>
> *** Get variable info;
> proc sql noprint;
> create table attribs as
> select libname, memname, name, type, length, format
> from dictionary.columns
> where libname eq 'WORK' and memname in('A' 'B' 'C');
> quit;
> proc print;
> run;
> *** Gen the ARRAY and related statements.;
> filename FT36F001 temp;
> data _null_;
> file FT36F001;
> if _n_ eq 1 then set years2;
> set attribs;
> if upcase(name) in('ID' 'YEAR') then delete;
> length len $12;
> len = ifC(type eq: 'c',cats('$',length),cats(length));
> put +3 'Array _' name +(-1) '[' min 4. ':' max 4.'] ' len @64 name
> +(-1) '_' min 4. '-' name +(-1) '_' max 4. ';';
> if not missing(format) then put +3 'Format ' name +(-1) '_:' format ';';
> put +3 'Drop ' name +(-1) ';';
> put +3 '_' name '[year]=' name +(-1) ';';
> run;
>
> *** Create really wide data;
> data big;
> do until(last.id);
> set a b c;
> by id;
> %inc FT36F001 / source2;
> end;
> run;
> proc contents varnum;
> run;
>
>
>
> On 11/8/11, Bolotin Yevgeniy <YBolotin@schools.nyc.gov> wrote:
>>> Honestly though can't you just run your transpose using prefix multiple
>>> times simultaneously?
>>
>> mostly because this chunk of code runs 6-8 hours into a titanic batch job,
>> and the preceeding code generates most of the inputs, so running things by
>> hand doesn't really work
>>
>> not to mention that running 8 separate transposes at once (on
>> multi-gigabyte, multi-million-record datasets) will probably hose hard drive
>> performance (at least much more so than having 8 processors crunching the
>> same dataset would)
>>
>>
>>
>> the separating data into individual datasets does sound like it may help,
>> will try that when I have time to see if it improves performance any,
>> thanks!
>>
>>
>>
>>
>>
>> -----Original Message-----
>> From: Joe Matise [mailto:snoopy369@gmail.com]
>> Sent: Tuesday, November 08, 2011 12:10 PM
>> To: Bolotin Yevgeniy
>> Cc: SAS-L@listserv.uga.edu
>> Subject: Re: PROC TRANSPOSE multithreading?
>>
>> Not 99.44% sure but the way I'd approach it:
>>
>> * Sort by ID
>> * Output each variable as a separate row, just id+varname+value where
>> varname contains the 2007 or whatever
>> * Transpose that dataset BY ID, IDNAME=varname.
>>
>> You can separate numeric and character variables with metadata
>> programs (so you end up with 2 transposes) and you can define
>> lengths/formats/etc. on the final dataset again with a metadata
>> program. Using OPTIONS COMPRESS it won't matter that your character
>> length is something absurd for most of your variables. Since you said
>> you have a bunch of enormous datasets, you can just multithread by
>> hand - run each of the above steps once for each dataset
>> simultaneously, or 8 at a time or whatever your CPU will manage, and
>> then merge/set/whatever the resulting datasets together (I guess merge
>> but not sure I fully comprehend your data structure).
>>
>> Honestly though can't you just run your transpose using prefix
>> multiple times simultaneously? IE, 8 sas sessions, each running a
>> proc transpose and whatnot on a different CPU? Then output the
>> resulting datasets to a master directory [or use USER libname in all 8
>> sessions to have one concurrent working directory, though that could
>> be messy].
>>
>> -Joe
>>
>> On Tue, Nov 8, 2011 at 10:56 AM, Bolotin Yevgeniy
>> <YBolotin@schools.nyc.gov> wrote:
>>> "wildly different" is an understatement (numerics of various precision,
>>> characters with various lengths and formats, etc)
>>>
>>> Also not named value1-9 for the most part
>>>
>>> -----Original Message-----
>>> From: Data _null_; [mailto:iebupdte@gmail.com]
>>> Sent: Tuesday, November 08, 2011 11:53 AM
>>> To: Bolotin Yevgeniy
>>> Cc: SAS-L@listserv.uga.edu
>>> Subject: Re: PROC TRANSPOSE multithreading?
>>>
>>> Are the variables VALUE1-VALUE9 + many all the same data type?
>>>
>>> It looks like ID is unique or should be?
>>>
>>>
>>> On Tue, Nov 8, 2011 at 10:11 AM, Bolotin Yevgeniy
>>> <YBolotin@schools.nyc.gov> wrote:
>>>> (warning: big email, and has nothing to do with multithreading. If you
>>>> have ideas about the multithreading though, send them in!)
>>>>
>>>>
>>>> In this case:
>>>>
>>>> HAVE:
>>>> Dataset A: Id year value1 value2 value3;
>>>> Dataset B: Id year value4 value5 value6;
>>>> Dataset C: Id year value7 value8 value9;
>>>> ...
>>>>
>>>>
>>>> Dataset WANT: Id value1_2007 value1_2008 value1_2009 ... value2_2007
>>>> value2_2008 ... value9_2007 value9_2008 ... ;
>>>> (on the order of 2,000 output variables)
>>>>
>>>>
>>>> (actual input is about 20 enormous datasets, and around 500 variables
>>>> split randomly between them)
>>>>
>>>>
>>>> I would like this to a) be slightly less slow than it is now (a lot of
>>>> the time is actually spent CPU crunching), and b) at least somewhat
>>>> maintainable - writing individually optimized code for each transpose
>>>> will probably kill me
>>>>
>>>>
>>>>
>>>> Current code: (probably inefficient as sin, but I haven't really found
>>>> anything that would work generically and be faster)
>>>>
>>>> %macro Utility_TransposeMultiple (dsn, by, id, vars, dsn_out);
>>>>
>>>> %local i;
>>>>
>>>> libname trmtrash 'E:\RECYCLER';
>>>>
>>>> %local dsn_cleaned;
>>>>
>>>> *if any dataset options exist, strip them off for sorting;
>>>> data _null;
>>>> dsn = "&dsn.";
>>>> pos = index(dsn, "(");
>>>> if pos > 1 then
>>>> call symput ("dsn_cleaned", substr(dsn,
>>>> 1, pos-1));
>>>> else
>>>> call symput ("dsn_cleaned", dsn);
>>>> run;
>>>>
>>>> data _null_;
>>>> put "&dsn_cleaned.";
>>>> run;
>>>>
>>>>
>>>> *sort the source data, for transposing and merging;
>>>> %Utility_Sort (&dsn_cleaned., &by.);
>>>>
>>>> *transpose each variable separately, and sort the resulting dataset;
>>>> %do i = 1 %to %Utility_CountWords(&vars.);
>>>> %let var = %scan(&vars., &i);
>>>>
>>>> proc transpose
>>>> data = &dsn.
>>>> out = trmtrash.&var. (drop = _name_)
>>>> prefix = &var._;
>>>>
>>>> by &by. ; *dbn dbn_accountable:;
>>>> id &id.;
>>>> var &var. ; *attend: ;
>>>> run;
>>>>
>>>> *sort the output, we will be merging by it later;
>>>> *the data is sorted anyway, so skip for now - this
>>>> causes issues when the libname is included;
>>>> %*%sort (trmtrash.&var., &by. );
>>>>
>>>> %end;
>>>>
>>>>
>>>> *merge the transposed data into &dsn_out;
>>>> data &dsn_out.;
>>>> merge
>>>> %do i = 1 %to %Utility_CountWords(&vars.);
>>>> %let var = %scan(&vars., &i);
>>>>
>>>> trmtrash.&var.
>>>>
>>>> %end;
>>>> ;
>>>> by &by.;
>>>> run;
>>>>
>>>> %Utility_AlphabetizeVars (dsn = &dsn_out., vars_to_keep_in_front
>>>> = &by.);
>>>>
>>>>
>>>> %do i = 1 %to %Utility_CountWords(&vars.);
>>>> %let var = %scan(&vars., &i);
>>>>
>>>> %Utility_DeleteDsn(dsn = &var. , lib = trmtrash);
>>>> %end;
>>>> %mend;
>>>>
>>>>
>>>> Usage:
>>>> %Utility_TransposeMultiple (
>>>> dsn = final.register
>>>> (where=(
>>>> year>=&gl_year_less_3.
>>>> ))
>>>> , by = id
>>>> , id = year
>>>> , vars =
>>>> varA
>>>> varB
>>>> varC
>>>> varD
>>>> , dsn_out = work.register_flat
>>>> );
>>>>
>>>>
>>>> After all the datasets are crunched through this, there is an additional
>>>> Data WANT;
>>>> Set /* (list of all the datasets produced by all the
>>>> transpose_multiple calls) */;
>>>> Run;
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Data _null_; [mailto:iebupdte@gmail.com]
>>>> Sent: Tuesday, November 08, 2011 10:47 AM
>>>> To: Bolotin Yevgeniy
>>>> Cc: SAS-L@listserv.uga.edu
>>>> Subject: Re: PROC TRANSPOSE multithreading?
>>>>
>>>> If it is a one off no gain but it you run the TRANSPOSE on a regular
>>>> basis perhaps it might be worth "programming" the transpose.
>>>>
>>>> Having not seen your code I'd expect most of the steps should be
>>>> optomized.
>>>>
>>>> On 11/8/11, Bolotin Yevgeniy <YBolotin@schools.nyc.gov> wrote:
>>>>> PROC TRANSPOSE does not multithread - on a multi-processor system, it
>>>> is
>>>>> using 100% of exactly one CPU and 0% of the rest.
>>>>>
>>>>> Is there a way to force it to utilize multiple processors? Speeding up
>>>>> transposition by a factor of 8 would go a long way towards speeding up
>>>>> something that already takes way too long.
>>>>>
>>>>> (using SAS 9.2 TS2M2)
>>>>>
>>>>
>>>
>>
>
|