LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (November 2011, week 2)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Tue, 8 Nov 2011 19:14:03 -0600
Reply-To:     "Data _null_;" <iebupdte@GMAIL.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         "Data _null_;" <iebupdte@GMAIL.COM>
Subject:      Re: PROC TRANSPOSE multithreading?
Comments: To: Bolotin Yevgeniy <YBolotin@schools.nyc.gov>
In-Reply-To:  <CAEZCysuKq3Li+M_6UDRn9gn86PrJxTpLGc4evG+XKYQiYVEF7w@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1

I made a mistake the SET should be MERGE. and add YEAR to BY statement.

*** Create really wide data; data big; do until(last.id); MERGE a b c; by id YEAR; %inc FT36F001 / source2; end; run; proc contents

On Tue, Nov 8, 2011 at 12:33 PM, Data _null_; <iebupdte@gmail.com> wrote: > If I understand correctly the only thing we need from the data is year. > > See if you think this can work for you. You will need to have the > input data sets sorted or indexed by ID. > > proc plan seed=921753645; > factors id=5 of 20 year=7 of 10 random / noprint; > output out=a year nvals=(2001 to 2010); > run; > factors id=4 of 20 year=7 of 10 random / noprint; > output out=b year nvals=(2001 to 2010); > run; > factors id=8 of 20 year=7 of 10 random / noprint; > output out=c year nvals=(2001 to 2010); > run; > quit; > > > data a; > merge a(in=in1) sashelp.heart; > if not in1 then stop; > run; > proc sort; by id year; run; > data b; > merge b(in=in1) sashelp.shoes; > if not in1 then stop; > run; > proc sort; by id year; run; > data c; > merge c(in=in1) sashelp.iris; > if not in1 then stop; > run; > proc sort; by id year; run; > > > *** find the range of years; > data allV / view=allV; > set a(keep=year) b(keep=year) c(keep=year); > run; > proc sort data=allV nodupkey out=years; > by year; > run; > proc print; > run; > proc summary data=years; > var year; > output out=years2 min=min max=max; > run; > > *** Get variable info; > proc sql noprint; > create table attribs as > select libname, memname, name, type, length, format > from dictionary.columns > where libname eq 'WORK' and memname in('A' 'B' 'C'); > quit; > proc print; > run; > *** Gen the ARRAY and related statements.; > filename FT36F001 temp; > data _null_; > file FT36F001; > if _n_ eq 1 then set years2; > set attribs; > if upcase(name) in('ID' 'YEAR') then delete; > length len $12; > len = ifC(type eq: 'c',cats('$',length),cats(length)); > put +3 'Array _' name +(-1) '[' min 4. ':' max 4.'] ' len @64 name > +(-1) '_' min 4. '-' name +(-1) '_' max 4. ';'; > if not missing(format) then put +3 'Format ' name +(-1) '_:' format ';'; > put +3 'Drop ' name +(-1) ';'; > put +3 '_' name '[year]=' name +(-1) ';'; > run; > > *** Create really wide data; > data big; > do until(last.id); > set a b c; > by id; > %inc FT36F001 / source2; > end; > run; > proc contents varnum; > run; > > > > On 11/8/11, Bolotin Yevgeniy <YBolotin@schools.nyc.gov> wrote: >>> Honestly though can't you just run your transpose using prefix multiple >>> times simultaneously? >> >> mostly because this chunk of code runs 6-8 hours into a titanic batch job, >> and the preceeding code generates most of the inputs, so running things by >> hand doesn't really work >> >> not to mention that running 8 separate transposes at once (on >> multi-gigabyte, multi-million-record datasets) will probably hose hard drive >> performance (at least much more so than having 8 processors crunching the >> same dataset would) >> >> >> >> the separating data into individual datasets does sound like it may help, >> will try that when I have time to see if it improves performance any, >> thanks! >> >> >> >> >> >> -----Original Message----- >> From: Joe Matise [mailto:snoopy369@gmail.com] >> Sent: Tuesday, November 08, 2011 12:10 PM >> To: Bolotin Yevgeniy >> Cc: SAS-L@listserv.uga.edu >> Subject: Re: PROC TRANSPOSE multithreading? >> >> Not 99.44% sure but the way I'd approach it: >> >> * Sort by ID >> * Output each variable as a separate row, just id+varname+value where >> varname contains the 2007 or whatever >> * Transpose that dataset BY ID, IDNAME=varname. >> >> You can separate numeric and character variables with metadata >> programs (so you end up with 2 transposes) and you can define >> lengths/formats/etc. on the final dataset again with a metadata >> program. Using OPTIONS COMPRESS it won't matter that your character >> length is something absurd for most of your variables. Since you said >> you have a bunch of enormous datasets, you can just multithread by >> hand - run each of the above steps once for each dataset >> simultaneously, or 8 at a time or whatever your CPU will manage, and >> then merge/set/whatever the resulting datasets together (I guess merge >> but not sure I fully comprehend your data structure). >> >> Honestly though can't you just run your transpose using prefix >> multiple times simultaneously? IE, 8 sas sessions, each running a >> proc transpose and whatnot on a different CPU? Then output the >> resulting datasets to a master directory [or use USER libname in all 8 >> sessions to have one concurrent working directory, though that could >> be messy]. >> >> -Joe >> >> On Tue, Nov 8, 2011 at 10:56 AM, Bolotin Yevgeniy >> <YBolotin@schools.nyc.gov> wrote: >>> "wildly different" is an understatement (numerics of various precision, >>> characters with various lengths and formats, etc) >>> >>> Also not named value1-9 for the most part >>> >>> -----Original Message----- >>> From: Data _null_; [mailto:iebupdte@gmail.com] >>> Sent: Tuesday, November 08, 2011 11:53 AM >>> To: Bolotin Yevgeniy >>> Cc: SAS-L@listserv.uga.edu >>> Subject: Re: PROC TRANSPOSE multithreading? >>> >>> Are the variables VALUE1-VALUE9 + many all the same data type? >>> >>> It looks like ID is unique or should be? >>> >>> >>> On Tue, Nov 8, 2011 at 10:11 AM, Bolotin Yevgeniy >>> <YBolotin@schools.nyc.gov> wrote: >>>> (warning: big email, and has nothing to do with multithreading. If you >>>> have ideas about the multithreading though, send them in!) >>>> >>>> >>>> In this case: >>>> >>>> HAVE: >>>> Dataset A: Id year value1 value2 value3; >>>> Dataset B: Id year value4 value5 value6; >>>> Dataset C: Id year value7 value8 value9; >>>> ... >>>> >>>> >>>> Dataset WANT: Id value1_2007 value1_2008 value1_2009 ... value2_2007 >>>> value2_2008 ... value9_2007 value9_2008 ... ; >>>> (on the order of 2,000 output variables) >>>> >>>> >>>> (actual input is about 20 enormous datasets, and around 500 variables >>>> split randomly between them) >>>> >>>> >>>> I would like this to a) be slightly less slow than it is now (a lot of >>>> the time is actually spent CPU crunching), and b) at least somewhat >>>> maintainable - writing individually optimized code for each transpose >>>> will probably kill me >>>> >>>> >>>> >>>> Current code: (probably inefficient as sin, but I haven't really found >>>> anything that would work generically and be faster) >>>> >>>> %macro Utility_TransposeMultiple (dsn, by, id, vars, dsn_out); >>>> >>>> %local i; >>>> >>>> libname trmtrash 'E:\RECYCLER'; >>>> >>>> %local dsn_cleaned; >>>> >>>> *if any dataset options exist, strip them off for sorting; >>>> data _null; >>>> dsn = "&dsn."; >>>> pos = index(dsn, "("); >>>> if pos > 1 then >>>> call symput ("dsn_cleaned", substr(dsn, >>>> 1, pos-1)); >>>> else >>>> call symput ("dsn_cleaned", dsn); >>>> run; >>>> >>>> data _null_; >>>> put "&dsn_cleaned."; >>>> run; >>>> >>>> >>>> *sort the source data, for transposing and merging; >>>> %Utility_Sort (&dsn_cleaned., &by.); >>>> >>>> *transpose each variable separately, and sort the resulting dataset; >>>> %do i = 1 %to %Utility_CountWords(&vars.); >>>> %let var = %scan(&vars., &i); >>>> >>>> proc transpose >>>> data = &dsn. >>>> out = trmtrash.&var. (drop = _name_) >>>> prefix = &var._; >>>> >>>> by &by. ; *dbn dbn_accountable:; >>>> id &id.; >>>> var &var. ; *attend: ; >>>> run; >>>> >>>> *sort the output, we will be merging by it later; >>>> *the data is sorted anyway, so skip for now - this >>>> causes issues when the libname is included; >>>> %*%sort (trmtrash.&var., &by. ); >>>> >>>> %end; >>>> >>>> >>>> *merge the transposed data into &dsn_out; >>>> data &dsn_out.; >>>> merge >>>> %do i = 1 %to %Utility_CountWords(&vars.); >>>> %let var = %scan(&vars., &i); >>>> >>>> trmtrash.&var. >>>> >>>> %end; >>>> ; >>>> by &by.; >>>> run; >>>> >>>> %Utility_AlphabetizeVars (dsn = &dsn_out., vars_to_keep_in_front >>>> = &by.); >>>> >>>> >>>> %do i = 1 %to %Utility_CountWords(&vars.); >>>> %let var = %scan(&vars., &i); >>>> >>>> %Utility_DeleteDsn(dsn = &var. , lib = trmtrash); >>>> %end; >>>> %mend; >>>> >>>> >>>> Usage: >>>> %Utility_TransposeMultiple ( >>>> dsn = final.register >>>> (where=( >>>> year>=&gl_year_less_3. >>>> )) >>>> , by = id >>>> , id = year >>>> , vars = >>>> varA >>>> varB >>>> varC >>>> varD >>>> , dsn_out = work.register_flat >>>> ); >>>> >>>> >>>> After all the datasets are crunched through this, there is an additional >>>> Data WANT; >>>> Set /* (list of all the datasets produced by all the >>>> transpose_multiple calls) */; >>>> Run; >>>> >>>> >>>> >>>> -----Original Message----- >>>> From: Data _null_; [mailto:iebupdte@gmail.com] >>>> Sent: Tuesday, November 08, 2011 10:47 AM >>>> To: Bolotin Yevgeniy >>>> Cc: SAS-L@listserv.uga.edu >>>> Subject: Re: PROC TRANSPOSE multithreading? >>>> >>>> If it is a one off no gain but it you run the TRANSPOSE on a regular >>>> basis perhaps it might be worth "programming" the transpose. >>>> >>>> Having not seen your code I'd expect most of the steps should be >>>> optomized. >>>> >>>> On 11/8/11, Bolotin Yevgeniy <YBolotin@schools.nyc.gov> wrote: >>>>> PROC TRANSPOSE does not multithread - on a multi-processor system, it >>>> is >>>>> using 100% of exactly one CPU and 0% of the rest. >>>>> >>>>> Is there a way to force it to utilize multiple processors? Speeding up >>>>> transposition by a factor of 8 would go a long way towards speeding up >>>>> something that already takes way too long. >>>>> >>>>> (using SAS 9.2 TS2M2) >>>>> >>>> >>> >> >


Back to: Top of message | Previous page | Main SAS-L page