LISTSERV at the University of Georgia
Date:   Mon, 16 Apr 2012 11:58:02 -0700
Reply-To:   Mark Miller <mdhmiller@GMAIL.COM>
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   Mark Miller <mdhmiller@GMAIL.COM>
Subject:   Re: How to change many missing variables to 0 in a single data step
Comments:   To: Randy <randistan69@hotmail.com>
In-Reply-To:   <201204161852.q3GIOOZe032728@waikiki.cc.uga.edu>
Content-Type:   text/plain; charset=UTF-8

Randy,

Are you certain of that? That is not a generalizable assertion -- but it can be true in some circumstances. Zero usually means an observed value of 0.0 ... Yes? Missing values are not "observed" data, but the absence of such.

If you're sure, then carry on.
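
A minimal sketch of why the distinction matters for regression, using a made-up five-row data set (the names toy, toy0, x, and y are hypothetical and not from this thread). PROC REG leaves out observations with a missing regressor, whereas recoding missing to 0 keeps them in the fit as if 0 had actually been observed:

```sas
/* Hypothetical toy data: y = 2*x on the complete rows; x is missing on two rows */
data toy;
  input x y;
  datalines;
1 2
2 4
3 6
. 8
. 10
;
run;

/* Fit 1: rows with missing x are excluded from the analysis */
proc reg data=toy;
  model y = x;
run;
quit;

/* Recode missing x to 0, as proposed in the thread */
data toy0;
  set toy;
  if missing(x) then x = 0;
run;

/* Fit 2: all five rows are used, with x=0 treated as an observed value */
proc reg data=toy0;
  model y = x;
run;
quit;
```

On this toy data the first fit uses only the three complete rows and should recover y = 2x, while the second uses all five rows and should give a slope of about -1.18 with an intercept of about 7.41. Whether that shift is acceptable depends on whether 0 is a meaningful observed value for the variable in question.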

... Mark Miller

On Mon, Apr 16, 2012 at 11:52 AM, Randy <randistan69@hotmail.com> wrote:

> Dear All:
> For the purposes of regression, missing is indeed equal to 0.
> Randy
>
> On Mon, 16 Apr 2012 18:45:41 +0000, toby dunn <tobydunn@HOTMAIL.COM> wrote:
>
> >Let's see here:
> >
> >> data _null_;
> >>   set foo;
> >>   array v _numeric_;
> >>   do over v;
> >>     if missing(v) then v=0;
> >>   end;
> >> run;
> >>
> >> NOTE: There were 500000 observations read from the data set WORK.FOO.
> >> NOTE: DATA statement used (Total process time):
> >>       real time           0.63 seconds
> >>       user cpu time       0.56 seconds
> >>       system cpu time     0.07 seconds
> >
> >VS
> >
> >> data _null_;
> >>   set foo;
> >>   array v &var_names;
> >>   do over v;
> >>     if missing(v) then v=0;
> >>   end;
> >> run;
> >>
> >> NOTE: There were 500000 observations read from the data set WORK.FOO.
> >> NOTE: DATA statement used (Total process time):
> >>       real time           0.63 seconds
> >>       user cpu time       0.54 seconds
> >>       system cpu time     0.09 seconds
> >
> >So real time is the same, #2 wins by .02 on the user CPU time, while #1 wins by .02 on system CPU time.
> >
> >Two things come to mind: one, I don't see a clear winner, and two, who gives a flying rat's ass which one is faster with numbers this freakin close....
> >
> >In short, too many people after all these years are still hung up on speed... in reality they should be worried about readability and maintainability of the code.
> >Why? Because people coming behind you will spend more time reading and trying to understand and maintain your code than you did writing and testing the damn thing....
> >Unless there is a significant difference in time, why do we waste our efforts on eking out .02 in CPU time?
> >
> >Which is why I am still in favor of #1 over #2: it is easier to read and maintain.
> >
> >This:
> >
> >> proc stdize data=foo out=_null_ reponly missing=0; run;
> >>
> >> NOTE: No VAR statement is given. All numerical variables not named
> >>       elsewhere make up the first set of variables.
> >> NOTE: There were 500000 observations read from the data set WORK.FOO.
> >> NOTE: PROCEDURE STDIZE used (Total process time):
> >>       real time           0.74 seconds
> >>       user cpu time       0.66 seconds
> >>       system cpu time     0.09 seconds
> >
> >is the best so far, even if it takes a hair longer to run.
> >
> >Toby Dunn
> >
> >If you get thrown from a horse, you have to get up and get back on, unless you landed on a cactus; then you have to roll around and scream in pain.
> >
> >"Any idiot can face a crisis - it's day to day living that wears you out"
> >~ Anton Chekhov
> >
> >> Date: Mon, 16 Apr 2012 12:11:48 -0600
> >> From: friedegg2012@GMAIL.COM
> >> Subject: Re: How to change many missing variables to 0 in a single data step
> >> To: SAS-L@LISTSERV.UGA.EDU
> >>
> >> The generated IF statements do appear to be the quickest implementation for the given problem (~50 columns x ~500k rows). Here is some code to generate and compare the given solutions. I also expanded the miss2zero macro a little to work with non-standardized variable names through collection. It would fit nicely into a macro function sandwich (a la Mike Rhoads) to combine the compile and generate steps into a single call.
> >>
> >> /* simulate non-standardized variable names */
> >> proc sql noprint;
> >>   select distinct compress(Subsidiary,,'ka')
> >>     into :bar_arr separated by ' '
> >>     from sashelp.shoes;
> >>   %let bar_dim=&sqlobs;
> >> quit;
> >> NOTE: PROCEDURE SQL used (Total process time):
> >>       real time           0.01 seconds
> >>       user cpu time       0.00 seconds
> >>       system cpu time     0.00 seconds
> >>
> >> /* generate 53x500,000 sample data with 40% random missing */
> >> data foo;
> >>   call streaminit(12345);
> >>   array bar[&bar_dim] &bar_arr;
> >>   do id=1 to 500000;
> >>     do _n_=1 to &bar_dim;
> >>       bar[_n_]=rand('uniform');
> >>       if rand('table',.6,.4) > 1 then call missing(bar[_n_]);
> >>     end;
> >>     output;
> >>   end;
> >> run;
> >>
> >> NOTE: The data set WORK.FOO has 500000 observations and 54 variables.
> >> NOTE: DATA statement used (Total process time):
> >> 2 The SAS System
> >> 10:43 Monday, April 16, 2012
> >>
> >>       real time           3.16 seconds
> >>       user cpu time       2.53 seconds
> >>       system cpu time     0.60 seconds
> >>
> >> /* test variable imputation methods
> >>    will use missing() instead of =. to account for all missing
> >>    values i.e. =.Z */
> >>
> >> *array method using _numeric_ variable list;
> >> data _null_;
> >>   set foo;
> >>   array v _numeric_;
> >>   do over v;
> >>     if missing(v) then v=0;
> >>   end;
> >> run;
> >>
> >> NOTE: There were 500000 observations read from the data set WORK.FOO.
> >> NOTE: DATA statement used (Total process time):
> >>       real time           0.63 seconds
> >>       user cpu time       0.56 seconds
> >>       system cpu time     0.07 seconds
> >>
> >> *array method using collected variable list;
> >> proc sql noprint;
> >>   select name
> >>     into :var_names separated by ' '
> >>     from sashelp.vcolumn
> >>     where libname='WORK' and memname='FOO';
> >> quit;
> >> NOTE: PROCEDURE SQL used (Total process time):
> >>       real time           0.00 seconds
> >>       user cpu time       0.00 seconds
> >>       system cpu time     0.00 seconds
> >>
> >> data _null_;
> >>   set foo;
> >>   array v &var_names;
> >>   do over v;
> >>     if missing(v) then v=0;
> >>   end;
> >> run;
> >>
> >> NOTE: There were 500000 observations read from the data set WORK.FOO.
> >> NOTE: DATA statement used (Total process time):
> >>       real time           0.63 seconds
> >>       user cpu time       0.54 seconds
> >>       system cpu time     0.09 seconds
> >>
> >> *proc stdize with reponly option, my personal favorite response to
> >>  this topic;
> >> proc stdize data=foo out=_null_ reponly missing=0; run;
> >>
> >> NOTE: No VAR statement is given. All numerical variables not named
> >>       elsewhere make up the first set of variables.
> >> NOTE: There were 500000 observations read from the data set WORK.FOO.
> >> NOTE: PROCEDURE STDIZE used (Total process time):
> >>       real time           0.74 seconds
> >>       user cpu time       0.66 seconds
> >>       system cpu time     0.09 seconds
> >>
> >> *macro with if statements for non-standardized variable names;
> >> %macro impute_missing(action= ,libname= ,memname= ,type=num
> >>                       ,prefix=n ,impute_value=0);
> >>   %if &action = compile %then %do;
> >>     data _null_;
> >>       do i=1 by 1 until(done);
> >>         set sashelp.vcolumn end=done;
> >>         where libname="%upcase(&libname)" and
> >>               memname="%upcase(&memname)" and type="%lowcase(&type)";
> >>         call symputx(cats("g_m2z_&prefix",i),name,'g');
> >>       end;
> >>       call symputx("g_m2z_&prefix.0",i,'g');
> >>     run;
> >>   %end;
> >>   %if &action = generate %then %do;
> >>     %do i=1 %to &&g_m2z_&prefix.0;
> >>       if missing(&&&g_m2z_&prefix.&i) then &&&g_m2z_&prefix.&i=0;
> >>     %end;
> >>   %end;
> >> %mend;
> >>
> >> %impute_missing(action=compile ,libname=WORK ,memname=FOO
> >>                 ,type=num ,prefix=n);
> >>
> >> NOTE: The query as specified involves ordering by an item that doesn't
> >>       appear in its SELECT clause.
> >> NOTE: There were 54 observations read from the data set SASHELP.VCOLUMN.
> >>       WHERE (libname='WORK') and (memname='FOO') and (type='num');
> >> NOTE: DATA statement used (Total process time):
> >>       real time           0.08 seconds
> >>       user cpu time       0.06 seconds
> >>       system cpu time     0.02 seconds
> >>
> >> data _null_;
> >>   set foo;
> >>   %impute_missing(action=generate ,prefix=n ,impute_value=0);
> >> run;
> >>
> >> NOTE: There were 500000 observations read from the data set WORK.FOO.
> >> NOTE: DATA statement used (Total process time):
> >>       real time           0.28 seconds
> >>       user cpu time       0.18 seconds
> >>       system cpu time     0.09 seconds

