Date: Thu, 24 Jun 2004 15:47:41 -0500
Reply-To: Jack Hamilton <JackHamilton@FIRSTHEALTH.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Jack Hamilton <JackHamilton@FIRSTHEALTH.COM>
Subject: Re: DoW Loop "Duh" experience - data set and infile N+1loops
Content-Type: text/plain; charset=us-ascii
I find that behavior useful when reporting data step processing counts -
I can put subsetting IFs in my code and know that the final PUT
statement will always be executed (not tested):
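data _null_;
   /* On the final (N+1)th pass END is already 1, so this PUT runs */
   /* before SET hits end of file and stops the step.              */
   if end then
      put recsin= highvalue= purpleblahblah=;
   set somestuff end=end;
   recsin + 1;
   if blahblah > 10;
   highvalue + 1;
   if color = 'purple';
   purpleblahblah + 1;
run;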
--
JackHamilton@FirstHealth.com
Manager, Technical Development
Metrics Department, First Health
West Sacramento, California USA
>>> "Choate, Paul@DDS" <pchoate@DDS.CA.GOV> 06/24/2004 1:23 PM >>>
Greetings SAS-Lers -
Apologies if this is old news, but while pondering Paul Dorfman & Ian's
"Do Loop of Whitlock" I had an "a-hah" experience the other day (or
maybe it was an "a-duh" experience): when reading an N-record data set,
SAS loops through the data step N+1 times.
The SET or INFILE end flag is set when the last (Nth) observation is
read, but the data step continues into a final (N+1)th pass until it
arrives back at the SET (or INPUT) statement. That final half-pass is
usually "invisible" because the data step terminates midstream, before
the default output. This allows final controls of the data step to be
placed at the top of the step rather than at the bottom, after the
final implied output.
data one;
   do i=1 to 5;
      output;
   end;
run;
data _null_;
   put _all_ ': ' @;
   set one end=eof;
   put _all_;
run;
eof=0 i=. _ERROR_=0 _N_=1 : eof=0 i=1 _ERROR_=0 _N_=1
eof=0 i=1 _ERROR_=0 _N_=2 : eof=0 i=2 _ERROR_=0 _N_=2
eof=0 i=2 _ERROR_=0 _N_=3 : eof=0 i=3 _ERROR_=0 _N_=3
eof=0 i=3 _ERROR_=0 _N_=4 : eof=0 i=4 _ERROR_=0 _N_=4
eof=0 i=4 _ERROR_=0 _N_=5 : eof=1 i=5 _ERROR_=0 _N_=5
eof=1 i=5 _ERROR_=0 _N_=6 :
This is the same with an input statement:
data _null_;
   do i=1 to 5;
      file 'one';
      put i;
   end;
run;
data _null_;
   put _all_ ': ' @;
   infile 'one' end=eof;
   input I $;
   put _all_;
run;
eof=0 I= _ERROR_=0 _N_=1 : eof=0 I=1 _ERROR_=0 _N_=1
eof=0 I= _ERROR_=0 _N_=2 : eof=0 I=2 _ERROR_=0 _N_=2
eof=0 I= _ERROR_=0 _N_=3 : eof=0 I=3 _ERROR_=0 _N_=3
eof=0 I= _ERROR_=0 _N_=4 : eof=0 I=4 _ERROR_=0 _N_=4
eof=0 I= _ERROR_=0 _N_=5 : eof=1 I=5 _ERROR_=0 _N_=5
eof=1 I= _ERROR_=0 _N_=6 :
Looking back at the "Flow of Action in the DATA Step" flowcharts in the
v5, v6, v8, and v9 manuals/docs, I see it's been documented at least
since 1985.
Thanks once more to Paul & Ian!
Paul Choate
DDS Data Extraction
(916) 654-2160
-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of
Paul M. Dorfman
Sent: Thursday, June 24, 2004 11:51 AM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: Help with By group processing, thanks
David,
Another Dorfmanism would be to convert this self-interleave into a
double-DoW. I think I have already posted it in reply to Toby, but I
would like to stress once more that although the double-DoW:
Data one ;
   do until ( last.inst ) ;
      set ttotal ;
      by inst ;
      totfund = sum (totfund, fund, 0) ;
   end ;
   do until ( last.inst ) ;
      set ttotal ;
      by inst ;
      output ;
   end ;
Run ;
looks less parsimonious than the self-interleave:
data one;
   set ttotal(in = summing)
       ttotal(in = merging);
   by inst;
   if summing then do;
      if first.inst then total_fund = 0;
      total_fund = sum(total_fund, fund, 0);
   end;
   if merging then output;
run;
the double-DoW is structurally superior, as it does not rely on
conditional logic *inside* a loop to make a decision. Instead of piling
up all the observations from two different streams into a single
by-pile and relying on IN= to split them, the double-DoW simply goes
through each by-group twice: first coming from one input stream, then
from the other. And because the boundaries of the double-DoW coincide
with those of the Data step itself, there is no need to initialize the
cumulative variable explicitly using first.inst.
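A minimal sketch of that last point, assuming a small made-up TTOTAL
(the data values and the PUT line at the top of the step are
illustrative additions, not part of the original code). The PUT shows
what TOTFUND holds on entry to each pass of the implied loop:

```sas
/* Hypothetical sample data, already sorted by INST */
data ttotal;
   input inst $ fund;
   datalines;
A 10
A 20
B 5
B .
B 7
;
run;

data one;
   /* TOTFUND is not retained, so it is reset to missing at the top */
   /* of every pass of the implied loop, i.e. once per by-group.    */
   put 'Top of step: ' _n_= totfund=;
   do until (last.inst);
      set ttotal;
      by inst;
      totfund = sum(totfund, fund, 0);
   end;
   do until (last.inst);
      set ttotal;
      by inst;
      output;
   end;
run;
```

If the reasoning above holds, the log should show TOTFUND missing at
the top of every pass (including the extra pass that ends when the
first SET runs out of observations), and ONE should carry totfund=30 on
both A rows and totfund=12 on all three B rows.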
As an exercise in curiosity, one may want to try foreseeing, without
testing, what will happen to the output if the second BY statement in
the double-DoW code is omitted or commented out, then run a test to see
if the guess was right. Now here is a different variation on the same
double-DoW theme:
data two ;
   do count = 1 by 1 until ( last.inst ) ;
      set ttotal ;
      by inst ;
      totfund = sum (totfund, fund, 0) ;
   end ;
   do _n_ = 1 to count ;
      set ttotal ;
      output ;
   end ;
run ;
This can be more efficient than the first variant if, in addition to
the sum, count is also needed. Note that in this case, the second BY
statement is omitted, yet the output is as expected. I will let curious
SAS-Lers ponder how this works.
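For anyone who would rather check a guess than take it on faith, here
is a small test harness. It is only a sketch: the TTOTAL values are
made up, and the PROC COMPARE step is an addition for checking the two
variants against each other, not part of either technique.

```sas
/* Hypothetical sample data, already sorted by INST */
data ttotal;
   input inst $ fund;
   datalines;
A 10
A 20
B 5
B .
B 7
;
run;

/* First variant: both loops end on LAST.INST */
data one;
   do until (last.inst);
      set ttotal;
      by inst;
      totfund = sum(totfund, fund, 0);
   end;
   do until (last.inst);
      set ttotal;
      by inst;
      output;
   end;
run;

/* Second variant: the first loop also counts the by-group's records */
data two;
   do count = 1 by 1 until (last.inst);
      set ttotal;
      by inst;
      totfund = sum(totfund, fund, 0);
   end;
   /* This loop is driven by COUNT rather than LAST.INST */
   do _n_ = 1 to count;
      set ttotal;
      output;
   end;
run;

/* TWO has the extra variable COUNT; the shared variables are compared */
proc compare base=one compare=two;
run;
```

To try the earlier exercise as well, comment out the second BY
statement in the step that builds ONE and rerun the comparison.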
Kind regards,
----------------
Paul M. Dorfman
Jacksonville, FL
----------------
> -----Original Message-----
> From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of
> David L. Cassell
> Sent: Wednesday, June 23, 2004 8:52 PM
> To: SAS-L@LISTSERV.UGA.EDU
> Subject: Re: Help with By group processing, thanks
>
> "Dunn, Toby" <tdunn@TEA.STATE.TX.US> sagely replied:
> > As a data step solution, consider this option:
> >
> > data one;
> > set ttotal (in = a)
> > ttotal (in = b);
> > by inst;
> >
> > if (a = 1) then do;
> > if first.inst then total_fund = 0;
> > if (fund ne .) then total_fund + fund; end;
> >
> > if (b = 1) then do;
> > output;
> > end;
> > run;
>
> A good solution. I have privately referred to this technique
> as the "Schreier Self-interleave", because I learned it from
> reading some of Howard's posts some time ago. IIRC, Howard
> once said that he learned it from Ian. But we can't name
> *everything* after Ian. :-) I think it is time to take the
> approach of the mathematicians; they couldn't name everything
> after Gauss, so eventually they had to name things after the
> *next* person to work with them.
>
> There is one important point I would like to make with this example.
> In terms of documentation and maintenance (by others), I find
> it is really helpful to use better names with my IN= options.
> So I might relabel this data step like so:
>
> data one;
> set ttotal(in = summing)
> ttotal(in = merging);
> by inst;
>
> if summing then do;
> if first.inst then total_fund = 0;
> total_fund = sum(total_fund, fund, 0);
> end;
>
> if merging then output;
>
> run;
>
>
> And, of course, one can always produce a Dorfmanism to turn
> the above do-group into a single computation without the need
> for grouping. But that kind of shoots down the whole 'make
> it more readable and maintainable' point. :-)
>
> Okay, okay, here's what I meant [but didn't bother to test]:
>
> data one;
> set ttotal(in = summing)
> ttotal(in = merging);
> by inst;
>
> if summing then total_fund = sum(total_fund*(^first.inst), fund, 0);
> if merging then output;
>
> run;
>
>
> HTH,
> David
> --
> David Cassell, CSC
> Cassell.David@epa.gov
> Senior computing specialist
> mathematical statistician
>