Date: Thu, 24 Jun 2004 17:54:10 -0400
Reply-To: sashole@bellsouth.net
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "Paul M. Dorfman" <sashole@BELLSOUTH.NET>
Organization: Sashole of Florida
Subject: Re: DoW Loop "Duh" experience - data set and infile N+1loops
In-Reply-To: <s0daf7b3.017@dgrm03.firsthealth.com>
Content-Type: text/plain; charset="us-ascii"
Jack,
This gotcha is exactly the reason why using an explicit loop is preferable:
it obviates the problem to start with. Consider that you have
Data a ;
input x ;
cards ;
1
2
3
Run ;
and you want to:
1) print 'start' in the log before the 1st record is processed
2) print sum of x for all x not equal to 3
3) print 'stop' after the last record is processed
Usually the newbie stream of consciousness leads this way (wrong!):
Data _null_ ;
if _n_ = 1 then put 'start' ;
set a end = end ;
if x ne 3 ;
sum_x ++ x ;
if end then do ;
put sum_x = ;
put 'stop' ;
end ;
Run ;
Why wrong? Because at x=3, the subsetting IF fails and passes control
straight to the top of the step, skipping both the sum statement and the
test for END. The step then stops once control hits SET trying to read from
an empty buffer. As a result, only START will be printed. Of course, that
can be repaired by eliminating the subsetting IF and coding
If x ne 3 then sum_x ++ x ;
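Spelled out in full, that repair might look like the sketch below. It reuses
data set A from above and has not been run here; with the three records
shown, the log should read start, sum_x=3, stop.

```sas
Data _null_ ;
   if _n_ = 1 then put 'start' ;
   set a end = end ;
   * IF-THEN only skips the sum, so control still reaches the END test ;
   if x ne 3 then sum_x ++ x ;
   if end then do ;
      put sum_x = ;
      put 'stop' ;
   end ;
Run ;
```

The END test is now reached on every record, including the last one, which
is why the totals print no matter which records are skipped.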
However, as you sagely pointed out, a more robust solution is to place all
file-start and file-end references before the file-reading verb (i.e. input,
set, update, merge, or modify), and it will print as requested:
Data _null_ ;
if _n_ = 1 then put 'start' ;
if end then do ;
put sum_x = ;
put 'stop' ;
end ;
set a end = end ;
if x ne 3 ;
sum_x ++ x ;
Run ;
But now the step looks rather contrived, and any non-SAS folk should wonder
why the heck the test for the end-of-file precedes the first time a record
is read from the file. Once I wrote a step like above (only with MERGE
counting the number of matches) for a friend of mine, a COBOL guy who
actually likes to understand things. After my explanation how it works and
why the statements were where they were, he asked why the same thing could
not be done in SAS just as follows:
1) print 'start'
2) read the file in a loop, skip unwanted records, and sum
3) print the value of sum_x
4) print 'stop'
5) stop
I said of course it can be done in SAS if you like it better this way:
Data _null_ ;
put 'start' ;
do until ( end ) ;
set a end = end ;
if x = 3 then continue ;
sum_x ++ x ;
end ;
put sum_x = ;
put 'stop' ;
stop ;
Run ;
Note that above, the STOP statement is critical; otherwise control will be
passed to the top of the *implied* loop, and START will be printed a
second time before control hits SET and stops the step - i.e. the feature
you made use of now has to be avoided. Now, though,
1) the sequence of the instructions in the step corresponds to the sequence
of the things being done as we think of them
2) any non-SAS person with common sense will readily understand what happens
3) the code is more efficient because it eschews IFs executed for each
record
4) the code is more robust
Again, all the improvements were achieved solely by moving the static
instructions (all PUTs) outside the loop and making them unconditional. But
then, it should be no surprise, because the latter simply makes the code
conform to one of the basic programming canons.
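The MERGE step mentioned earlier (counting the number of matches) can be
laid out the same way. What follows is only a sketch under assumptions: the
data sets ONE_SIDE and OTHER_SIDE, the BY variable KEY, and the counter
N_MATCH are hypothetical names, both inputs are assumed to be sorted by KEY,
and the step has not been run here.

```sas
Data _null_ ;
   put 'start' ;
   do until ( end ) ;
      * ONE_SIDE, OTHER_SIDE and KEY are hypothetical names ;
      merge one_side   ( in = in_one   )
            other_side ( in = in_other )
            end = end ;
      by key ;
      * count merged records to which both inputs contribute ;
      if in_one and in_other then n_match ++ 1 ;
   end ;
   put n_match = ;
   put 'stop' ;
   stop ;
Run ;
```

As before, the PUTs sit outside the loop and run unconditionally, and STOP
keeps control from going back to the top of the implied loop.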
Kind regards,
----------------
Paul M. Dorfman
Jacksonville, FL
----------------
> -----Original Message-----
> From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On
> Behalf Of Jack Hamilton
> Sent: Thursday, June 24, 2004 4:48 PM
> To: SAS-L@LISTSERV.UGA.EDU
> Subject: Re: DoW Loop "Duh" experience - data set and infile N+1loops
>
> I find that behavior useful when reporting processing data
> step counts - I can put subsetting IFs in my code, and know
> that the final PUT statement will always be executed (not tested):
>
> data _null_;
>
> if end then
> put recsin= blahblah= purpleblahblah=;
>
> set somestuff end=end;
>
> recsin + 1;
>
> if blahblah > 10;
>
> highvalue + 1;
>
> if color = 'purple';
>
> purpleblahblah + 1;
>
> run;
>
>
>
> --
> JackHamilton@FirstHealth.com
> Manager, Technical Development
> Metrics Department, First Health
> West Sacramento, California USA
>
> >>> "Choate, Paul@DDS" <pchoate@DDS.CA.GOV> 06/24/2004 1:23 PM >>>
> Greeting SAS-Lers-
>
> Apologies if this is old news, but while pondering Paul
> Dorfman & Ian's "Do Loop of Whitlock" I had an "a-hah"
> experience the other day (or maybe it was an "a-duh"
> experience) - When reading an N record data set, SAS loops
> through the data-step N+1 times.
>
> The set or infile end flag is set when the last (Nth)
> observation is reached, but the data step continues into a
> final N+1 pass until it arrives back at the set
> statement. That final half-pass is usually "invisible"
> because of the midstream termination of the data step before
> the default output. This allows final controls of the data
> step to be placed at the top of the step rather than at the
> bottom, and after the final implied output.
>
> data one;
> do i=1 to 5;
> output;
> end;
>
> data _null_;
> put _all_ ': '@;
> set one end=eof;
> put _all_ ;
> run;
>
> eof=0 i=. _ERROR_=0 _N_=1 : eof=0 i=1 _ERROR_=0 _N_=1
> eof=0 i=1 _ERROR_=0 _N_=2 : eof=0 i=2 _ERROR_=0 _N_=2
> eof=0 i=2 _ERROR_=0 _N_=3 : eof=0 i=3 _ERROR_=0 _N_=3
> eof=0 i=3 _ERROR_=0 _N_=4 : eof=0 i=4 _ERROR_=0 _N_=4
> eof=0 i=4 _ERROR_=0 _N_=5 : eof=1 i=5 _ERROR_=0 _N_=5
> eof=1 i=5 _ERROR_=0 _N_=6 :
>
>
> This is the same with an input statement:
>
> data _null_;
> do i=1 to 5;
> file 'one';
> put i;
> end;
>
> data _null_;
> put _all_ ': '@;
> infile 'one' end=eof;
> input I $ ;
> put _all_ ;
> run;
>
> eof=0 I= _ERROR_=0 _N_=1 : eof=0 I=1 _ERROR_=0 _N_=1
> eof=0 I= _ERROR_=0 _N_=2 : eof=0 I=2 _ERROR_=0 _N_=2
> eof=0 I= _ERROR_=0 _N_=3 : eof=0 I=3 _ERROR_=0 _N_=3
> eof=0 I= _ERROR_=0 _N_=4 : eof=0 I=4 _ERROR_=0 _N_=4
> eof=0 I= _ERROR_=0 _N_=5 : eof=1 I=5 _ERROR_=0 _N_=5
> eof=1 I= _ERROR_=0 _N_=6 :
>
> Looking back at the "Flow of Action in the DATA Step"
> flowcharts in the v5, v6, v8, and v9 manual/docs I see it's
> been documented at least since 1985.
> Thanks once more to Paul & Ian!
>
> Paul Choate
> DDS Data Extraction
> (916) 654-2160
>
> -----Original Message-----
> From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On
> Behalf Of Paul M. Dorfman
> Sent: Thursday, June 24, 2004 11:51 AM
> To: SAS-L@LISTSERV.UGA.EDU
> Subject: Re: Help with By group processing, thanks
>
> David,
>
> Another Dorfmanism would be to convert this self-interleave
> into a double-DoW. I think I have already posted it in reply
> to Toby, but I would like to stress another time that
> although the double-DoW:
>
> Data one ;
> do until ( last.inst ) ;
> set ttotal ;
> by inst ;
> totfund = sum (totfund, fund, 0) ;
> end ;
> do until ( last.inst ) ;
> set ttotal ;
> by inst ;
> output ;
> end ;
> Run ;
>
> looks less parsimonious than the self-interleave:
>
> data one;
> retain total_fund; /* not retained by default; keep the running sum */
> set ttotal(in = summing)
> ttotal(in = merging);
> by inst;
> if summing then do;
> if first.inst then total_fund = 0;
> total_fund = sum(total_fund, fund, 0);
> end;
> if merging then output;
> run;
>
> the double-DoW is structurally superior, as it does not rely
> on conditional logic *inside* a loop to make a decision.
> Instead of piling up all the observations from two different
> streams into a single by-pile and relying on IN= to split
> them, the double-DoW simply goes through each by-group
> twice: first coming from one input stream, then - from the other.
> And because the boundaries of the double-DoW coincide with
> those of the Data step itself, there is no need to initialize
> the cumulative variable explicitly using first.inst.
>
> As an exercise in curiosity, one may want to try
> foreseeing, without testing, what will happen to the output
> if the second BY statement in the double-DoW code is omitted
> or commented out, then run a test to see if the guess was
> right. Now here is a different variation on the same double-DoW
> theme:
>
> data two ;
> do count = 1 by 1 until ( last.inst ) ;
> set ttotal ;
> by inst ;
> totfund = sum (totfund, fund, 0) ;
> end ;
> do _n_ = 1 to count ;
> set ttotal ;
> output ;
> end ;
> run ;
>
> This can be more efficient than the first variant if, in
> addition to the sum, count is also needed. Note that in this
> case, the second BY statement is omitted, yet the output is
> as expected. I will let curious SAS-Lers ponder how this works.
>
> Kind regards,
> ----------------
> Paul M. Dorfman
> Jacksonville, FL
> ----------------
>
> > -----Original Message-----
> > From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of
> > David L. Cassell
> > Sent: Wednesday, June 23, 2004 8:52 PM
> > To: SAS-L@LISTSERV.UGA.EDU
> > Subject: Re: Help with By group processing, thanks
> >
> > "Dunn, Toby" <tdunn@TEA.STATE.TX.US> sagely replied:
> > > As a data step solution, consider this option:
> > >
> > > data one;
> > > set ttotal (in = a)
> > > ttotal (in = b);
> > > by inst;
> > >
> > > if (a = 1) then do;
> > > if first.inst then total_fund = 0;
> > > if (fund ne .) then total_fund + fund; end;
> > >
> > > if (b = 1) then do;
> > > output;
> > > end;
> > > run;
> >
> > A good solution. I have privately referred to this technique as the
> > "Schreier Self-interleave", because I learned it from reading some of
> > Howard's posts some time ago. IIRC, Howard once said that he learned
> > it from Ian. But we can't name *everything* after Ian. :-) I think it
> > is time to take the approach of the mathematicians; they couldn't name
> > everything after Gauss, so eventually they had to name things after
> > the *next* person to work with them.
> >
> > There is one important point I would like to make with this example.
> > In terms of documentation and maintenance (by others), I find it is
> > really helpful to use better names with my IN= options.
> > So I might relabel this data step like so:
> >
> > data one;
> > retain total_fund; /* not retained by default; keep the running sum */
> > set ttotal(in = summing)
> > ttotal(in = merging);
> > by inst;
> >
> > if summing then do;
> > if first.inst then total_fund = 0;
> > total_fund = sum(total_fund, fund, 0);
> > end;
> >
> > if merging then output;
> >
> > run;
> >
> >
> > And, of course, one can always produce a Dorfmanism to turn the above
> > do-group into a single computation without the need for grouping. But
> > that kind of shoots down the whole 'make it more readable and
> > maintainable' point. :-)
> >
> > Okay, okay, here's what I meant [but didn't bother to test]:
> >
> > data one;
> > retain total_fund; /* not retained by default; keep the running sum */
> > set ttotal(in = summing)
> > ttotal(in = merging);
> > by inst;
> >
> > if summing then total_fund = sum(total_fund*(^first.inst), fund, 0);
> > if merging then output;
> >
> > run;
> >
> >
> > HTH,
> > David
> > --
> > David Cassell, CSC
> > Cassell.David@epa.gov
> > Senior computing specialist
> > mathematical statistician
> >
>