Date: Sat, 14 Mar 2009 17:01:17 -0400
Reply-To: Paul Dorfman <sashole@BELLSOUTH.NET>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Paul Dorfman <sashole@BELLSOUTH.NET>
Subject: Re: SUM statement paradox (bug)?
Tim,
I do not see how the superfluity of explicitly retaining an auto-retained
variable makes it excusable for the compiler to fail to initialize a SUM
statement variable to 0 when the documentation clearly and unambiguously
says that that is what it does without any ifs, ands, or buts.
Of course it is natural (and logical) that any variable identified by the
compiler in a data set descriptor is initially set to missing, and if it is
not subsequently populated at the execution time, they will remain null,
which is easy to prove by, say,
data _null_ ;
put _all_ ;
stop ;
set sashelp.class ;
run ;
----------------------------------------------------
Name= Sex= Age=. Height=. Weight=. _ERROR_=0 _N_=1
However, the SUM statement is a compile-time instruction exactly equivalent
to
retain VAR 0 ;
(plus a run-time instruction to increment VAR as ordered). Yet when the
compiler sees this RETAIN, it does populate VAR with 0 - as it should
according to the documentation. Why, then, I ask, this compile-time
instruction is obeyed whilst the same exact compile-time instruction
contained in the SUM statement is ignored - as it should not be, according
to the documentation?
In other words, if you think it natural for the step (with HUH containing a
numeric VAR):
data _null_ ;
put var = @ ;
var + 1 ;
set huh ;
run ;
to print var=. in _N_=1 iteration, then you should likewise think it
natural for the step:
data _null_ ;
put var = @ ;
retain var 0 ;
set huh ;
run ;
to print var=. in _N_=1 iteration. Does it? Nope, it prints var=0 instead.
Note that it is the compile-time equivalence of the SUM statement and
RETAIN that is emphasized by the only germane documentation piece I can
locate, viz., "The variable's value is retained from one iteration to the
next, as if it had appeared in a RETAIN statement."
The reason I deem such a behavior a bug is two-fold. First, it directly
contradicts the documentation. Second, I posit that the expectation of a
the SUM statement variable to be initialized to 0 at compile is so
ingrained in the mind of SAS programmers that it is unreasonable for them
to expect a different result. I am a fairly seasoned SUM statement
(ab)user, and this is the first time in 20+ years I have encountered such a
thing.
However, I may be wrong and the only one who expects the SUM statement
variable to be initialized to 0 under any circumstances. Y'all who expect a
different outcome, would you please raise your hands?
Why do I think the point is of import and hence dwell on it? Certainly not
because I derive any pleasure from eureka-ing a bug and rubbing it in to
SAS; I just want the software to behave as expected according to the
documentation. Thus, I had fully expected that the variable in a SUM
statement I had coded would produce a zero if run-time program control
never hit it during the step's execution. My logic was akin to:
if 0 then set model ;
...
if A then a_cnt + 1 ;
else if B then b_cnt + 1 ;
else VAR + 1 ;
MODEL contains VAR (and other variables), and its descriptor is read first
thing in the step to impose proper sequence and attributes (mainly,
formats) on the variables written to a delimited file subsequently
submitted to the SQL loader. VAR is one of those variables counting the
number of "exception" records not conforming to the conditions A and B.
According to my understanding of the SUM statement, I had fully expected it
to end up as 0 at the end of the step if no exceptions were found, and that
zero - to be written to the delimited file. Yet it produced a missing
value, which caused the ensuing production load to go belly up, since VAR
is defined in the production DW table as not null.
I could not drop VAR from MODEL (hard-coding VAR's format - or anything
else, for that matter - is contrary to the framework I had developed for
that ETL), and, simultaneously, I needed VAR to end up 0 if VAR+1 were
never executed. After I understood what happened and why (which took a
while in the middle of the night), I added:
retain var 0 ; *else non-exec VAR+1 produces null! ;
to the step, which takes care of the problem, but smacks of a downright
silly thing to do - talk about redundancy now! That seemed to be a
sufficient cause to post a beware note to SAS-L (with the samaritanism of
preventing someone from getting up at 3 am thrown in).
Finally, remember that when a SAS novice posts something like
retain var 0 ;
...
if condition then var + smth ;
s/he is almost inevitably notified by someone on the list that the RETAIN
above is superfluous. Now if we accepted the notion that under certain DATA
step circumstances, a SUM statement variable can end up missing and that it
is natural and expected, then from the viewpoint of defensive programming
we should stipulate that the statement must *always* be paired with a
separate explicit RETAIN initializing its variable to 0.
Should we?
Kind regards
------------
Paul Dorfman
Jax, FL
------------
On Fri, 13 Mar 2009 16:36:09 -0500, Tim Kynerd <tkynerd@ECD.ORG> wrote:
>Hi Paul,
>
>To echo the other folks, here's the behavior I'm seeing in SAS 9.1.3:
>
>When I run your code, I get what you get: var=. var=7 .
>When I comment out the SET statement in the second DATA step, I get: var=0
. As I would expect, the second DATA step only executes once because it has
no input.
>
>I knew about the implied RETAIN with the Sum statement, but I didn't know
this:
>
>----
>Tip: The variable is automatically set to 0 before SAS reads the first
observation. The variable's value is retained from one iteration to the
next, as if it had appeared in a RETAIN statement.
>Tip: To initialize a sum variable to a value other than 0, include it
in a RETAIN statement with an initial value.
>----
>
>As to why VAR is initially missing in the first step:
>
>----
>It is redundant to name any of these items in a RETAIN statement, because
their values are automatically retained from one iteration of the DATA step
to the next:
>
> *
>
> variables that are read with a SET, MERGE, MODIFY or UPDATE
statement
> *
>
> a variable whose value is assigned in a sum statement
> *
>
> the automatic variables _N_, _ERROR_, _I_, _CMD_, and _MSG_
> *
>
> variables that are created by the END= or IN= option in the SET,
MERGE, MODIFY, or UPDATE statement or by options that create variables in
the FILE and INFILE statements
> *
>
> data elements that are specified in a temporary array
> *
>
> array elements that are initialized in the ARRAY statement
> *
>
> elements of an array that have assigned initial values to any or all
of the elements on the ARRAY statement.
>
>You can, however, use a RETAIN statement to assign an initial value to any
of the previous items, with the exception of _N_ and _ERROR_.
>----
>
>The fact that SAS emphasizes the option of using a RETAIN statement to
assign an initial value makes it seem natural to me, if not strictly
*logical*, that variables in these cases are initialized to missing, as
they apparently are.
>
>Best,
>Tim
>
>
>Tim Kynerd
>Computer Programmer/Analyst
>ECD/HOPE
>4 Old River Place, Suite A
>Jackson, MS 39202
>P: (601) 944-9308
>F: (601) 944-0808
>tkynerd@ecd.org
>
>-----Original Message-----
>From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of Paul
Dorfman
>Sent: Friday, March 13, 2009 1:16 PM
>To: SAS-L@LISTSERV.UGA.EDU
>Subject: SUM statement paradox (bug)?
>
>Y'all,
>
>For those inclined to "ponder over many a quaint and curious volume of
>forgotten lore", consider:
>
>data huh ;
> retain var 7 ;
>run ;
>
>data _null_ ;
> put var = @ ;
> var + 1 ;
> set huh ;
>run ;
>
>Paradox #1. I expect the second step to print var=0 var=7. It prints var=.
>var=7 instead, as if the SUM statement were not present.
>
>Paradox #2. If VAR is not in HUH, or is dropped from it, or SET is simply
>commented out, the step happily prints the expected var=0 var=7.
>
>It appears that the mere act of the compiler finding VAR in the descriptor
>of a data set named in a syntactically valid context changes the implicit
>sum-statement compiler action into RETAIN VAR . instead of the expected
>RETAIN VAR 0 (or, perhaps more likely, just abstains from setting VAR to
>zero and leaves it missing).
>
>I discovered it a hard way - a production load failed having detected a
>missing value for a NOT NULL column - and stared at the step in giddy
>incredulity for quite a while before I found that the reason was the SUM
>statement behaving not in the way it is documented.
>
>For nowhere in the SAS documentation this peculiar behavior is mentioned.
I
>only find the unequivocal assertions "the variable is automatically set to
>0 before SAS reads the first observation" and "The sum statement is
>equivalent to using the SUM function and the RETAIN statement, as shown
>here: retain variable 0; variable=sum(variable,expression);", for some
>reason listed under the "Tip" rubrics, as if the fundamental property of
>the SUM statement seared in the mind of any semicolon-worthy SAS
programmer
>is but a useful side effect.
>
>Since both directly contradict the experimental results capable of biting
>an unwary, I hereby declare it a bug.
>
>Kind regards
>------------
>Paul Dorfman
>Jax, FL
>------------
>
>This transmission is intended only for the use of the addressee and may
contain information that is privileged, confidential, and exempt from
disclosure under applicable law. If you are not the intended recipient, or
the employee or agent responsible for delivering the message to the
intended recipient, you are hereby notified that any dissemination,
distribution, or copying of this communication is strictly prohibited. If
you have received this communication in error, please notify us immediately
via e-mail at support@ecd.org.
|