Date: Wed, 1 Nov 2006 18:07:46 -0500
Reply-To: Richard Ristow <wrristow@mindspring.com>
Sender: "SPSSX(r) Discussion" <SPSSX-L@LISTSERV.UGA.EDU>
From: Richard Ristow <wrristow@mindspring.com>
Subject: Re: using scratch variables
At 10:56 AM 10/31/2006, Anton Balabanov wrote, following up concerning
scratch variables and LAG. Text from that posting is quoted where it is
pertinent in the following discussion.
Before I start, thank you, Anton. You've raised deep and interesting
questions in your earlier postings and here. I took several days to
post back because I took several days to work it out - as far as I did.
To readers in general: this is long and technical, though I've made it
as clear as I could manage. There are two sections:
Regarding LAG and scratch variables (Example I):
Regarding LAG, permanent, and scratch variables (Example II):
..............................................
Regarding LAG and scratch variables (Example I):
>I saw 2 explanations [of "LAG" for scratch variables] in your posting.
>But ... neither of your explanations seems satisfactory, IMHO.
See the following discussion. I think they're consistent, and accurate.
In the analysis below, I argue that they're consistent both with SPSS
documentation and with the behavior observed.
I think you fall into trouble when you think of re-initialization of
scratch variables. From both the documentation and observed behavior,
that re-initialization does not happen.
>1. "It sounds like the implementation is something like 'the value of
>the variable just before the Nth preceding [overall]
>re-initialization'".
Example I: The following is SPSS draft output; discussion follows. In
this file, variables "b.1" and "b.2" are entered as data, from your
posting.
* ...... Example I: Post from here on ............. .
NUMERIC #A (F2).
NUMERIC @#A_BFR @#A_AFT B (F2).
COMPUTE @#A_BFR = #A.
* "the following syntax for 10-case file would return .
* [b.1]; instead, we have [b.2]:" .
. IF $casenum=1 #a=1.
. IF RANGE($casenum,5,7) #a=$casenum.
. COMPUTE b=LAG(#a,2).
COMPUTE @#A_AFT = #A.
LIST.
|-----------------------------|---------------------------|
|Output Created |01-NOV-2006 15:03:52 |
|-----------------------------|---------------------------|
LINE_NUM  b.1  b.2  @#A_BFR  @#A_AFT  B
      01    0    .        0        1  .
      02    0    .        1        1  .
      03    0    1        1        1  1
      04    0    1        1        1  1
      05    0    1        1        5  1
      06    1    1        5        6  1
      07    5    5        6        7  5
      08    6    6        7        7  6
      09    6    7        7        7  7
      10    6    7        7        7  7
Number of cases read: 10 Number of cases listed: 10
* ...... Example I: End ............. .
In the above, B.1 is what you expected to see, and B.2 is what you saw.
B is what was calculated, and matches B.2. Variables @#A_BFR and
@#A_AFT record the values of scratch variable #A at the beginning and
end of the transformation program, for that case.
Your reasoning:
>Zeros are [predicted] because scratches are initialized to 0, not to
>SYSMIS.
This doesn't apply for cases 01 and 02. You have
. COMPUTE b=LAG(#a,2).
For cases 01 and 02, that's the value of #A from cases "-1" and "00",
neither of which exist; so the result is missing. #A is initialized to
0, but only when it comes into existence, i.e. in case 01.
>Zeros up to the 6th case are because only at 6th case we have 2
>re-initializations of #a.
#A is initialized to 0 at case 1; you then compute it as 1. As shown,
#A is 0 at the start of the transformation program for that case, and 1
at the end.
But you don't have "2 re-initializations of #a": "SPSS does not
reinitialize scratch variables when reading a new case. Their values
are always carried across cases." (SPSS 14 Command Syntax Reference,
p.33).
In cases 02, 03, and 04 you don't change #A, so it keeps the value it
had at the end of case 01: namely, 1. (See @#A_BFR and @#A_AFT for
those cases.)
>Instead, we have [B.2, which matches variable B in the above listing].
Your code is
. COMPUTE b=LAG(#a,2).
In the output, B.2 and B are missing for the first two cases, as
discussed above. In later cases, they have the value of @#A_AFT from
two cases before: "the value of the variable [#A] just before the Nth
preceding [overall] re-initialization."
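
The reading above can be checked against the listing with a small model. What follows is a minimal Python sketch, not SPSS: it assumes that #A starts at 0 once and is never reset, and that LAG(#A,2) returns the value #A had at the end of the case two cases earlier. The function name and the use of None for system-missing are illustrative choices, not anything from SPSS.

```python
def simulate_example_1(n_cases=10, lag_order=2):
    """Model of Example I: #A is carried across cases; LAG sees end-of-case values."""
    scratch_a = 0        # scratch variable: set to 0 once, never re-initialized
    end_of_case = []     # value of #A at the end of each earlier case
    rows = []
    for casenum in range(1, n_cases + 1):
        a_bfr = scratch_a                  # COMPUTE @#A_BFR = #A
        if casenum == 1:                   # IF $casenum=1 #a=1
            scratch_a = 1
        if 5 <= casenum <= 7:              # IF RANGE($casenum,5,7) #a=$casenum
            scratch_a = casenum
        # COMPUTE b=LAG(#a,2); None stands in for system-missing
        if len(end_of_case) >= lag_order:
            b = end_of_case[-lag_order]
        else:
            b = None
        a_aft = scratch_a                  # COMPUTE @#A_AFT = #A
        end_of_case.append(scratch_a)      # snapshot taken when the case closes
        rows.append((casenum, a_bfr, a_aft, b))
    return rows


for row in simulate_example_1():
    print(row)
```

Under these assumptions the three computed columns of the model agree with @#A_BFR, @#A_AFT and B in the listing for all ten cases.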
>2. The second explanation "variable "#a", at the start of the case,
>has the value it had at the end of the preceding case"
I believe that's correct. As you can see above, @#A_BFR for cases 2 and
following matches @#A_AFT for the immediately preceding case. In case
1, @#A_BFR is 0, which is the value of #A the ONLY time it is
initialized.
>[This] is OK for LAG(#a) or LAG(#a,1)
It doesn't, that I can see, have anything to do with LAG; notice that
it doesn't mention LAG. As noted above, b=LAG(#a,2) is what my
hypothesis about LAG predicts.
>According to the CSR for SPSS13: "In a series of transformation
>commands without any intervening EXECUTE commands or other commands
>that read the data, lag functions are calculated after all other
>transformations, regardless of command order.", that is, #a in the
>current case had been already re-initialized
That's the mistake. As previously noted, #a is not re-initialized.
>Raynald Levesque [writes] "...if you assign ... a value to a scratch
>variable in case 1, then that value will remain the same for all
>subsequent cases UNLESS YOU change it yourself by syntax" brought me
>to another understanding of the process how SPSS works with scratch
>variables. The key word in the quotation above is "subsequent". It
>seems, SPSS REMEMBERS past values of scratch variables for each case.
Yes. You'll see that's exactly what is stated above.
>...just like it does with permanent variables,
Not quite; permanent variables are handled differently. Permanent
variables are "remembered" in that they're written to the working file;
scratch variables are "remembered" in that the values they had at the
end of one case's computations, are available at the beginning of the
next case's computation. (Permanent variables for which LEAVE is
specified are "remembered" in both senses.)
>SPSS keeps the last [calculated, not] initialized value for every
>subsequent case, unless it will be [calculated] via syntax next time.
>That is, scratch variable exists only [until] the first EXECUTE [or
>other procedure, or SAVE] and only in RAM of the computer.
I believe that is correct.
>But it is not a scalar, and it is not an array of serial
>re-initialized values. Instead, it is a column vector just like the
>permanent variable, but with different mode of re-initialization.
It's not possible to tell from the documentation or the observed
behavior, but I think scratch variables are probably scalars, i.e. not
written even temporarily as column vectors to the working file.
("Column vector" is not standard SPSS terminology, but it is accurate.)
That's OK for LAG, if it's implemented "in time" (your terminology),
i.e. "counts re-initializations." Which I think is accurate, except
that the re-initialization is the *global* re-initialization, from
which scratch variables, and permanent variables with LEAVE, are
exempt.
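
To illustrate why a scalar would be enough, here is a hedged Python sketch of one way LAG of any order could be served without storing a whole column: a short buffer holding only the last N end-of-case values. This is an assumption about a possible design, not a description of how SPSS is implemented; the class LagBuffer and its method names are hypothetical.

```python
from collections import deque


class LagBuffer:
    """Hypothetical helper: keeps only the last `order` end-of-case values."""

    def __init__(self, order):
        self.order = order
        self.values = deque(maxlen=order)   # older values fall off automatically

    def lagged(self):
        # Missing (None) until `order` earlier cases have been recorded.
        if len(self.values) < self.order:
            return None
        return self.values[0]               # oldest retained value: `order` cases back

    def close_case(self, value):
        # Called once per case, just before the next case begins.
        self.values.append(value)


# End-of-case values of #A taken from the @#A_AFT column of Example I.
a_aft = [1, 1, 1, 1, 5, 6, 7, 7, 7, 7]
buf = LagBuffer(2)
b = []
for value in a_aft:
    b.append(buf.lagged())
    buf.close_case(value)
print(b)
```

With the @#A_AFT values from Example I as input, the buffer yields the B column of that listing while never holding more than two values at a time.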
>This 'hypothesis' explains why LAG works well with scratch variables
>with any lag order. What do you think?
I think so. But I don't know whether "this 'hypothesis'," as I've
expounded it, should be considered consistent with yours, or not.
..............................................
Regarding LAG, permanent, and scratch variables (Example II):
>That is, LAG operates "in space" with permanent variable (i.e., in
>file sort order) and "in time" with scratch variable (i.e. counts
>re-initializations).
I would expect that both are implemented the same way, because it would
be very awkward to maintain two different implementations of LAG. If I
understand you, and interpret the following test correctly, both
implementations are "in time", as you put it. That is, LAG(VAR,N)
returns the value of variable 'VAR' from just before the Nth previous
global initialization, where a global initialization takes place at the
close of the transformation program, just before a new case is begun.
To review: at a global initialization, numeric variables are generally
set to SYSMIS, and string variables to blank. However, this is not done
for scratch variables, or for permanent variables for which LEAVE has
been specified.
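
As a compact restatement of that rule, here is a minimal Python sketch. It is only a model: None stands in for SYSMIS, an empty string stands in for a blank string value, and the variable names B, TOTAL and NAME are made up for the illustration. Variables read from the data file are simply overwritten by the next case and are not modeled.

```python
SYSMIS = None  # stand-in for the system-missing value


def global_reinitialize(values, kinds, leave=()):
    """Return the variable values as they would stand at the start of the next case.

    values: dict of name -> current value
    kinds:  dict of name -> "numeric" or "string"
    leave:  names of permanent variables for which LEAVE was specified
    """
    fresh = {}
    for name, value in values.items():
        if name.startswith("#") or name in leave:
            fresh[name] = value        # scratch or LEAVE: carried across cases
        elif kinds[name] == "string":
            fresh[name] = ""           # string variables are set to blank
        else:
            fresh[name] = SYSMIS       # numeric variables are set to SYSMIS
    return fresh


state = {"#A": 7, "B": 5, "TOTAL": 12, "NAME": "x"}
kinds = {"#A": "numeric", "B": "numeric", "TOTAL": "numeric", "NAME": "string"}
next_state = global_reinitialize(state, kinds, leave={"TOTAL"})
print(next_state)
```

In this model only B and NAME are reset; #A and TOTAL keep the values they had at the end of the previous case.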
The following is SPSS draft output. Variables LINE_NUM and A have their
values at the start of this transformation program. All other
computations are shown.
* ...... Example II: Post from here on ............ .
NUMERIC #A ##A (F2).
NUMERIC @#A_BFR @#A_AFT (F2).
NUMERIC B_PERM B_SCRTCH (F2).
COMPUTE @#A_BFR = #A.
* "That is, LAG operates 'in space' with permanent .
* variable (i.e., in file sort order) and "in time" with .
* scratch variable (i.e. counts re-initializations)." .
. COMPUTE B_PERM = LAG(A,2).
. COMPUTE B_SCRTCH = LAG(#A,2).
. COMPUTE #A = A.
* Drop cases 5 and 7 (original numbering) .
. SELECT IF NOT ANY(LINE_NUM,5,7).
COMPUTE @#A_AFT = #A.
LIST.
|-----------------------------|---------------------------|
|Output Created |01-NOV-2006 17:20:56 |
|-----------------------------|---------------------------|
LINE_NUM   A  @#A_BFR  @#A_AFT  B_PERM  B_SCRTCH
      01   1        0        1       .         .
      02   3        1        3       .         .
      03   5        3        5       1         1
      04   7        5        7       3         3
      06  11        9       11       5         5
      08  15       13       15       7         7
      09  17       15       17      11        11
      10  19       17       19      15        15
Number of cases read: 8 Number of cases listed: 8
* ...... Example II: End ............ .
. #A is computed as the value of A. Observe that the value of @#A_AFT
is the same as that of A, but the value of @#A_BFR is not.
. B_PERM is LAG(A,2). B_SCRTCH is LAG(#A,2), i.e. of a scratch
variable.
. Cases 05 and 07 (original numbering) are deleted.
Notice that B_PERM, lagging the permanent variable A, and B_SCRTCH,
lagging the scratch variable #A, are the same; and, in both cases, they
are the value of A (the same as #A) from two cases earlier, counting
cases AFTER the deletion.
With respect to lagging the scratch variable, this is consistent with
LAG working "in time", i.e. saving values as they were before global
re-initialization, but only if the global re-initialization following a
deleted case isn't counted. My guess is that this is the case. In any
event, there's no evidence that the logic for lagging A is different
from that for lagging #A.
My guess is that what you call "in time" logic is used for both. But
this test is certainly not definitive, and I can't think of one that
would be.
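
For what it is worth, the guess can at least be checked against the Example II listing. The Python sketch below is a model, not SPSS. It assumes that LAG values are saved at the end of a case only when the case survives SELECT IF, while #A keeps whatever was last assigned to it even in a dropped case. The values of A for the dropped cases 05 and 07 are not in the listing; 9 and 13 are assumed here, which agrees with @#A_BFR for cases 06 and 08.

```python
def simulate_example_2(dropped=(5, 7), lag_order=2):
    """Model of Example II: one 'in time' LAG rule for both A and #A."""
    scratch_a = 0        # #A: set to 0 once, never re-initialized
    kept_a = []          # end-of-case values of A, kept cases only
    kept_scratch = []    # end-of-case values of #A, kept cases only
    rows = []
    for line_num in range(1, 11):
        a = 2 * line_num - 1               # A as listed: 1, 3, ..., 19
        a_bfr = scratch_a                  # COMPUTE @#A_BFR = #A
        # None stands in for system-missing
        b_perm = kept_a[-lag_order] if len(kept_a) >= lag_order else None
        b_scrtch = (kept_scratch[-lag_order]
                    if len(kept_scratch) >= lag_order else None)
        scratch_a = a                      # COMPUTE #A = A
        if line_num in dropped:            # SELECT IF NOT ANY(LINE_NUM,5,7)
            continue                       # case dropped: no snapshot is saved
        a_aft = scratch_a                  # COMPUTE @#A_AFT = #A
        kept_a.append(a)
        kept_scratch.append(scratch_a)
        rows.append((line_num, a, a_bfr, a_aft, b_perm, b_scrtch))
    return rows


for row in simulate_example_2():
    print(row)
```

Under these assumptions the model produces the same eight rows as the listing, with B_PERM and B_SCRTCH identical throughout. That shows the single rule is sufficient to explain the output; it does not show that SPSS actually works this way.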
>Thank you for thorough explanation and pointing me out the example
>with INPUT PROGRAM, as well as "INPUT PROGRAM paradox" discussion.
>Indeed, LOOP within INPUT PROGRAM operates differently with scratch
>and permanent variables!
It does. However (not demonstrated) it operates the same for permanent
variables with LEAVE specified, as it does for scratch variables.
-With very best wishes,
Richard