See below.

Jon Peck
SPSS, an IBM Company
peck@us.ibm.com
312-651-3435



From: Richard Ristow <wrristow@mindspring.com>
To: SPSSX-L@LISTSERV.UGA.EDU
Date: 11/15/2009 07:15 PM
Subject: Re: [SPSSX-L] Advice regarding very large dataset
Sent by: "SPSSX(r) Discussion" <SPSSX-L@LISTSERV.UGA.EDU>





At 10:23 AM 11/15/2009, Jon K Peck wrote:

I'm not clear on why vectors don't meet the requirements for this problem.  You read in your data as usual and define a vector that in effect overlays the variable list.  Then you can use ordinary SPSS transformation looping commands such as LOOP and use the vector indexes as subscripts.

But, here's the data structure:

[There is] a single record for each visitor with up to 100 page views, and each page view is represented by many variables. A simplified schematic might be:

UserID  X1 Y1 Z1 X2 Y2 Z2....X100 Y100 Z100

There are many more than 3 variables per page view.


It would be great to define vectors X, Y, and Z with indices 1-100. But SPSS can't do that; it requires all elements of any vector to be contiguous. You could, if all variables are numeric, define

VECTOR AllData X1 TO Z100.

but that leads to terribly clumsy code to calculate the index values.
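
To make the clumsiness concrete, here is a minimal Python sketch of the index arithmetic a single AllData vector would require. It assumes the interleaved file order X1 Y1 Z1 X2 Y2 Z2 ... and exactly three variables per page view; the names flat_index, VARS_PER_VIEW and OFFSET are made up for illustration.

```python
# Hypothetical helper: 1-based position of a variable inside VECTOR AllData,
# assuming the file order X1 Y1 Z1 X2 Y2 Z2 ... X100 Y100 Z100.
VARS_PER_VIEW = 3                      # assumption: only X, Y and Z
OFFSET = {"X": 1, "Y": 2, "Z": 3}      # position of each variable within a page view

def flat_index(var, view):
    """Return the AllData index for, e.g., ("Y", 2)."""
    return (view - 1) * VARS_PER_VIEW + OFFSET[var]

print(flat_index("X", 1))      # 1
print(flat_index("Y", 2))      # 5
print(flat_index("Z", 100))    # 300
```

In SPSS syntax the same expression would have to appear in every reference, for example AllData((#p - 1) * 3 + 2) for Y in page view #p, and both constants change whenever the number of variables per page view changes.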


>>>You could reorder the  variables easily with a little Python code (to avoid writing out the names).  Or do the transformations with a small Python program.

To reorder the variables (this requires the Python plugin from Developer Central):

data list free /UserID X1 Y1 Z1 X2 Y2 Z2 X3 Y3 Z3.
begin data
999 1 11 111 2 22 222 3 33 333
end data.
dataset name xyz.

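begin program.
import spss, spssaux

# Collect the variables whose names match each pattern (the X, Y and Z groups).
xvars = spssaux.VariableDict(pattern="X")
yvars = spssaux.VariableDict(pattern="Y")
zvars = spssaux.VariableDict(pattern="Z")

# All the X's first, then the Y's, then the Z's, each group sorted by name.
keepers = sorted(xvars.variables) + sorted(yvars.variables) + sorted(zvars.variables)

# MATCH FILES with /KEEP rewrites the active dataset in the new variable order.
spss.Submit("match files file=* /keep = UserID " + " ".join(keepers))
end program.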

Note:
- The names are sorted strictly alphabetically, which means that X10 comes before X2.
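
If the alphabetical order matters (it will with 100 page views), one possible workaround is to sort on the numeric suffix instead. The following is a sketch only, in plain Python with no SPSS calls; natural_key is a hypothetical helper, and it assumes every name is letters followed by digits.

```python
import re

def natural_key(name):
    """Split 'X10' into ('X', 10) so the numeric suffix sorts as a number."""
    match = re.match(r"([A-Za-z_]+)(\d+)$", name)
    if match is None:
        return (name, 0)               # assumption: names without a suffix sort first
    return (match.group(1), int(match.group(2)))

names = ["X10", "X2", "X1"]
print(sorted(names))                   # ['X1', 'X10', 'X2']
print(sorted(names, key=natural_key))  # ['X1', 'X2', 'X10']
```

In the program above, that would mean writing sorted(xvars.variables, key=natural_key) in place of sorted(xvars.variables), and likewise for the Y and Z groups.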

HTH,
Jon Peck



DO REPEAT does work.  It's a lengthy statement, since you have to name every variable:

DO REPEAT X = X1   X2   X3   X4   X5   X6   X7   X8   X9   X10
             X11  X12  X13  X14  [continuing to]
             X91  X92  X93  X94  X95  X96  X97  X98  X99  X100
        /Y = Y1   Y2   Y3   Y4   Y5   Y6   Y7   Y8   Y9   Y10
             ...
             Y91  Y92  Y93  Y94  Y95  Y96  Y97  Y98  Y99  Y100

       and the same for Z.      
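
Since every name has to be spelled out, the lists are easier to generate than to type. The sketch below builds only the DO REPEAT header as text, in plain Python; do_repeat_header is a hypothetical helper, and it assumes the names are a prefix plus a running number from 1. The transformations and the closing END REPEAT still have to be supplied.

```python
def do_repeat_header(prefixes, n_views):
    """Build the DO REPEAT header; the stand-in names are the bare prefixes."""
    clauses = []
    for prefix in prefixes:
        names = " ".join("%s%d" % (prefix, i) for i in range(1, n_views + 1))
        clauses.append("%s = %s" % (prefix, names))
    return "DO REPEAT " + "\n  /".join(clauses) + "."

# Three page views here to keep the output short; use 100 for the real file.
print(do_repeat_header(["X", "Y", "Z"], 3))
```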

As everybody knows, I usually advise 'unrolling' such structures to one record per event:

UserID PageView X Y Z
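
As an illustration of what the unrolled structure looks like, here is a small sketch in plain Python rather than SPSS syntax (within SPSS, VARSTOCASES is the command for this kind of restructuring). The function unroll and the dict-based record are assumptions made for the example; the sample values are the ones from the DATA LIST example earlier in this message.

```python
def unroll(record, prefixes, n_views):
    """Turn one wide record (a dict) into a list of per-page-view rows."""
    rows = []
    for view in range(1, n_views + 1):
        values = [record.get("%s%d" % (p, view)) for p in prefixes]
        if all(v is None for v in values):
            continue                   # assumption: no data means no such page view
        rows.append([record["UserID"], view] + values)
    return rows

wide = {"UserID": 999, "X1": 1, "Y1": 11, "Z1": 111,
        "X2": 2, "Y2": 22, "Z2": 222, "X3": 3, "Y3": 33, "Z3": 333}

# Each row is [UserID, PageView, X, Y, Z].
for row in unroll(wide, ["X", "Y", "Z"], 3):
    print(row)
```
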

But it would be nice to have SPSS handle the original records more gracefully; for example, with a construct like

VECTOR X, Y, Z = X1 TO Z100.


