LISTSERV at the University of Georgia
Date:         Sun, 15 Nov 2009 08:23:37 -0700
Reply-To:     Jon K Peck <peck@us.ibm.com>
Sender:       "SPSSX(r) Discussion" <SPSSX-L@LISTSERV.UGA.EDU>
From:         Jon K Peck <peck@us.ibm.com>
Subject:      Re: Advice regarding very large dataset
Comments: To: Albert-Jan Roskam <fomcl@yahoo.com>
In-Reply-To:  <637113.44004.qm@web110705.mail.gq1.yahoo.com>
Content-Type: multipart/alternative;

I'm not clear on why vectors don't meet the requirements for this problem. You read in your data as usual and define a vector that in effect overlays the variable list. Then you can use ordinary SPSS transformation looping commands such as LOOP and use the vector indexes as subscripts. Although the vector definition exists only during transformation processing, that seems to be the time you need it. You can also create new variables with VECTOR.

Vector elements must all have the same type - you can't mix numbers and strings.

If you do want to go the Python route, I suggest looking at the SPSSINC TRANS extension command. Using that, you can just write a function that deals with the transformations themselves and leave the case looping and new variable creation to the extension command to take care of.
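As a rough illustration of the "just write a function" idea, here is a minimal sketch of a per-case function for the HowTo-after-welcome/info question raised later in the thread. It is plain Python and does not call SPSS at all; the function name, its three-sequence signature, and the use of None for missing times are assumptions made for the example, and the way such a function would be named on the SPSSINC TRANS command is not shown.

```python
def howto_durations(page_ids, load_times, unload_times):
    """Return (after_welcome, after_info) durations for one case.

    Each argument is a sequence with one entry per pageview.
    Missing numeric values are assumed to arrive as None (hypothetical).
    If several HowTo pages qualify, the last one wins, as in the
    pseudocode quoted below in this thread.
    """
    after_welcome = None
    after_info = None
    # Start at the second pageview so that index i - 1 is always valid.
    for i in range(1, len(page_ids)):
        # String values may be blank-padded, so trim before comparing.
        current = (page_ids[i] or "").rstrip()
        previous = (page_ids[i - 1] or "").rstrip()
        if current != "HowTo":
            continue
        if load_times[i] is None or unload_times[i] is None:
            continue
        duration = unload_times[i] - load_times[i]
        if previous == "welcome":
            after_welcome = duration
        elif previous == "info":
            after_info = duration
    return after_welcome, after_info
```

Keeping the logic in a function like this means it can be checked on a few hand-made records before it is pointed at 400,000 cases.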

Regards.

Jon Peck SPSS, an IBM Company peck@us.ibm.com 312-651-3435

From: Albert-Jan Roskam <fomcl@yahoo.com>
To: SPSSX-L@LISTSERV.UGA.EDU
Date: 11/15/2009 04:21 AM
Subject: Re: [SPSSX-L] Advice regarding very large dataset
Sent by: "SPSSX(r) Discussion" <SPSSX-L@LISTSERV.UGA.EDU>

It's far easier to use Python for that. You could use the Cursor class to read in each (part of a) record. You could use your pseudo code for that (although i-1 probably won't work the way you want for the first item of the list).

BEGIN PROGRAM.
import spss
cur = spss.Cursor(accessType='w')
for i in range(spss.GetCaseCount()):
    vars = cur.fetchone()
    for index, varx in enumerate(vars):
        if varx == ... # etc
cur.close()
END PROGRAM.
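On the i-1 point: one way around the first-item problem is to regroup the flat record into one tuple per pageview and then walk consecutive pairs. The sketch below is plain Python with no SPSS calls; it assumes the record arrives as a flat sequence with the ID first and exactly three fields per pageview (PageID, LoadTime, UnloadTime), and both helper names are made up for the example.

```python
def pageviews(row, fields_per_view=3):
    """Split a flat case (ID first) into one tuple per pageview."""
    values = row[1:]  # drop the leading ID
    return [tuple(values[k:k + fields_per_view])
            for k in range(0, len(values), fields_per_view)]


def consecutive_pairs(views):
    """Pair each pageview with the one before it: (previous, current)."""
    # zip stops at the shorter sequence, so the first pageview never
    # appears as "current" and there is no i - 1 lookup to go wrong.
    return list(zip(views, views[1:]))


row = (7, "welcome", 0, 10, "HowTo", 10, 30, "info", 30, 50)
for previous, current in consecutive_pairs(pageviews(row)):
    if current[0] == "HowTo" and previous[0] == "welcome":
        duration = current[2] - current[1]
```

With the real data there are more than three variables per pageview, so fields_per_view and the tuple positions would need adjusting.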

See the free spss data management book for details.

Cheers!! Albert-Jan

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Before you criticize someone, walk a mile in their shoes, that way when you do criticize them, you're a mile away and you have their shoes! ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

--- On Sun, 11/15/09, Mark Vande Kamp <mevk@U.WASHINGTON.EDU> wrote:

> From: Mark Vande Kamp <mevk@U.WASHINGTON.EDU>
> Subject: Re: [SPSSX-L] Advice regarding very large dataset
> To: SPSSX-L@LISTSERV.UGA.EDU
> Date: Sunday, November 15, 2009, 12:00 AM
>
> OK. I'll try to ask a more specific question. I think the main thing I
> want to know is how to set up for and make a loop structure that
> efficiently deals with more than one indexed variable. So, I'll provide
> an example. Recall that I have web pageview data in which each record
> (starting with a unique ID) has up to 100 pageviews of data (there are
> variables for 100 pageviews in each record, but the later variables are
> empty for people who saw fewer pages.)
>
> Data structure
>
> ID PageID1 LoadTime1 UnloadTime1 PageID2 LoadTime2 UnloadTime2....PageID100 LoadTime100 UnloadTime100
>
> So, I want to do a series of analyses using the sets of pageview
> variables. Many of these analyses use more than one variable at a time,
>
> For example, I might want to know how long people look at a "HowTo"
> page depending on whether the preceding page was a "welcome" or an
> "info" page. I'll write a loop below that I know won't be complete (and
> probably not correct) but it should demonstrate the type of things I
> want to do.
>
> for i = 2 to 100
> if (PageID(i) = "HowTo" and PageID(i-1) = "welcome")
>    HowToAfterWelcomeDuration = UnloadTime(i) - LoadTime(i).
> if (PageID(i) = "HowTo" and PageID(i-1) = "info")
>    HowToAfterInfoDuration = UnloadTime(i) - LoadTime(i).
> end loop.
>
> *I understand that if people see more than one "HowTo" page after
> "welcome" or "info" that this syntax will return only the last such
> duration in the record and I know how to do fancier code to deal with
> that situation if necessary.
>
> My question is how to best get all the necessary variables into an
> indexable form so we can do this kind of thing with loops. We are
> currently doing a really kludgy method of creating 100 repeated blocks
> of SPSS syntax using Word mail-merge to replace the indexing digits at
> the end of the repeated variables, but that creates huge syntax files
> and is extremely cumbersome.
>
> My initial hope was that vectors were a sort of "variable type" and we
> could just read our data into vector variables. However, I'm now under
> the impression that vectors are a sort of ephemeral format that goes
> away after transformations are executed. They still might be the best
> way to address the situation I describe, but I'm not sure how they would
> be applied.
>
> I hope this explains our issues more understandably.
>
> Thanks for any help and/or suggestions,
>
> Mark
>
>
> On Sat, 2009-11-14 at 07:47 -0500, Art Kendall wrote:
> > It is hard to say without a great deal of familiarity with your data.
> > However, you might consider
> > 1) a match files to make X's, Y's, Z's, contiguous
> > 2) look up DO REPEAT
> > 3) I do not recall whether set definitions are now retained between data
> > sets in version 18 which I intend to install soon.
> > 4) See if you think changing to long layout and using AGGREGATE by ID
> > would help
> > 5) try to keep the data on a local disk
> > 6) the syntax to define a list of variables for do repeat or vectors can
> > be cannibalized via cut-and-paste or via INSERT.
> >
> > Before SPSS supported multiple files open at once and before there were
> > PCs with cut-and-paste across applications, I would start a new set of
> > syntax by editing an earlier set so that I could reuse text that was
> > complicated like a list of all the X's, Y's, Z's.
> >
> > Hope this helps
> > Art Kendall
> > Social Research Consultants
> >
> >
> > Mark Vande Kamp wrote:
> > > I have used SPSS for a long time but am just now trying to learn about
> > > things like vectors, loops and macros, because I am starting a new project.
> > >
> > > We have a huge dataset of information regarding website visitor movements
> > > through a set of web pages. The tab-delimited data are structured as a
> > > single long record for each visitor with up to 100 page views, and each page
> > > view is represented by many variables. A simplified schematic might be:
> > >
> > > UserID X1 Y1 Z1 X2 Y2 Z2....X100 Y100 Z100
> > >
> > > Note that there are many more than 3 variables per page view and up to
> > > 400,000 records, so computation time is a big issue.
> > >
> > > Many of the initial analyses are repeated for all 100 page views to look for
> > > things like "entry pages" for each user, so a loop to, for example, test the
> > > appropriate variable in each of the 100 page views (and do other types of
> > > data processing) would seem to be an appropriate approach. Initially, I
> > > thought the data might be imported into vector variables to facilitate this
> > > loop approach. However, I just read (I think) that vectors are ephemeral and
> > > not really a variable "type".
> > >
> > > So, I'm asking for advice regarding the ways to read the data into SPSS and
> > > do the repeated processing of the repeated groups of variables that are
> > > necessary. As I mentioned before, computationally thrifty approaches would
> > > be best, given the size of the dataset (in both # of variables and # of cases).
> > >
> > > Thanks,
> > >
> > > Mark
> > >
> > > =====================
> > > To manage your subscription to SPSSX-L, send a message to
> > > LISTSERV@LISTSERV.UGA.EDU (not to SPSSX-L), with no body text except the
> > > command. To leave the list, send the command
> > > SIGNOFF SPSSX-L
> > > For a list of commands to manage subscriptions, send the command
> > > INFO REFCARD


