LISTSERV at the University of Georgia
Date:         Wed, 14 Dec 2005 13:03:48 -0500
Reply-To:     Krishna Rama-Murthy <krm@spcregion.org>
Sender:       "SPSSX(r) Discussion" <SPSSX-L@LISTSERV.UGA.EDU>
From:         Krishna Rama-Murthy <krm@spcregion.org>
Subject:      Re: [BULK] Re: SPSS and large data files
Content-Type: text/plain; charset="us-ascii"

Essentially, the basic idea of supercomputing is to split your task into multiple subtasks. Can you split your file and your task (millions of records and 100s of variables) into sub files and sub tasks and then finally combine the results? If yes, then you can drastically reduce your computational time. Every time you run such a big file, it goes through millions of records. It also depends on how many times you are going to repeat this task. If it is only a one-time effort, then I guess the solution is not cost-beneficial. If you are going to repeat this multiple times, then maybe it's worth giving it a try.
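
A rough sketch of the split-and-combine idea in SPSS syntax, not a tested recipe: summarize each sub file on its own, then stack the small summary files and total them. The file paths and the variables region and income are hypothetical, made up for illustration, and this only works for statistics that can be rebuilt from partial sums and counts (such as a mean).

```spss
* Hypothetical file names and variables, for illustration only.
* Step 1: summarize each sub file separately.
GET FILE='C:\data\part1.sav' /KEEP=region income.
AGGREGATE OUTFILE='C:\data\part1_sums.sav'
  /BREAK=region
  /income_sum=SUM(income)
  /n_cases=N.
GET FILE='C:\data\part2.sav' /KEEP=region income.
AGGREGATE OUTFILE='C:\data\part2_sums.sav'
  /BREAK=region
  /income_sum=SUM(income)
  /n_cases=N.
* Step 2: stack the partial results and total them by region.
ADD FILES /FILE='C:\data\part1_sums.sav'
  /FILE='C:\data\part2_sums.sav'.
AGGREGATE OUTFILE=*
  /BREAK=region
  /income_sum=SUM(income_sum)
  /n_cases=SUM(n_cases).
* Step 3: rebuild the overall mean from the combined sums and counts.
COMPUTE income_mean = income_sum / n_cases.
LIST.
```

Whether this saves time depends on whether the sub files can be processed separately (for example, on different machines or at different times) and on how often the job is repeated.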

HTH

Krishna Murthy

Transportation Analyst III

Southwestern Pennsylvania Commission

425 Sixth Avenue, Suite 2500

Pittsburgh, PA 15219

Tel: 412-391-5590 x 370

Fax: 412 391-9160

-----Original Message-----
From: SPSSX(r) Discussion [mailto:SPSSX-L@LISTSERV.UGA.EDU] On Behalf Of Oliver, Richard
Sent: Wednesday, December 14, 2005 11:25 AM
To: SPSSX-L@LISTSERV.UGA.EDU
Subject: Re: [BULK] Re: SPSS and large data files

In addition to only using Execute when absolutely necessary (which for most practical purposes means very rarely), if you're reading data from databases try your syntax with and without a CACHE command after GET DATA to see if that makes any difference. Also avoid unnecessary SORT commands. Here's an example with no Executes:

GET DATA /TYPE=ODBC /CONNECT=
  'DSN=MS Access Database;DBQ=c:\program'+
  ' files\spss\tutorial\sample_files\demo.mdb;DriverId=25;FIL=MS Access;'
  'MaxBufferSize=2048;PageTimeout=5;'
  /SQL = 'SELECT ID, AGE, MARITAL FROM demo'.
CACHE.
DATASET NAME dataset1.
*Sort only necessary for subsequent MATCH if not already sorted.
SORT CASES BY ID.
GET FILE='C:\Program Files\SPSS\Tutorial\sample_files\demo.sav'
  /keep id income inccat.
DATASET NAME dataset2.
*Sort only necessary for subsequent MATCH if not already sorted.
SORT CASES BY ID.
MATCH FILE FILE=dataset1 /FILE=* /BY ID.
DESCRIPTIVES VARIABLES=ALL.

-----Original Message-----
From: SPSSX(r) Discussion [mailto:SPSSX-L@LISTSERV.UGA.EDU] On Behalf Of Hoover, Matthew
Sent: Wednesday, December 14, 2005 9:51 AM
To: SPSSX-L@LISTSERV.UGA.EDU
Subject: [BULK] Re: SPSS and large data files
Importance: Low

Hello Kim,

I had this same problem when I started using SPSS and was working with very large data files. One thing that cuts down on processing time is being judicious in your use of "execute" statements (this was a lesson I was very happy to learn, and you might already be aware of it). You only need to run an execute command when previous calculations are needed for other transformations. Another thing that has helped me (although it could also be a problem) is paying attention to whether the data is stored on a computer network versus your PC. Running data from a network is dramatically slower than running data that is saved on your hard drive.
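
For what it's worth, here is a small sketch of the execute point. It assumes a hypothetical file with variables named income and age; none of these names come from this thread. Transformations are held as pending and are carried out during the next pass through the data, so a procedure that follows them usually makes a separate EXECUTE unnecessary.

```spss
* Hypothetical file and variable names, for illustration only.
GET FILE='C:\data\survey.sav'.
COMPUTE income_k = income / 1000.
RECODE age (LOWEST THRU 34=1) (35 THRU HIGHEST=2) INTO agegrp.
* No EXECUTE here. The procedure below reads the data once.
* The pending COMPUTE and RECODE are applied in that same pass.
FREQUENCIES VARIABLES=agegrp.
```

Putting an EXECUTE after the COMPUTE and another after the RECODE would add two extra passes through the file, which is where the time goes with millions of records.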

Matt

-----Original Message-----
From: SPSSX(r) Discussion [mailto:SPSSX-L@LISTSERV.UGA.EDU] On Behalf Of Kim Jinnett
Sent: Wednesday, December 14, 2005 10:40 AM
To: SPSSX-L@LISTSERV.UGA.EDU
Subject: Re: SPSS and large data files

Dear listers,

I'm a former SAS user from a very large research firm and find myself at a small research firm using SPSS on a PC. I'm working with very large data files (millions of records and 100s of variables). SAS seemed to have better data manipulation facilities than SPSS, allowing multiple data files to be merged and variables selected all in an efficient data step. I have version 14.0 of SPSS but am finding that opening, merging, and then running analyses on these large data files takes minutes for a simple procedure...which means additional hours given how much we do with these datasets. I'm hoping someone on the list can advise me (I will also call SPSS tech support) on whether most SPSSers using large data files work with SQL or some other pre-SPSS step to manipulate and pick off their analytic variables before creating an SPSS dataset and running an analysis. With SAS, I was able to maintain a large dataset and simply select variables and cases for analysis at the analytic stage, or I might selectively sample the dataset periodically over the course of a year-long analysis.

I'm trying to sort out which time limitations come from SPSS in working with large data files (manipulating them and analyzing them) and which come from my physical PC-based system.

Any advice appreciated.

Thanks, Kim
