Date: Wed, 14 Dec 2005 13:03:48 -0500
Reply-To: Krishna Rama-Murthy <krm@spcregion.org>
Sender: "SPSSX(r) Discussion" <SPSSX-L@LISTSERV.UGA.EDU>
From: Krishna Rama-Murthy <krm@spcregion.org>
Subject: Re: [BULK] Re: SPSS and large data files
Content-Type: text/plain; charset="us-ascii"
Essentially, the basic idea of supercomputing is to split your task into
multiple subtasks. Can you split your file and your task (millions of
records and 100s of variables) into subfiles and subtasks and then
combine the results at the end? If yes, then you can drastically reduce
your computation time. Every time you run such a big file, it goes
through millions of records.
It also depends on how many times you are going to repeat this task. If it
is only a one-time effort, then I guess the solution is not cost
effective. If you are going to repeat this multiple times, then maybe
it's worth a try.
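A minimal sketch of that split-and-combine idea in SPSS syntax, assuming
the big file has already been saved as two pieces. The file names
(part1.sav, part2.sav, agg1.sav, agg2.sav) and the variables region and
income are hypothetical. Each piece is reduced to a small summary file,
and only the summaries are combined:

```spss
* Hypothetical file and variable names; each part holds a subset of cases.
GET FILE='C:\data\part1.sav' /KEEP=region income.
AGGREGATE /OUTFILE='C:\data\agg1.sav'
  /BREAK=region
  /n_valid=N(income)
  /sum_income=SUM(income).

GET FILE='C:\data\part2.sav' /KEEP=region income.
AGGREGATE /OUTFILE='C:\data\agg2.sav'
  /BREAK=region
  /n_valid=N(income)
  /sum_income=SUM(income).

* Stack the small summary files and finish the calculation.
ADD FILES /FILE='C:\data\agg1.sav' /FILE='C:\data\agg2.sav'.
AGGREGATE /OUTFILE=*
  /BREAK=region
  /total_n=SUM(n_valid)
  /total_income=SUM(sum_income).
COMPUTE mean_income = total_income / total_n.
LIST.
```

This only works for statistics that can be rebuilt from partial results
(counts, sums, means). Something like a median cannot be combined this
way.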
HTH
Krishna Murthy
Transportation Analyst III
Southwestern Pennsylvania Commission
425 Sixth Avenue, Suite 2500
Pittsburgh, PA 15219
Tel: 412-391-5590 x 370
Fax: 412 391-9160
-----Original Message-----
From: SPSSX(r) Discussion [mailto:SPSSX-L@LISTSERV.UGA.EDU] On Behalf Of
Oliver, Richard
Sent: Wednesday, December 14, 2005 11:25 AM
To: SPSSX-L@LISTSERV.UGA.EDU
Subject: Re: [BULK] Re: SPSS and large data files
In addition to only using Execute when absolutely necessary (which for
most practical purposes means very rarely), if you're reading data from
databases try your syntax with and without a CACHE command after GET
DATA to see if that makes any difference. Also avoid unnecessary SORT
commands. Here's an example with no Executes:
GET DATA /TYPE=ODBC /CONNECT=
 'DSN=MS Access Database;DBQ=c:\program'+
 ' files\spss\tutorial\sample_files\demo.mdb;DriverId=25;FIL=MS Access;'+
 'MaxBufferSize=2048;PageTimeout=5;'
 /SQL = 'SELECT ID, AGE, MARITAL FROM demo'.
CACHE.
DATASET NAME dataset1.
*Sort only necessary for subsequent MATCH if not already sorted.
SORT CASES BY ID.
GET
FILE='C:\Program Files\SPSS\Tutorial\sample_files\demo.sav'
/keep id income inccat.
DATASET NAME dataset2.
*Sort only necessary for subsequent MATCH if not already sorted.
SORT CASES BY ID.
MATCH FILES FILE=dataset1 /FILE=*
 /BY ID.
DESCRIPTIVES VARIABLES=ALL.
-----Original Message-----
From: SPSSX(r) Discussion [mailto:SPSSX-L@LISTSERV.UGA.EDU] On Behalf Of
Hoover, Matthew
Sent: Wednesday, December 14, 2005 9:51 AM
To: SPSSX-L@LISTSERV.UGA.EDU
Subject: [BULK] Re: SPSS and large data files
Importance: Low
Hello Kim,
I had this same problem when I started using SPSS and was working with
very large data files. One thing that cuts down on processing
time is being judicious in your use of "execute"
statements (this was a lesson I was very happy to learn, and you might
already be aware of it). You only need to run an execute command when
the results of previous calculations are needed for other transformations.
Another thing that has helped me (although it could also be a problem) is
running data that is stored on my PC rather than on a computer network.
Processing data from a network is dramatically slower
than processing data saved on your hard drive.
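A small sketch of the "execute" point, with hypothetical variable names
(weight, height, bmi, bmi_cat). Transformations stay pending until a
procedure reads the data, so the whole block below should need only one
pass through the file:

```spss
* Hypothetical variable names.
COMPUTE bmi = weight / (height * height).
RECODE bmi (LO THRU 18.5=1) (18.5 THRU 25=2) (25 THRU HI=3) INTO bmi_cat.
* No EXECUTE here: FREQUENCIES triggers the data pass that
* carries out the COMPUTE and RECODE above.
FREQUENCIES VARIABLES=bmi_cat.
```

Putting an EXECUTE after the COMPUTE and another after the RECODE would
give the same result but read the file two extra times.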
Matt
-----Original Message-----
From: SPSSX(r) Discussion [mailto:SPSSX-L@LISTSERV.UGA.EDU] On Behalf Of
Kim Jinnett
Sent: Wednesday, December 14, 2005 10:40 AM
To: SPSSX-L@LISTSERV.UGA.EDU
Subject: Re: SPSS and large data files
Dear listers,
I'm a former SAS user from a very large research firm and find myself at a
small research firm using SPSS on a PC. I'm working with very large data
files (millions of records and 100s of variables). SAS seemed to have a
better data manipulation facility than SPSS, allowing multiple data files to
be merged and variables selected all in one efficient data step. I have
version 14.0 of SPSS but am finding that opening, merging, and then
running analyses on these large data files takes minutes for a
simple procedure, which means additional hours given how much we do with
these datasets. I'm hoping someone on the list can advise me (I will also
call SPSS tech support) on whether most SPSSers using large data files work
with SQL or some other pre-SPSS step to manipulate and pick off their
analytic variables before creating an SPSS dataset and running
analyses.
With SAS, I was able to maintain a large dataset and simply select variables
and cases for analysis at the analytic stage, or I might selectively sample
the dataset periodically over the course of a year-long analysis.
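For reference, a rough SPSS counterpart of that workflow might look like
the sketch below. The file name big_study.sav and the variables id, age,
income, and region are hypothetical. KEEP reads only the listed
variables, and TEMPORARY limits the case selection and sampling to the
next procedure so the full file is left untouched:

```spss
* Hypothetical file and variable names.
GET FILE='C:\data\big_study.sav' /KEEP=id age income region.
* TEMPORARY makes SELECT IF and SAMPLE apply to the next procedure only.
TEMPORARY.
SELECT IF (region = 1).
SAMPLE .10.
DESCRIPTIVES VARIABLES=age income.
```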
I'm trying to sort out which time limitations come from SPSS in working with
large data files (manipulating them and analyzing them) and which come from
my physical PC-based system.
Any advice appreciated.
Thanks,
Kim