Date: Thu, 13 May 2004 15:46:31 +1000
Reply-To: Frank Milthorpe <Frank.Milthorpe@dipnr.nsw.gov.au>
Sender: "SPSSX(r) Discussion" <SPSSX-L@LISTSERV.UGA.EDU>
From: Frank Milthorpe <Frank.Milthorpe@dipnr.nsw.gov.au>
Subject: Still confused on what the CACHE command actually does and how it
works.
Content-Type: text/plain; charset=US-ASCII
I am still confused as to how the CACHE command actually works. I am not
sure that the Help description quoted below by Richard Oliver is
necessarily correct.
I generally get my data using a GET CAPTURE command to get data from
Oracle using ODBC. I then do some transformations of the data. It is my
understanding that there are separate temporary copies of the active
file. It would seem that there would have to be at least two copies: the
current file and the new version that is being created. I seem to
remember that the total disk space required is actually 3n (where n is
the size of the current file), so maybe there is a third copy.
A literal reading of the Help system information (reproduced below)
would suggest that SPSS always goes back and re-executes the SQL
command. Clearly this is not the case once some transformations have
been made to the file. So what is the CACHE command
doing? Is the CACHE command creating a temporary copy in memory rather
than writing it to disk in a temporary file? What happens once
transformations are made? What happens if the file is too big to fit in
memory?
I would welcome suggestions on how to make processing of large files
more efficient. I, probably like many others, have a relatively
well-specified machine, in my case with 1 GB of memory.
Regards
Frank Milthorpe
>>> "Oliver, Richard" <richard@spss.com> 13/05/2004 1:26 am >>>
Oh, yes, absolutely. CACHE can definitely improve performance when
working with data from a database source. From the help system:
Creating a Data Cache
Although the virtual active file can vastly reduce the amount of
temporary disk space required, the absence of a temporary copy of the
"active" file means that the original data source has to be reread for
each procedure. For large data files read from an external source,
creating a temporary copy of the data may improve performance. For
example, for data tables read from a database source, the SQL query that
reads the information from the database must be reexecuted for any
command or procedure that needs to read the data. Since virtually all
statistical analysis procedures and charting procedures need to read the
data, the SQL query is reexecuted for each procedure you run, which can
result in a significant increase in processing time if you run a large
number of procedures.
If you have sufficient disk space on the computer performing the
analysis (either your local computer or a remote server), you can
eliminate multiple SQL queries and improve processing time by creating a
data cache of the active file. The data cache is a temporary copy of the
complete data.
Note: By default, the Database Wizard automatically creates a data
cache, but if you use the GET DATA command in command syntax to read a
database, a data cache is not automatically created.
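As a rough illustration of the note above (a sketch only, not taken from
the Help text), requesting a cache explicitly after a GET DATA read might
look like the following. The DSN, table and variable names are
hypothetical placeholders, and the exact connection string will depend on
the ODBC driver in use.

```
* Hypothetical DSN, table and variable names, for illustration only.
GET DATA
  /TYPE=ODBC
  /CONNECT='DSN=mydsn'
  /SQL='SELECT * FROM trips'.
* Request a cache. It is built on the next data pass.
CACHE.
EXECUTE.
* Later procedures should read the cached copy rather than requery.
FREQUENCIES VARIABLES=mode.
DESCRIPTIVES VARIABLES=distance.
```

Without the CACHE and EXECUTE lines, the Help description implies that
each of the two procedures would trigger its own pass through the SQL
query.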
--------------------------------------------
Frank Milthorpe
Senior Manager, Transport Modelling
Transport and Population Data Centre (TPDC)
Department of Infrastructure, Planning and Natural Resources
GPO Box 3927, Sydney NSW 2001
Level 5, 20 Lee Street, Sydney
Direct: +61 2 9762 8488
Tel: +61 2 9762 8511
Fax: +61 2 9762 8514
Email: frank.milthorpe@dipnr.nsw.gov.au