Date: Thu, 12 Jul 2007 10:16:30 +0930
Reply-To: "Barnett, Adrian (DECS)" <Barnett.Adrian2@saugov.sa.gov.au>
Sender: "SPSSX(r) Discussion" <SPSSX-L@LISTSERV.UGA.EDU>
From: "Barnett, Adrian (DECS)" <Barnett.Adrian2@saugov.sa.gov.au>
Subject: Re: Optimization (was, re: SPSS and Java Interface?)
Content-Type: text/plain; charset="us-ascii"
From: SPSSX(r) Discussion [mailto:SPSSX-L@LISTSERV.UGA.EDU] On Behalf Of
Sent: Wednesday, 11 July 2007 9:12 AM
To: SPSSX-L@LISTSERV.UGA.EDU
Subject: Optimization (was, re: SPSS and Java Interface?)
At 01:00 AM 7/10/2007, Adrian Barnett wrote:
>>On Thu, 5 Jul 2007 14:02:52 -0500, Weeks, Kyle <email@example.com> wrote:
>>>More information on the new feature set for SPSS 16 will be
>>>Kyle Weeks, Ph.D.
>>Could you please include some info on the following:
>>* If V16 is better able to make use of all available memory -
>>presently [SPSS] can't seem to make use of more than about 700-900 MB.
>>* Degree of support for multi-core processors and motherboards with
>>multiple CPUs. Particularly desirable: sorting algorithms which can
>>spread the sort over multiple cores. [Although sorting is] the biggest
>>time-consumer, the more ANY of the heavy-duty data-manipulation tasks
>>could be spread like this the better.
>>I am interested because some projects are now pushing the boundaries
>>of what is feasible with current versions even with the most advanced
>>hardware.
>Butting in, of course: I assume those projects have been analyzed for
>inefficiencies in design and implementation?
I had in mind classes of projects rather than specific ones. In the area
I work in (government) there is increasing interest in analysing
operational data, and these projects tend to involve large volumes of
data. Record linkage is becoming much more widely recognized as a way to
go, linking data from multiple sources and so generating even bigger
files. When these involve transactional data recording lots of different
contacts over possibly decades, the files get very big indeed. In Western
Australia a group has been linking health-related data from an ever-
growing list of data providers for over 20 years, so these collections
can become enormous if you try to deal with all of them. In my
experience the final file for analysis is much smaller than the initial
one, but there is a stage, while you are preparing the initial data,
where things are very large for a while.
Your point is well-made, though, that the design of the data structures
needs a lot of thought and planning to try to ensure that processing
stays as efficient as possible.
>The question comes up for me because SPSS data-manipulation tasks are
>most commonly limited by disk I/O speed.
>Sorting is something of a special case, and may get CPU-bound under
>some conditions - I've no idea when, or how often. However - you've
>probably looked at this, but to speed sorting, what would be the
>relative importance of a second disk drive, so data can be read from
>one disk and written to the other; of more memory, with existing
>algorithms; of a dual-core algorithm? I'd expect that they'd usually be
>important in that order.
Indeed, keeping the swap file on a separate disk from the one the data
lives on is a way of reducing some of the impact of I/O during a sort.
The process of writing and reading the temporary files SPSS builds when
sorting a file that won't fit in memory takes up the bulk of the time in
a sort. Now that SPSS reports CPU time separately from elapsed time it
is much easier to quantify the effect.
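For anyone interested in what that temporary-file stage looks like in the
abstract, here is a minimal sketch of a generic external merge sort in
Python. It is purely illustrative - it is not SPSS's actual sorting
engine, and the run size is just a stand-in for the Workspace limit - but
it shows why every record gets written to and read back from disk at
least once when the file won't fit in memory.

import heapq
import os
import tempfile
from itertools import islice

def external_sort(in_path, out_path, lines_per_run=1_000_000):
    """Illustrative external merge sort (not SPSS's implementation):
    build sorted runs that fit in memory, spill each run to a
    temporary file, then stream-merge the runs into the output."""
    run_paths = []
    with open(in_path) as src:
        while True:
            run = [line.rstrip("\n") + "\n" for line in islice(src, lines_per_run)]
            if not run:
                break
            run.sort()                              # in-memory sort of one run
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as tmp:
                tmp.writelines(run)                 # spill the sorted run to disk
            run_paths.append(path)

    # Merge all runs; heapq.merge streams them so memory use stays small,
    # but the whole file is written to temp storage and read back again.
    with open(out_path, "w") as dst:
        files = [open(p) for p in run_paths]
        dst.writelines(heapq.merge(*files))
        for f in files:
            f.close()
    for p in run_paths:
        os.remove(p)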
From what I've been able to read about sorting, the more memory
available, the less writing to disk and the faster the sort will run
(holding all other factors constant). The thing I and others on the
list have noticed is that current and previous versions of SPSS don't
seem to make use of memory beyond about 700-900 MB. Whilst the biggest
files I've worked on would take more than the 2GB available on one of
my systems, theory suggests that if SPSS did use all of the available
RAM, things would still have improved. If I tell SPSS to increase the
Workspace beyond those levels, it whinges that it can't get that much
memory, even though Task Manager shows another gigabyte or so that
isn't being used. It's not hard now to find motherboards that support
8GB, and if the operating system supports it (which 64-bit versions
do), large projects would benefit a great deal.
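To put some rough numbers on that (my own back-of-the-envelope
arithmetic, not anything SPSS publishes): in a textbook external sort the
file is broken into memory-sized runs, and each extra merge pass re-reads
and re-writes the entire file. The sketch below assumes a hypothetical
20GB file and a 16-way merge; the point is only that a bigger workspace
can remove a whole pass over the data.

import math

def sort_passes(file_gb, workspace_gb, fan_in=16):
    """Back-of-the-envelope estimate, illustrative assumptions only:
    an external sort splits the file into ceil(file/workspace) sorted
    runs, then needs ceil(log_fan_in(runs)) merge passes, each of
    which reads and rewrites the whole file."""
    runs = math.ceil(file_gb / workspace_gb)
    merge_passes = math.ceil(math.log(runs, fan_in)) if runs > 1 else 0
    return runs, merge_passes

print(sort_passes(20, 0.8))  # (25, 2): two full read/write merge passes
print(sort_passes(20, 2))    # (10, 1): one pass fewer over the whole file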
In processing these big data files, the number of times the data has to
be sorted in different ways prior to analysis seems amazing. And no
matter how careful one is, and how often testing is done on small
subsets, it always seems that the main data gets put through the
whole series of programs in the cycle many more times than was ever
anticipated.
Computer scientists seem to have been working on sorting algorithms that
take advantage of more than one CPU for at least 20 years now, and these
seem to provide improvements that scale well with additional CPUs. Given
that dual-core is quite common now amongst new computers in the general
market, and quad-core is not hard to obtain, the hardware is well and
truly available to take advantage of. So sorting would definitely benefit
from the available algorithms that work well with two or more CPUs.
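As a toy illustration of the general idea - a generic
sort-the-chunks-then-merge scheme, not anything SPSS actually does, and
with an arbitrary worker count - the shape of such an algorithm is
simple:

import heapq
from multiprocessing import Pool

def parallel_sort(values, workers=4):
    """Toy multi-core sort, illustrative only: split the data into one
    chunk per worker process, sort the chunks in parallel, then do a
    single-threaded k-way merge of the sorted chunks."""
    chunk = max(1, (len(values) + workers - 1) // workers)
    parts = [values[i:i + chunk] for i in range(0, len(values), chunk)]
    with Pool(workers) as pool:
        sorted_parts = pool.map(sorted, parts)   # each core sorts its own chunk
    return list(heapq.merge(*sorted_parts))      # merge the sorted chunks

if __name__ == "__main__":
    import random
    data = [random.random() for _ in range(1_000_000)]
    assert parallel_sort(data) == sorted(data)

The final merge here is still single-threaded, which is why the speed-up
from extra cores tails off rather than scaling perfectly.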
>For other manipulations, it's rare that a transformation program is
>CPU-bound; and faster, or dual, CPUs aren't likely to help with one
>that isn't.
Yes, I/O seems to be the major bottleneck, but there's nothing like more
RAM for curing that - if only the system will make use of it!
An awful lot of work of a statistical nature isn't involved with
processing and restructuring large volumes of data. The stuff I'm aware
of at university research labs is rarely data-intensive and wouldn't be
much affected by how much RAM is available or how many processors there
are. So I guess most of the people on this list could not care less
about efficiency of memory use or sophistication of sorting algorithms.
The stuff I have been banging on about concerns a very different type of
work in a different type of organization. In the context in which I
work, the projects being talked about are starting to reach a point
where the capabilities of the hardware and software are a much more
important consideration in the feasibility of the project than they have
been. Previously, if things were taking too long, it was because you had
an old computer, and buying a new one would fix it. If the software
doesn't change soon, new hardware is no longer going to cure it.
Anyway, thanks for your thoughtful observations.