LISTSERV at the University of Georgia
Date:         Thu, 12 Jul 2007 10:16:30 +0930
Reply-To:     "Barnett, Adrian (DECS)" <Barnett.Adrian2@saugov.sa.gov.au>
Sender:       "SPSSX(r) Discussion" <SPSSX-L@LISTSERV.UGA.EDU>
From:         "Barnett, Adrian (DECS)" <Barnett.Adrian2@saugov.sa.gov.au>
Subject:      Re: Optimization (was, re: SPSS and Java Interface?)
Comments: To: Richard Ristow <wrristow@mindspring.com>
In-Reply-To:  A<7.0.1.0.2.20070710190433.03a88698@mindspring.com>
Content-Type: text/plain; charset="us-ascii"

Hi Richard

-----Original Message-----
From: SPSSX(r) Discussion [mailto:SPSSX-L@LISTSERV.UGA.EDU] On Behalf Of Richard Ristow
Sent: Wednesday, 11 July 2007 9:12 AM
To: SPSSX-L@LISTSERV.UGA.EDU
Subject: Optimization (was, re: SPSS and Java Interface?)

At 01:00 AM 7/10/2007, Adrian Barnett wrote:

>>On Thu, 5 Jul 2007 14:02:52 -0500, Weeks, Kyle <kweeks@spss.com>
>>wrote:
>>
>>>More information on the new feature set for SPSS 16 will be
>>>forthcoming soon.
>>>Regards.
>>>Kyle Weeks, Ph.D.
>>
>>Could you please include some info on the following:
>>
>>* If V16 is better able to make use of all available memory -
>>presently [SPSS] can't seem to make use of more than about 700-900 MB.
>>
>>* Degree of support for multi-core processors and motherboards with
>>multiple CPUs. - Particularly desirable, sorting algorithms which can
>>spread the sort over multiple cores. [Although sorting is] the biggest
>>time-consumer - the more ANY of the heavy-duty data-manipulation tasks
>>could be spread like this the better.
>>
>>I am interested because some projects are now pushing the boundaries
>>of what is feasible with current versions even with the most advanced
>>hardware.
>
>Butting in, of course: I assume those projects have been analyzed for
>inefficiencies in design and implementation?

I had in mind classes of projects rather than specific ones. In the area where I work (government) there is increasing interest in analysing operational data. These projects tend to involve pretty large volumes of data. Record linkage is becoming much more recognized as a way to go, linking data from multiple sources and so generating even bigger files. When these involve transactional data recording lots of different contacts over possibly decades, they get very large indeed. In Western Australia a group has been linking health-related data from an ever-growing list of data providers for over 20 years, so the volume can be enormous if you were to try to deal with all of it. In my experience, the final file for analysis is much smaller than the initial one, but there is a stage, while you are preparing the initial data, where the files are very large for a while.

Your point is well-made though, that the design of the data structures needs a lot of thought and planning to try to ensure that processing is efficient.

>The question comes up for me because SPSS data-manipulation tasks are
>most commonly limited by disk I/O speed.
>
>Sorting is something of a special case, and may get CPU bound under
>some conditions - I've no idea when, or how often. However - you've
>probably looked at this, but to speed sorting, what would be the
>relative importance of a second disk drive, so data can be read from
>one disk and written to the other; of more memory, with existing
>algorithms; of a dual-core algorithm? I'd expect that they'd usually be
>important in that order.

Indeed, keeping the swap file on a separate disk from the one the data lives on is a way of reducing some of the impact of I/O during a sort. The process of writing and reading temporary files SPSS builds when sorting a file that won't fit in memory takes up the bulk of the time in a sort. Now that SPSS reports CPU time separately from elapsed time it is much easier to quantify the effect.

From what I've been able to read about sorting, the more memory available, the less writing to disk and the faster the sort will run (holding all other factors constant). The thing I and others on the list have noticed is that current and previous versions of SPSS don't seem to make use of memory beyond about 700-900 MB. Whilst the biggest files I've worked on would take more than the 2GB available on one of my systems, theory suggests that if SPSS did use all of the available RAM, it would have improved things. If I tell SPSS to increase the Workspace beyond those levels, it whinges that it can't get that much memory, even though the Task Manager is showing there is another gigabyte or so that isn't being used. It's not hard now to find motherboards that support 8GB, and if the operating system would support it (which 64 bit versions do), large projects would benefit a lot.
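
To make the memory argument concrete, here is a minimal Python sketch of an external merge sort. It is only an illustration of the general technique, not a description of how SPSS sorts internally; the function name `external_sort` and the `max_in_memory` parameter are hypothetical. The point it shows is that a larger in-memory buffer produces fewer temporary runs on disk, and therefore less writing and re-reading.

```python
import heapq
import os
import tempfile


def external_sort(values, max_in_memory):
    """Sort an iterable of ints while buffering at most max_in_memory values.

    Returns (sorted_list, number_of_runs_written_to_disk).
    The result is returned as a list only to keep the demonstration short;
    a real external sort would stream the merged output to a file.
    """
    run_paths = []
    buffer = []

    def flush():
        # Sort the in-memory buffer and write it out as one temporary run.
        buffer.sort()
        fd, path = tempfile.mkstemp(suffix=".run", text=True)
        with os.fdopen(fd, "w") as f:
            f.writelines(f"{v}\n" for v in buffer)
        run_paths.append(path)
        buffer.clear()

    for v in values:
        buffer.append(v)
        if len(buffer) >= max_in_memory:
            flush()
    if buffer:
        flush()

    # Merge phase: read every run back and merge them lazily.
    files = [open(p) for p in run_paths]
    try:
        streams = [(int(line) for line in f) for f in files]
        merged = list(heapq.merge(*streams))
    finally:
        for f in files:
            f.close()
        for p in run_paths:
            os.remove(p)
    return merged, len(run_paths)
```

Calling `external_sort(data, 3)` on ten values writes four runs, whereas `external_sort(data, 10)` writes one. The same relationship, scaled up, is why a sort that could use all of the installed RAM would be expected to spend less time on temporary files.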

In processing these big data files, the number of times the data has to be sorted different ways prior to analysis seems amazing. And no matter how careful one is, and how often testing is done on small subsets, it always seems that the main data gets put through the whole series of programs in the cycle lots more times than was ever intended.

Computer scientists seem to have been working on sorting algorithms that take advantage of more than one CPU for at least 20 years now, and seem to provide improvements that scale well with additional CPUs. Given dual core is quite common now amongst new computers in the general market, and quad-core is not hard to obtain, the hardware is well and truly available to take advantage. So sorting would definitely benefit from the available algorithms that work well with 2+ CPUs.
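
As a rough illustration of the idea, the following Python sketch splits the data into chunks, sorts each chunk in a separate worker process, and then merges the sorted chunks. This is one simple approach among many and is an assumption for the sake of example, not the algorithm used by SPSS or by any particular paper; the names `split_into_chunks` and `parallel_sort` are hypothetical.

```python
import heapq
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def split_into_chunks(data, n_chunks):
    """Split a list into at most n_chunks contiguous pieces of similar size."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def parallel_sort(data, workers=2, executor_cls=ProcessPoolExecutor):
    """Sort each chunk in its own worker, then merge the sorted chunks.

    executor_cls is a hypothetical knob for this sketch: processes give real
    multi-core parallelism, while ThreadPoolExecutor is handy for quick checks.
    """
    chunks = split_into_chunks(data, workers)
    with executor_cls(max_workers=workers) as pool:
        sorted_chunks = list(pool.map(sorted, chunks))
    # The final merge is sequential; only the chunk sorts run in parallel.
    return list(heapq.merge(*sorted_chunks))


if __name__ == "__main__":
    print(parallel_sort([9, 4, 7, 1, 8, 2, 6, 3], workers=2))
```

Note that the merge step still runs on a single core, and the data must be copied to the worker processes, so the gain from extra cores is less than proportional. If the sort is limited by disk I/O rather than CPU, as discussed above, extra cores would help little.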

>For other manipulations, it's rare a transformation program is >CPU-bound; and faster, or dual, CPUs aren't likely to help with one >that isn't.

Yes, I/O seems to be the major bottleneck, but there's nothing like more RAM for curing that - if only the system will make use of it!

An awful lot of work of a statistical nature isn't involved with processing and restructuring large volumes of data. The stuff I'm aware of at university research labs is rarely data-intensive and wouldn't much be affected by how much RAM is available or how many processors there were. So I guess most of the people on this list could not care less about efficiency of memory use or sophistication of sorting algorithms.

The stuff I have been banging on about concerns a very different type of work in a different type of organization. In the context in which I work, the projects being talked about are starting to reach a point where the capabilities of the hardware and software are a much more important consideration in the feasibility of the project than they have been. Previously, if things were taking too long, it was because you had an old computer, and buying a new one would fix it. If the software doesn't change soon, new hardware is no longer going to cure it.

Anyway, thanks for your thoughtful observations

Regards

Adrian Barnett

