| Date: | Mon, 15 Jul 1996 15:21:16 GMT |
| Reply-To: | braner@walden.snr.uvm.edu |
| Sender: | "SPSSX(r) Discussion" <SPSSX-L@UGA.CC.UGA.EDU> |
| From: | Moshe Braner <braner@EMBA.UVM.EDU> |
| Organization: | EMBA Computer Facility, The University of Vermont |
| Subject: | Re: Large memory use in SORT and a few other gripes |
|---|
Adrian Barnett (adrianb@dove.mtx.net.au) wrote:
: The file is a plain ASCII raw data file of 2.4meg - not a compressed
: system file so it doesn't 'grow'.
: ...
: The file to be sorted contains 14,183 cases of 1,664 bytes each.
: 14,614,592 bytes of memory are available to the sort.
: 18,008 bytes is the minimum in which the sort will run.
: 26,730,568 bytes would suffice for an in-memory sort.
First of all, I'd say that sorting such a large file in 2 minutes
on a PC is a feat that just a few years ago would have been
unthinkable. So the glass is half full...
The file _did_ grow: 14,183 cases of 1,664 bytes each is
23,600,512 bytes. The 26,730,568 bytes that SPSS said "would
suffice for an in-memory sort" is only about 13% more than that.
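
The arithmetic can be checked with a few lines of Python (used here only as a calculator; it is not SPSS syntax). The case count, case width, and in-memory figure come from the quoted SPSS message. The last line assumes, purely for illustration, that every byte of a case belongs to an 8-byte numeric variable, which gives an upper bound on how many such variables a case could hold.

```python
# Figures reported by SPSS in the quoted message.
cases = 14_183
bytes_per_case = 1_664
in_memory_estimate = 26_730_568

# Size of the data as SPSS holds it for the sort.
working_size = cases * bytes_per_case

# Extra space SPSS asks for beyond the data itself.
overhead = in_memory_estimate - working_size
overhead_fraction = overhead / working_size

# Assumption for illustration: the whole case is 8-byte numerics.
max_numeric_vars = bytes_per_case // 8

print(working_size, overhead, round(overhead_fraction, 3), max_numeric_vars)
```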
How did a 2.4-meg raw ASCII file turn into a 23-meg SPSS file?
My guess is that the file has a lot of variables that are
small integers, kept in 1 or 2 digits in the ASCII file.
But inside SPSS, all numeric variables are stored as
8-byte floating point numbers. (In "compressed" disk files
they are made somewhat smaller, but for an "in-memory" sort
the whole 8 bytes are needed for each number.) In theory,
SPSS could gain efficiency (in time and space) by having an
integer data type distinct from floating point. As it is,
one way to sort this file faster and in less space is to
leave those many integer variables as one long string variable
for the purpose of the sort. E.g., suppose that the raw file
has lines like this:
KEY V1 V2 V3 ...
876 1 34 2 ...
and we want to sort by the KEY. If we read the ASCII file into
SPSS parsing it into the variables KEY, V1, V2, V3, ... it becomes
large. We can read it like this instead:
data list ... /KEY 1-3 REST 5-80 (A).
then sort by KEY, and _then_ parse the REST. You can parse it with
a series of COMPUTE ... = SUBSTR(...) commands, or you can WRITE the
sorted file to a new ASCII file (still relatively small) and
then read that in with the DATA LIST command originally intended.
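
Here is a minimal sketch of the same idea in Python, standing in for SPSS. The sample lines, the column positions, and the helper names split_key and parse_rest are made up for illustration. Each record is held as a key plus one unparsed string while sorting, and the string is split into numbers only afterwards. The last lines compare the width of one unparsed REST string with what the same fields would take as 8-byte floating point values.

```python
import struct

# Hypothetical raw lines: a 3-digit KEY, then small integers in 2-column fields.
raw_lines = [
    "876  1 34  2",
    "123  7  5 61",
    "450  3 12  9",
]

def split_key(line):
    """Return the numeric sort key and the rest of the line as one string."""
    return int(line[0:3]), line[4:]

def parse_rest(rest):
    """Split the unparsed string into its integer fields."""
    return [int(token) for token in rest.split()]

# Sort while the bulk of each record is still a compact string.
records = [split_key(line) for line in raw_lines]
records.sort(key=lambda record: record[0])

# Only now parse the remaining fields.
parsed = [(key, parse_rest(rest)) for key, rest in records]

# Width of one record's fields as text versus as 8-byte doubles.
rest = records[0][1]
as_text = len(rest)
as_doubles = struct.calcsize("<d") * len(parse_rest(rest))
print(parsed, as_text, as_doubles)
```

With three 2-column fields the difference is small, but it scales with the number of variables, which is the effect described above for a 1,664-byte case.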
--
Moshe Braner
<Moshe.Braner@uvm.edu>
47 McGee Road, Essex Junction, VT 05452 USA