Date: Mon, 20 May 1996 10:13:53 +0100
Reply-To: R.A.Reese@UCC.HULL.AC.UK
Sender: "SPSSX(r) Discussion" <SPSSX-L@UGA.CC.UGA.EDU>
From: "R. Allan Reese" <R.A.Reese@UCC.HULL.AC.UK>
Subject: Re: Sorting Large File :( -> ;-)
In-Reply-To: <199605180143.CAA02764@listserv.rl.ac.uk>

> >Tom Gift wrote:
> >I am using SPSS 6.1.3 and am trying to sort a
> >data file that has 1.8 million cases and is 321 MB
> >in size. That might seem like a ridiculously large
> >file to be manipulating on a PC, but I have 574 MB
> >free on the hard drive.
On Fri, 17 May 1996, Mike Palij wrote:
> ... one reason why it's better to sort a file of this
> size on a mainframe.
Mike Palij has a very understanding "mainframe" manager! One of the
reasons PCs have grown in popularity is that they can apparently
provide unlimited filestore - 1 GB currently seems to be the basic size
and it's not vastly expensive to install, say, 2 x 2 GB drives on your
desk. In connection with A/V work, I recently found that 9 GB drives were
the basic unit, and these could be stacked. Compare that with running
similar machines as network fileservers: we have about 10 GB, but it is
shared between thousands of users.
Going back to Tom Gift's problem, I'll quote a radio programme this week
that discussed why software firms insist on hiring guys hardly out of
nappies (diapers): "experience is being discarded all the time."
The problem as stated is a 1 GB drive containing at present a 321 MB "data"
file. Is this raw data or an SPSS system file? If it's raw data - so
that the job consists of reading, sorting and storing - one problem is
that SPSS expands values so they all fit standard boxes. A single digit
variable (1 byte of input) becomes a full-sized real (8 bytes?). Raw
data typically becomes several times larger as a system file. The
trade-off is that it's far faster to read. If you are only marginally
short of space, RTFM under the heading "system file compression".
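
To put rough numbers on that expansion, here is a back-of-envelope sketch in Python. Only the case count and the file size come from Tom's post. The 8 bytes per stored value and the average field widths are assumptions for illustration, and the estimate ignores line terminators, string variables and any file header.

```python
# Rough estimate of how a raw file grows as an uncompressed system file.
# Only CASES and RAW_BYTES come from the original post; the rest is assumed.

CASES = 1_800_000            # from the original post
RAW_BYTES = 321 * 1024**2    # 321 MB, taking 1 MB = 2**20 bytes
BYTES_PER_VALUE = 8          # assumed size of one stored numeric value


def system_file_bytes(cases, n_variables, bytes_per_value=BYTES_PER_VALUE):
    """Approximate uncompressed size: one fixed-width slot per value."""
    return cases * n_variables * bytes_per_value


raw_bytes_per_case = RAW_BYTES / CASES

# Hypothetical layouts: the same raw record read as fields of
# different average widths (in characters).
for avg_field_width in (1, 2, 4, 8):
    n_vars = int(raw_bytes_per_case // avg_field_width)
    size_mb = system_file_bytes(CASES, n_vars) / 1024**2
    print(f"avg width {avg_field_width}: {n_vars} variables, "
          f"about {size_mb:.0f} MB uncompressed")
```

The narrower the raw fields, the worse the expansion, which is why a file of mostly single-digit codes is the worst case for disk space.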
Next bit of experience is to ask if the whole datafile is needed. Is the
sorting part of an analysis that will use only some variables? Can you
take a subset of variables and sort that file? If you are creating new
variables - for example, combining several variables to form a single
sort key - make sure you drop any redundant ones before sorting.
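
The same idea as a minimal Python sketch, not SPSS syntax: keep only the columns the analysis needs, then sort the slimmed-down records. The column names and the tiny inline dataset are hypothetical.

```python
import csv
import io


def keep_columns(rows, wanted):
    """Yield each row (a dict) reduced to the wanted columns only."""
    for row in rows:
        yield {name: row[name] for name in wanted}


# Hypothetical data: 'region' and 'id' form the sort key, 'score' is
# needed for the analysis, and 'notes' is dead weight for this job.
raw = io.StringIO(
    "region,id,score,notes\n"
    "N,2,7,long text\n"
    "S,1,5,more text\n"
    "N,1,9,even more\n"
)

slim = list(keep_columns(csv.DictReader(raw), ["region", "id", "score"]))
# Sort on the combined key; 'id' is compared as a number, not as text.
slim.sort(key=lambda r: (r["region"], int(r["id"])))
```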
Finally, what did we do in prehistoric times, before the answer to any
problem of size became "go buy a bigger computer"? The solution was
"divide and rule", a technique scorned by computer users but widely
adopted by management and politicians. In this example it may be a tad
tedious, but compared with what you could do even 5 years ago, this is a
helluva job on a desktop. Split your datafile into sections, say 200,000
cases in each (i.e., casenum 1-200000, 200001-400000, ...), sort each
section separately, then use MERGE (in syntax terms, ADD FILES with a BY
key) to join all the sections into one big sorted file.
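
For anyone who wants to see the split, sort and merge steps spelled out, here is a sketch in Python rather than SPSS. The function names and the demo data are hypothetical, and it assumes each record is one newline-terminated text line that sorts correctly as text (or with the `key` you supply).

```python
import heapq
import os
import tempfile


def sort_sections(lines, section_size, key=None, workdir=None):
    """Sort the input one section at a time and write each to a temp file.

    Only `section_size` lines are held in memory at once.
    Returns the list of section file paths, in the order written.
    """
    paths = []
    section = []

    def flush():
        section.sort(key=key)
        fd, path = tempfile.mkstemp(suffix=".section", dir=workdir, text=True)
        with os.fdopen(fd, "w") as out:
            out.writelines(section)
        paths.append(path)
        section.clear()

    for line in lines:
        section.append(line)
        if len(section) >= section_size:
            flush()
    if section:          # last, possibly short, section
        flush()
    return paths


def merge_sections(paths, key=None):
    """Yield lines from already-sorted section files in overall sorted order."""
    files = [open(p) for p in paths]
    try:
        # heapq.merge reads lazily, so only one line per file is in memory.
        yield from heapq.merge(*files, key=key)
    finally:
        for f in files:
            f.close()


# Small demonstration with zero-padded keys so that text order = numeric order.
demo_records = [f"{n:07d}\n" for n in (5, 3, 9, 1, 8, 2, 7)]
demo_paths = sort_sections(demo_records, section_size=3)
demo_sorted = list(merge_sections(demo_paths))
for p in demo_paths:     # tidy up the temporary section files
    os.remove(p)
print("".join(demo_sorted), end="")
```

The merge step needs roughly as much free disk again as the sections occupy, so the section files want deleting only after the merged file has been written and checked.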
And do make backups at each stage, so you can recover without having to
rerun every step.
R. Allan Reese                         Email: r.a.reese@ucc.hull.ac.uk
Head of Applications, Computer Centre  Direct voice: +44 1482 465296
Hull University                        Voice messages: +44 1482 465685
Hull HU6 7RX, U.K.                     Fax: +44 1482 466441