LISTSERV at the University of Georgia
Date:   Mon, 20 May 1996 10:13:53 +0100
Reply-To:   R.A.Reese@UCC.HULL.AC.UK
Sender:   "SPSSX(r) Discussion" <SPSSX-L@UGA.CC.UGA.EDU>
From:   "R. Allan Reese" <R.A.Reese@UCC.HULL.AC.UK>
Subject:   Re: Sorting Large File :( -> ;-)
Comments:   cc: cudl list <cudl@hull.ac.uk>
In-Reply-To:   <199605180143.CAA02764@listserv.rl.ac.uk>

> >Tom Gift wrote:

> >I am using SPSS 6.1.3 and am trying to sort a
> >data file that has 1.8 million cases and is 321 MB
> >in size. That might seem like a ridiculously large
> >file to be manipulating on a PC, but I have 574 MB
> >free on the hard drive.

On Fri, 17 May 1996, Mike Palij wrote:

> ... one reason why its better to sort a file of this
> size on a mainframe.

Mike Palij has a very understanding "mainframe" manager! One of the reasons PCs have grown in popularity is that they can apparently provide unlimited filestore: 1Gb currently seems to be the basic size and it's not vastly expensive to install, say, 2 x 2Gb drives on your desk. In connection with A/V work, I recently found that 9Gb drives were the basic unit, and these could be stacked. Compare that with running similar machines as network fileservers: we have about 10Gb, but it is shared between thousands of users.

Going back to Tom Gift's problem, I'll quote a radio programme from this week that discussed why software firms insist on hiring guys hardly out of nappies (diapers): "experience is being discarded all the time."

The problem as stated is a 1Gb drive containing at present a 321Mb "data" file. Is this raw data or an SPSS system file? If it's raw data - so that the job consists of reading, sorting and storing - one problem is that SPSS expands values so they all fit standard boxes. A single digit variable (1 byte of input) becomes a full-sized real (8 bytes?). Raw data typically becomes several times larger as a system file. The trade-off is that it's far faster to read. If that's a marginal problem, RTFM under the heading "system file compression".
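To put rough numbers on that expansion, here is a back-of-envelope sketch in Python. The case count and raw file size come from the original question; the 8 bytes per value is the guess made above, and the variable count of 100 is purely hypothetical, since the post does not say how many variables the file holds. It also assumes no system file compression.

```python
# Back-of-envelope estimate of an uncompressed system file's size.
# Assumption: every value is stored as one 8-byte real.

CASES = 1_800_000            # from the original question
RAW_BYTES = 321 * 10**6      # 321 MB raw file, taking 1 MB as 10**6 bytes


def system_file_bytes(cases, n_variables, bytes_per_value=8):
    """Estimated size in bytes if each value occupies a fixed-width box."""
    return cases * n_variables * bytes_per_value


# Average width of one raw record, in bytes (roughly 178).
raw_bytes_per_case = RAW_BYTES / CASES

# Hypothetical layout: 100 variables per case (not stated in the post).
estimate = system_file_bytes(CASES, n_variables=100)

print(f"raw record: about {raw_bytes_per_case:.0f} bytes per case")
print(f"estimated system file: {estimate / 10**6:.0f} MB")
```

Under those assumptions the system file alone would exceed the 574 MB reported free, before the sort needs any working space of its own.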

Next bit of experience is to ask if the whole datafile is needed. Is the sorting part of an analysis that will use only some variables? Can you take a subset of variables and sort that file? If you are creating new variables - for example, combining several variables to form a single sort key - make sure you drop any redundant ones before sorting.

Finally, what did we do in prehistoric times, before the answer to any problem of size became "go buy a bigger computer"? The solution was "divide and rule", a technique scorned by computer users but widely adopted by management and politicians. In this example it may be a tad tedious, but compared with what you could do even 5 years ago, this is a helluva job on a desktop. Split your datafile into sections, say 200,000 cases in each (i.e., casenum 1-200000, 200001-400000, ...), sort each section separately, then use MERGE to join all the sections into one big sorted file.
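The split, sort and merge steps can be sketched outside SPSS as follows. This is a minimal illustration in Python, not SPSS syntax, and it assumes a plain text file with one case per line, every line ending in a newline. The function name sort_in_sections and its parameters are made up for the example.

```python
import heapq
import itertools
import os
import tempfile


def sort_in_sections(in_path, out_path, key, section_size=200_000):
    """Sort a line-per-case text file by sorting sections, then merging.

    Assumes every line, including the last, ends with a newline.
    """
    section_paths = []

    # Step 1: read one section at a time, sort it in memory, save it.
    with open(in_path) as src:
        while True:
            section = list(itertools.islice(src, section_size))
            if not section:
                break
            section.sort(key=key)
            fd, path = tempfile.mkstemp(suffix=".section", text=True)
            with os.fdopen(fd, "w") as out:
                out.writelines(section)
            section_paths.append(path)

    # Step 2: merge the sorted sections, holding one line of each in memory.
    files = [open(path) for path in section_paths]
    try:
        with open(out_path, "w") as out:
            out.writelines(heapq.merge(*files, key=key))
    finally:
        for f in files:
            f.close()
        # The sketch deletes the section files when it finishes.
        for path in section_paths:
            os.remove(path)
```

A call such as sort_in_sections("cases.txt", "sorted.txt", key=lambda line: int(line.split()[0])) would sort on a numeric first field; both file names are placeholders. Only one section is held in memory at a time, but the disk must hold the original, the sections and the merged output together, which is the real constraint on a nearly full drive. To keep the sections as restart points, drop the os.remove step.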

And do make backups at each stage, so you can recover without having to rerun every step.

R. Allan Reese                         Email: r.a.reese@ucc.hull.ac.uk
Head of Applications, Computer Centre  Direct voice:   +44 1482 465296
Hull University                        Voice messages: +44 1482 465685
Hull HU6 7RX, U.K.                     Fax:            +44 1482 466441

