Date: Fri, 20 Jun 2003 10:31:10 -0700
Reply-To: cassell.david@EPAMAIL.EPA.GOV
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "David L. Cassell" <cassell.david@EPAMAIL.EPA.GOV>
Subject: Re: How to speed up a SAS process?
Content-type: text/plain; charset=iso-8859-1
SAS User <sasuser@GUILDENSTERN.DYNDNS.ORG> sagely replied:
> > on Thu, Jun 19, 2003 at 05:22:54PM -0700, Annie Chang
> > (chang5a@YAHOO.COM) wrote:
> > When I deal with large datasets, say 10 million observations of 200
> > variables, and try to sort them, sometimes it works reasonably well
> > (I used the "tagsort" option), but sometimes, for the same datasets,
> > it takes forever to run and never finishes. I was puzzled, since
> > the CPU doesn't look busy at all and SAS somehow just decided to
> > take a break in this kind of situation.
> >
> > Is there something I could do other than get a better computer? (I do
> > have one with a very large HD and 500 MB, though.)
> Don't sort your data.
>
> Revisit your processing. Identify what you need to do with the dataset.
> SAS offers a number of tools (CLASS processing, KEEP & DROP statements,
> WHERE= dataset options) which can reduce the data load or eliminate the
> need for sorts. Array processing is another trick (there are examples in
> the SAS-L archives of people creating arrays with tens of millions of
> elements). Unconventional thinking can have impressive payoffs.
> Googling "sas efficiency" will provide a number of references.
>
> Tagsort in particular is optimized for "fat" datasets -- fewer rows,
> many columns.
Exactly. The eponymous "SAS User" knows of what s?he speaks.
If you're hitting a boundary like this, then you may be in a situation where
you have just barely enough work room on your hard drive for sorting the file.
SAS likes to have free work space of roughly 3 times the size of the file for
optimal sorting. And the work of the sorting is done primarily as reads and
writes on that drive. So you would expect to see minimal CPU usage while the
disk drives struggle mightily to cope with writing and re-writing data.
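To put a rough number on that, here is a back-of-the-envelope sketch in Python (not SAS). It assumes every one of the 200 variables is an 8-byte numeric and that the dataset is uncompressed. Neither assumption comes from the original question, and the function names are only illustrative.

```python
# Rough estimate of the work space a sort might want.
# Assumptions (not from the original post): every variable is an
# 8-byte numeric and the dataset is stored uncompressed.

def dataset_bytes(n_obs: int, n_vars: int, bytes_per_var: int = 8) -> int:
    """Approximate uncompressed size of the dataset in bytes."""
    return n_obs * n_vars * bytes_per_var

def sort_workspace_bytes(size_bytes: int, factor: float = 3.0) -> float:
    """Work space suggested by the 'roughly 3 times the file' rule of thumb."""
    return size_bytes * factor

size = dataset_bytes(10_000_000, 200)   # 10 million obs, 200 variables
need = sort_workspace_bytes(size)

print(f"dataset    ~ {size / 1e9:.0f} GB")
print(f"work space ~ {need / 1e9:.0f} GB")
```

Under those assumptions the file is about 16 GB, so the rule of thumb asks for something like 48 GB of free work space. Character variables or shorter numeric lengths would change the figure.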
If you really need to do complex sorting, may I suggest my paper:
    SUGI 26, Paper 121-26: A Sort of a Mess -- Sorting Large Datasets on
    Multiple Keys
    David L. Cassell
    http://www2.sas.com/proceedings/sugi26/p121-26.pdf
Alternatively, I really recommend that you consider re-structuring your
entire process. If you find yourself sorting and re-sorting your data, then
don't. Look at indexing instead. If you find yourself sorting in order to
pull out small pieces based on some combination of variables, then don't.
Look at indexing, or DATA step programming, as an alternative. There are
tons of ways out of this bind if you have the time to sit and think. I wrote
the above paper when I was faced with sorting gigabytes of data and adding
random keys, then re-sorting, and... So, when faced with doing 13 consecutive
sort-and-DATA-step processes, I was forced to take the old code and revisit
the process. It now runs as one PROC DATASETS (to build an index), one
DATA step, and then another indexing step. No sorting.
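For anyone who wants the idea behind "index instead of sort" spelled out, here is a small conceptual sketch. It is written in Python rather than SAS, it is only an analogy for what an index buys you, and it is not the code from my process. The toy rows and the names build_index and lookup are made up for illustration.

```python
from collections import defaultdict

# Toy rows standing in for a large, unsorted dataset (hypothetical data).
rows = [
    {"region": "NW", "year": 2002, "value": 1.5},
    {"region": "SE", "year": 2001, "value": 2.0},
    {"region": "NW", "year": 2001, "value": 0.7},
    {"region": "NW", "year": 2002, "value": 3.1},
]

def build_index(rows, keys):
    """Map each key combination to the positions of the rows that have it."""
    index = defaultdict(list)
    for pos, row in enumerate(rows):
        index[tuple(row[k] for k in keys)].append(pos)
    return index

def lookup(rows, index, key):
    """Fetch the rows for one key combination; the data is never reordered."""
    return [rows[pos] for pos in index.get(key, [])]

# One pass over the data builds the index.
idx = build_index(rows, ("region", "year"))

# Pull out a small piece by a combination of variables, with no sort.
subset = lookup(rows, idx, ("NW", 2002))
print([r["value"] for r in subset])
```

The data is read once to build the index and never rewritten in a new order, which is where the savings over repeated sorting come from.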
HTH,
David
--
David Cassell, CSC
Cassell.David@epa.gov
Senior computing specialist
mathematical statistician