Date: Tue, 31 Aug 2004 13:44:59 -0700
Reply-To: "Choate, Paul@DDS" <pchoate@DDS.CA.GOV>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "Choate, Paul@DDS" <pchoate@DDS.CA.GOV>
Subject: Re: Conserving cpu & real time in datasteps involving large datasets
WHERE clauses can beat subsetting IFs, but not always. SQL can beat sort-and-merge,
but not always.
You want to subset your data when you read it, as advised; the earlier the
better. I always recommend using any prior knowledge about your data, i.e.,
make your largest cuts first. (If you want all 27-year-old males in a city,
then cutting on age before gender is much more efficient.)
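A minimal sketch of the "largest cuts first" idea with subsetting IFs (the dataset and variable names here are made up for illustration):

```sas
/* Assume city.people has an AGE and a GENDER variable. */
data males27;
  set city.people;
  if age = 27;     /* biggest cut first: discards most observations */
  if gender = 'M'; /* smaller cut runs on far fewer rows */
run;
```

Because the first IF deletes the bulk of the observations, the second test only executes for the rows that survive.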
I/O time is usually the killer, so drop anything that isn't necessary and
touch the data no more often than necessary. Use a conditional STOP statement
if you aren't reading the whole data set.
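A sketch of the conditional STOP, assuming (hypothetically) you only need the first 1000 California rows from perm.dat:

```sas
data first_ca;
  set perm.dat;
  where state = 'CA';
  output;
  if _n_ >= 1000 then stop; /* quit reading once we have enough rows */
run;
```

Without the STOP, the step would keep scanning the remaining observations even though no more output is wanted.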
Test variations of your code before you choose; look into the obs= and nobs=
options and the _N_ automatic counter.
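Two small examples of those options (dataset names are hypothetical):

```sas
/* obs= as a dataset option: read only the first 100 observations */
data sample;
  set perm.dat (obs=100);
run;

/* nobs= on SET is filled in at compile time, so you can get the row
   count without ever reading the data */
data _null_;
  if 0 then set perm.dat nobs=total;
  put 'Observations in perm.dat: ' total;
  stop;
run;
```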
There is also "run cancel;" rather than "run;" to compile and syntax-check your code without executing it.
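For example, to check a data step compiles cleanly against a large input without actually processing it:

```sas
data dat;
  set perm.dat (keep = var1-var13);
run cancel; /* step is compiled but not executed */
```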
DDS Data Extraction
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of Dennis
Sent: Tuesday, August 31, 2004 4:17 AM
Subject: Re: Conserving cpu & real time in datasteps involving large datasets
The second data step (with the KEEP as an option on the input dataset) is the
one you want.
WHERE definitely is more efficient than a subsetting IF. It shouldn't matter
whether it is an option on the input dataset or a separate statement.
You can use both dataset options together (including in a PROC SQL).
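A sketch of combining the two dataset options, plus a PROC SQL equivalent (variable names follow the example below; the WHERE= condition on var1 is invented for illustration):

```sas
/* KEEP= and WHERE= together as dataset options on the input */
data dat;
  set perm.dat (keep = var1-var13 where = (var1 > 0));
run;

/* The same dataset options work inside PROC SQL */
proc sql;
  create table dat2 as
  select var1, var2, var3
  from perm.dat (keep = var1-var3)
  where var1 > 0;
quit;
```

Note that a WHERE= dataset option can only reference variables that survive the KEEP=.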
Keith Dunnigan <dunnigan_k@YAHOO.COM> wrote:
I don't usually work with such large datasets that time is an issue,
but I am presently working on a project that deals with hundreds of
millions of wide records, hence time is important.
Any advice on how to run a few basic datasteps, merges, etc more time
efficiently is appreciated.
For instance, take the case of reading data from a large permanent dataset
into a temporary one. Let's say we have a 100-million-observation permanent
dataset, call it perm.dat. Let's say it has one thousand variables, call them
var1, var2, ..., var1000. If I want to read only 13 variables into a work
dataset, what's the quickest way to do that? Possibly one of the following:
data dat (keep = var1-var13);
  set perm.dat;
run;

data dat;
  set perm.dat (keep = var1-var13);
run;
... Or are there others? Also, would using a PROC SQL statement be quicker
than using a data step? If so, what form would work the best?
On a similar note, if I want to read in only a subset of the observations, I
take it a WHERE statement works quicker than an IF statement. Where should it
be placed (again, on the DATA line, the SET line, or below)?
Similar comments on match merges would be welcomed also.
Alternatively, if there is a section online or in the standard SAS
documentation that deals with this, perhaps you could refer me to it.
Many thanks in advance!
Systems Seminar Consultants