LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (August 2004, week 5)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Tue, 31 Aug 2004 13:44:59 -0700
Reply-To:     "Choate, Paul@DDS" <pchoate@DDS.CA.GOV>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         "Choate, Paul@DDS" <pchoate@DDS.CA.GOV>
Subject:      Re: Conserving cpu & real time in datasteps involving large datas
              ets
Comments: To: "dunnigan_k@YAHOO.COM" <dunnigan_k@YAHOO.COM>

Keith -

Where clauses can beat subsetting if's, but do not always. SQL can beat sort & merges, but not always.

You want to subset your data when you read it, as advised, the earlier the better. I always recommend using any prior knowledge about your data, i.e. always make your largest cuts first. (If you want all 27 year old males in a city then choosing the age before gender is much more efficient.)

IO time is usually a killer, so subset anything that isn't necessary and only touch the data as often is necessary. Use a conditional stop command if you aren't reading the whole data set.

Test variations of your code before you choose ... look into obs= and nobs= and the _n_ counter.

There is also "Run cancel;" rather than "Run;" to run your code sans data.

hth

Paul Choate DDS Data Extraction (916) 654-2160

-----Original Message----- From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of Dennis Diskin Sent: Tuesday, August 31, 2004 4:17 AM To: SAS-L@LISTSERV.UGA.EDU Subject: Re: Conserving cpu & real time in datasteps involving large datasets

Keith,

The second data step (with the KEEP as an option on the input dataset is the one you want.

WHERE definitly is more efficient than a subsetting IF. It shouldn't matter if it an option on the input dataset or a separate statement.

You can use both dataset options together (including in a PROC SQL).

HTH, Dennis Diskin

Keith Dunnigan <dunnigan_k@YAHOO.COM> wrote: Hi all,

I don't usually work with such large datasets that time is an issue, but I am presently working on a project that deals with hundreds of millions of wide records, hence time is important.

Any advice on how to run a few basic datasteps, merges, etc more time efficiently is appreciated.

For instance, in the case of the reading in of data from a large permanent dataset into a temporary one. Let's say we have a 100 million observation permanent dataset, call it perm.dat. Let's say it has one thousand variables, call them var1, var2, ..., var1000. If I want to only read in 13 variables into a work dataset, what's the quickest way to do that? Possibly one of the following:

Data dat; set temp.dat; keep var1-var13; run;

Data dat; set temp.dat (keep = var1-var13); run;

Data dat(keep = var1-var13); set temp.dat; run;

... Or are there others? Also would using a proc sql statement be quicker than using a data statment? If so, what form would work the quickest?

On a similar take, if I want to read in only a subset of the observations, I take it a 'where' statement works quicker than an 'if' statement. Where should it be placed (again, in the data line, the set line, or below?).

Similar comments on match merges would be welcomed also. Alternately, if there is a section online are in the common sas documentation that deals with this, perhaps you could refer me to it.

Many thanks in advance!

Keith Dunnigan Consulting Statistician Systems Seminar Consultants

__________________________________ Do you Yahoo!? New and Improved Yahoo! Mail - 100MB free storage! http://promotions.yahoo.com/new_mail


Back to: Top of message | Previous page | Main SAS-L page