Date: Mon, 13 Dec 2004 15:58:58 -0700
Reply-To: Michael Murff <mjm33@MSM1.BYU.EDU>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Michael Murff <mjm33@MSM1.BYU.EDU>
Subject: Re: Proc Summary vs. Means run times
Hi Mike,
How does one become privy to what SAS does behind the scenes? Have they
revealed some of their source code, presumably written in C? I thought
they kept it under very tight lock and key because of competitors like
SPSS and Stata. My understanding is that the procs are pre-compiled
binaries, and that DATA step code is sort of translated down to C syntax.
Could you elaborate, or refer me to other sources (papers) that have
more information on what goes on "under the hood" when a SAS proc or
DATA step code is submitted?
Thanks,
Michael Murff
PS--Perhaps I should relist this under a new topic, but I'll have to
consult the said SAS etiquette paper, to be sure about that :)
>>> Mike Rhoads <RHOADSM1@WESTAT.COM> 12/13/2004 3:46:44 PM >>>
Dave,
Welcome to the group!
Actually, PROC MEANS and PROC SUMMARY run exactly the same code behind the
scenes. There are a couple of very minor differences, mainly that by
default PROC MEANS produces printed output and PROC SUMMARY does not.
So I suspect the differences you are seeing in output format and execution
time are because you are using a BY statement in your PROC MEANS vs. a CLASS
statement in PROC SUMMARY. Try using the same statement in both, and you
should get identical output and near-identical run times.
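
A minimal sketch of that like-for-like comparison, reusing the data set and
variable names from the log quoted below (visit_sum, member_no, day_diff).
The output data set names diffs_summary and diffs_means are illustrative
only and do not come from the original post.

```sas
/* Sketch only: the same CLASS statement in both procedures, so that  */
/* any remaining difference comes from the procedure, not from BY vs. */
/* CLASS. Output data set names are hypothetical.                     */

proc summary data=visit_sum missing print;  /* PRINT: SUMMARY is silent by default */
  class member_no;
  var day_diff;
  output out=diffs_summary mean=Mean std=STDev;
run;

proc means data=visit_sum missing;          /* MEANS prints by default */
  class member_no;
  var day_diff;
  output out=diffs_means mean=Mean std=STDev;
run;
```

With CLASS, the OUT= data set also carries an overall (_TYPE_=0) row unless
NWAY is specified, which likely accounts for the 13 vs. 12 observations in
the two logs. A BY statement, by contrast, expects the input to be sorted
(or indexed) by member_no.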
Mike Rhoads
Westat
RhoadsM1@Westat.com
-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of David Meyer
Sent: Monday, December 13, 2004 5:32 PM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Proc Summary vs. Means run times
Hi SASLers,
As a new-ish SAS guy, I have been following the SASL discussion as much as
I can and I have been learning a lot (THANKS ALL). As I have been
improving, I have caught the "try to write tighter code" bug from some of
you, and since I am working with large data sets (millions of records each),
reducing run time is a very practical obsession to have.
I have recently discovered Proc Summary and have been playing with it and Proc
Means. I think that I found Summary to take about 35 to 45% of the run time
of Proc Means (plus I like the "class variable crude" summary data line in
Proc Summary, and I like the way the data is displayed in the output window
better than in Means). If all I want is basic summary stats (mean, min, max,
std), should I always be using Summary going forward? Am I making any
assumptions that I should worry about or that are incorrect? Can any of you
suggest places for me to go and read up on these basic statistical Procs?
TIA and thanks for all of your discussion on other topics,
Dave
Below are the code and log results:
625 proc summary data=visit_sum missing print;
626 class member_no;
627 var day_diff ;
628 output out=diffs mean=Mean std=STDev ;
629 run;
NOTE: There were 48 observations read from the dataset WORK.VISIT_SUM.
NOTE: The data set WORK.DIFFS has 13 observations and 5 variables.
NOTE: PROCEDURE SUMMARY used:
      real time           0.62 seconds
      cpu time            0.05 seconds
630
631
632 proc means data=visit_sum missing print;
633 by member_no;
634 var day_diff ;
635 output out=diffs1 mean=Mean std=STDev ;
636 run;
NOTE: There were 48 observations read from the dataset WORK.VISIT_SUM.
NOTE: The data set WORK.DIFFS1 has 12 observations and 5 variables.
NOTE: PROCEDURE MEANS used:
      real time           0.28 seconds
      cpu time            0.03 seconds