| Date: | Tue, 28 Nov 2006 09:27:53 -0500 |
| Reply-To: | "Howard Schreier <hs AT dc-sug DOT org>" <nospam@HOWLES.COM> |
| Sender: | "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU> |
| From: | "Howard Schreier <hs AT dc-sug DOT org>" <nospam@HOWLES.COM> |
| Subject: | Re: Looking for Year to Date Calculation Performance Improvements |
|---|
On Tue, 21 Nov 2006 12:23:35 -0800, Tony Gallegly <tony.gallegly@GMAIL.COM>
wrote:
>SAS-L,
>
>I am looking for a better way from a performance and memory usage
>standpoint to accomplish the following.
>
>I need to calculate a year to date average for each observation. The
>challenge is I have a minimum of 100,000 observations.
>
>Existing Code
>PROC SQL _METHOD BUFFERSIZE=1000000;
> CREATE TABLE AGGREGATE AS
> SELECT A.COST
> ,A.YEAR
> ,A.PERIOD
> ,A.WEEK
> ,A.DAY
> ,AVG(B.SCORE) AS AVG_SCORE
> FROM ZETA A,
> ZETA B
> WHERE A.YEAR = &CURRENT_YEAR
> AND B.YEAR = A.YEAR
> AND B.PERIOD <= A.PERIOD
> AND B.WEEK <= A.WEEK
> AND B.DAY <= A.DAY
> GROUP BY A.COST, A.YEAR, A.PERIOD, A.WEEK, A.DAY;
>QUIT;
>
>Sample Data: The avg_score field is the result of the above sql
>statement.
>COST YEAR PERIOD WEEK DAY SCORE AVG_SCORE
>301020 2006 1 1 1 91 91.00
>301020 2006 1 1 3 100 95.50
>301020 2006 1 1 7 100 97.00
>301020 2006 1 2 2 100 97.75
>301020 2006 1 2 4 100 98.20
>301020 2006 1 2 4 100 98.50
>301020 2006 1 2 6 100 98.71
>301020 2006 1 3 3 100 98.88
>301020 2006 1 3 5 92 98.11
>301020 2006 1 4 2 100 98.30
>301020 2006 1 4 3 100 98.45
>301020 2006 2 1 1 100 98.58
>301020 2006 2 1 3 100 98.69
>301020 2006 2 1 5 100 98.79
>301020 2006 2 2 2 100 98.87
>301020 2006 2 2 2 100 98.94
>301020 2006 2 2 4 100 99.00
>301020 2006 2 3 1 100 99.06
>301020 2006 2 3 1 90 98.58
>301020 2006 2 3 5 100 98.65
>301020 2006 2 4 2 100 98.71
>301020 2006 2 4 4 100 98.77
>301020 2006 2 4 6 92 98.48
>301020 2006 3 1 2 82 97.79
>301020 2006 3 1 2 100 97.88
>
>Thanks in advance,
>Tony
It looks like this code is calculating the same averages over and over again
for each day, once for each COST. That can be avoided.
Test data:
data zeta;
do day = '01sep2005'd to '31oct2006'd;
year = year(day);
do cost = '301020',
'a', 'b', 'c', 'd',
'e', 'f', 'g', 'h',
's', 't', 'u', 'v',
'w', 'x', 'y', 'z';
score = round(10*ranuni(3),1);
output;
end;
end;
format day date9.;
run;
%let current_year=2006;
Notice that I'm using the SAS date convention for DAY and getting rid of
PERIOD and WEEK, which Tony had in his data set but which were never
explained. Actually, I could get rid of YEAR as well, since it's a function
of DAY, but I keep it for convenience.
So here's Tony's code, adapted:
PROC SQL _METHOD BUFFERSIZE=1000000;
CREATE TABLE AGGREGATE AS
SELECT A.COST
,A.YEAR
,A.DAY
,AVG(B.SCORE) AS AVG_SCORE
FROM ZETA A,
ZETA B
WHERE A.YEAR = &CURRENT_YEAR
AND B.YEAR = A.YEAR
AND B.DAY <= A.DAY
GROUP BY A.COST, A.YEAR, A.DAY;
QUIT;
On my system:
real time 2:45.19
cpu time 1:06.90
The alternative is to compute each day-sepcific cumulative average just
once, by keeping COST out of the grouping specification, then doing a
follow-up join to reintroduce COST:
PROC SQL _METHOD BUFFERSIZE=1000000;
create table averages as
SELECT Y_D.YEAR
,Y_D.DAY
,AVG(ZETA.SCORE) AS AVG_SCORE
FROM (select distinct year, day
from zeta
where YEAR = &CURRENT_YEAR) as Y_D,
ZETA
WHERE ZETA.YEAR = &CURRENT_YEAR
AND ZETA.DAY <= Y_D.DAY
GROUP BY Y_D.YEAR, Y_D.DAY;
create table aggregate as
select * from zeta(drop=score) natural join averages;
QUIT;
Log shows:
NOTE: The execution of this query involves performing one
or more Cartesian product joins that can not be optimized.
real time 3.16 seconds
cpu time 2.57 seconds
Much faster.
Also, it's kind of funny that only the faster code raises the issue of the
Cartesian. That just demonstrates that the trigger for that note is pretty
simplistic.
|