Date: Tue, 28 Nov 2006 09:27:53 -0500 "Howard Schreier " "SAS(r) Discussion" "Howard Schreier " Re: Looking for Year to Date Calculation Performance Improvements

On Tue, 21 Nov 2006 12:23:35 -0800, Tony Gallegly <tony.gallegly@GMAIL.COM> wrote:

>SAS-L, > >I am looking for a better way from a performance and memory usage >standpoint to accomplish the following. > >I need to calculate a year to date average for each observation. The >challenge is I have a minimum of 100,000 observations. > >Existing Code >PROC SQL _METHOD BUFFERSIZE=1000000; > CREATE TABLE AGGREGATE AS > SELECT A.COST > ,A.YEAR > ,A.PERIOD > ,A.WEEK > ,A.DAY > ,AVG(B.SCORE) AS AVG_SCORE > FROM ZETA A, > ZETA B > WHERE A.YEAR = &CURRENT_YEAR > AND B.YEAR = A.YEAR > AND B.PERIOD <= A.PERIOD > AND B.WEEK <= A.WEEK > AND B.DAY <= A.DAY > GROUP BY A.COST, A.YEAR, A.PERIOD, A.WEEK, A.DAY; >QUIT; > >Sample Data: The avg_score field is the result of the above sql >statement. >COST YEAR PERIOD WEEK DAY SCORE AVG_SCORE >301020 2006 1 1 1 91 91.00 >301020 2006 1 1 3 100 95.50 >301020 2006 1 1 7 100 97.00 >301020 2006 1 2 2 100 97.75 >301020 2006 1 2 4 100 98.20 >301020 2006 1 2 4 100 98.50 >301020 2006 1 2 6 100 98.71 >301020 2006 1 3 3 100 98.88 >301020 2006 1 3 5 92 98.11 >301020 2006 1 4 2 100 98.30 >301020 2006 1 4 3 100 98.45 >301020 2006 2 1 1 100 98.58 >301020 2006 2 1 3 100 98.69 >301020 2006 2 1 5 100 98.79 >301020 2006 2 2 2 100 98.87 >301020 2006 2 2 2 100 98.94 >301020 2006 2 2 4 100 99.00 >301020 2006 2 3 1 100 99.06 >301020 2006 2 3 1 90 98.58 >301020 2006 2 3 5 100 98.65 >301020 2006 2 4 2 100 98.71 >301020 2006 2 4 4 100 98.77 >301020 2006 2 4 6 92 98.48 >301020 2006 3 1 2 82 97.79 >301020 2006 3 1 2 100 97.88 > >Thanks in advance, >Tony

It looks like this code is calculating the same averages over and over again for each day, once for each COST. That can be avoided.

Test data:

data zeta; do day = '01sep2005'd to '31oct2006'd; year = year(day); do cost = '301020', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z'; score = round(10*ranuni(3),1); output; end; end; format day date9.; run;

%let current_year=2006;

Notice that I'm using the SAS date convention for DAY and getting rid of PERIOD and WEEK, which Tony had in his data set but which were never explained. Actually, I could get rid of YEAR as well, since it's a function of DAY, but I keep it for convenience.

PROC SQL _METHOD BUFFERSIZE=1000000; CREATE TABLE AGGREGATE AS SELECT A.COST ,A.YEAR ,A.DAY ,AVG(B.SCORE) AS AVG_SCORE FROM ZETA A, ZETA B WHERE A.YEAR = &CURRENT_YEAR AND B.YEAR = A.YEAR AND B.DAY <= A.DAY GROUP BY A.COST, A.YEAR, A.DAY; QUIT;

On my system:

real time 2:45.19 cpu time 1:06.90

The alternative is to compute each day-sepcific cumulative average just once, by keeping COST out of the grouping specification, then doing a follow-up join to reintroduce COST:

PROC SQL _METHOD BUFFERSIZE=1000000;

create table averages as SELECT Y_D.YEAR ,Y_D.DAY ,AVG(ZETA.SCORE) AS AVG_SCORE FROM (select distinct year, day from zeta where YEAR = &CURRENT_YEAR) as Y_D, ZETA WHERE ZETA.YEAR = &CURRENT_YEAR AND ZETA.DAY <= Y_D.DAY GROUP BY Y_D.YEAR, Y_D.DAY;

create table aggregate as select * from zeta(drop=score) natural join averages;

QUIT;

Log shows:

NOTE: The execution of this query involves performing one or more Cartesian product joins that can not be optimized.

real time 3.16 seconds cpu time 2.57 seconds

Much faster.

Also, it's kind of funny that only the faster code raises the issue of the Cartesian. That just demonstrates that the trigger for that note is pretty simplistic.

Back to: Top of message | Previous page | Main SAS-L page