Date: Wed, 31 Dec 2008 16:46:14 -0500
Reply-To: Arthur Tabachneck <art297@NETSCAPE.NET>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Arthur Tabachneck <art297@NETSCAPE.NET>
Subject: Re: How to each group last record ?
Based on a test I just ran I'd have to say 'use' rather than 'consider',
especially if one is on a multiple processor system running 9.1.3. I'd be
interested if someone could run the same comparison on a single processor
system, as well as on 9.2 (since I hear that hash includes greater
optimization in 9.2 as compared to 9.1.3).
I ran two sets of tests, one with 19,000,000 records and the other with
190,000,000 million records. The test file was simply:
77 data class (drop=i);
78 set sashelp.class;
79 do i=1 to 1000000;
80 if i=1000000 then lastone=1;
81 else lastone=0;
82 output;
83 end;
84 run;
For Howard's proposed hash solution I ran:
85 data _null_;
86 set class end=lastobs;
87 if _n_ = 1 then do;
88 declare hash lasts(ordered:'ascending');
89 result = lasts.defineKey('age');
90 result = lasts.defineData( 'name'
91 , 'sex'
92 , 'age'
93 , 'height'
94 , 'weight'
95 , 'lastone');
96 result = lasts.defineDone();
97 end;
98 result = lasts.replace( key: age
99 , data:name
100 , data:sex
101 , data:age
102 , data:height
103 , data:weight
104 , data:lastone);
105 if lastobs then rc = lasts.output(dataset: 'lasts');
106 run;
NOTE: The data set WORK.LASTS has 6 observations and 6 variables.
NOTE: There were 19000000 observations read from the data set WORK.CLASS.
NOTE: DATA statement used (Total process time):
real time 32.21 seconds
cpu time 32.15 seconds
For datanull's proc summary solution I ran:
107 proc summary nway missing data=class;
108 class age;
109 output out=work.lastObsInAges(drop=_:)
110 idgroup(last out(name sex height weight lastone)=)
111 ;
112 run;
NOTE: There were 19000000 observations read from the data set WORK.CLASS.
NOTE: The data set WORK.LASTOBSINAGES has 6 observations and 6 variables.
NOTE: PROCEDURE SUMMARY used (Total process time):
real time 11.53 seconds
cpu time 28.98 seconds
Then, running the same tests, but on a test file that outputed each
sashelp.class record 10 million times:
Howard's hash solution:
NOTE: The data set WORK.LASTS has 6 observations and 6 variables.
NOTE: There were 190000000 observations read from the data set WORK.CLASS.
NOTE: DATA statement used (Total process time):
real time 5:27.82
cpu time 5:27.18
Datanull's proc summary solution:
NOTE: There were 190000000 observations read from the data set WORK.CLASS.
NOTE: The data set WORK.LASTOBSINAGES has 6 observations and 6 variables.
NOTE: PROCEDURE SUMMARY used (Total process time):
real time 1:54.59
cpu time 4:47.84
As you can see, the proc summary solution took less cpu time and less real
time in both cases but, more importantly, the differences in real time (on
my system) were quite dramatic.
Art
---------
On Wed, 31 Dec 2008 14:31:12 -0600, ./ ADD NAME=Data _null_,
<iebupdte@GMAIL.COM> wrote:
>On 12/31/08, Howard Schreier <hs AT dc-sug DOT org>
><schreier.junk.mail@gmail.com> wrote:
>> Consider a hash solution:
>
>Consider a PROC SUMMARY solution.
>
>proc summary nway missing data=sashelp.class;
> class age;
> output out=work.lastObsInAges(drop=_type_ _freq_)
> idgroup(last out(_all_)=) / autoname
> ;
> run;