|
Howard,
Per your suggestion I did run such a comparison, although you might want
to modify my test code in case I didn't sufficiently control for all
extraneous sources.
The code and results are shown below.
While I fully realize that some of our list members don't appreciate such
comparisons, I (and I'm sure some others) would be interested in knowing
why the observed differences occurred and if SAS 9.2 produces different
results.
Anyone care to offer explanations?
Art
* Real Time CPU Time Rank;
* --------- -------- ----;
*Create test data;
data practice_dates (drop=i);
do i=1 to 500000 ;
id=round(1000000*ranuni(0));
do date='01JAN1969'd to '10JAN1969'd ;
test++1 ;
if ranuni( 90210 ) le .02 then passed='Y' ;
else passed='N' ;
output ;
end ;
end ;
stop ;
format date mmddyy10.;
run ;
* Trial 1: 3.10 3.06 N/A;
* Trial 2: 3.06 3.04 N/A;
* Trial 3: 3.10 3.04 N/A;
*Sort test data;
proc sort data=work.practice_dates;
by id descending passed;
run;
* Trial 1: 7.73 9.21 N/A;
* Trial 2: 8.37 8.79 N/A;
* Trial 3: 6.85 8.39 N/A;
*Create SASFILE;
sasfile work.practice_dates.data open;
* Trial 1: N/A N/A N/A;
* Trial 2: N/A N/A N/A;
* Trial 3: N/A N/A N/A;
*Join Approach;
proc sql buffersize=1e7 noprint ;
select T1.*
from practice_dates as T1
, ( select distinct id
from practice_dates
where passed='Y' ) as T2
where T1.id=T2.id
;
quit ;
*Trial 1: 0.71 0.71 1;
*Trial 2: 0.71 0.71 1;
*Trial 3: 0.73 0.71 1;
*Data step with double DOW approach;
data _null_;
do until (last.id);
set practice_dates;
by id;
if passed="Y" then passedAny=1;
end;
do until (last.id);
set practice_dates;
by id;
if passedAny;
end;
run;
*Trial 1: 2.34 2.34 2;
*Trial 2: 2.32 2.32 2;
*Trial 3: 2.34 2.34 2;
*Data step approach;
data _null_ (drop=passed_test);
set practice_dates;
retain passed_test;
by id;
if first.id and passed eq 'Y' then passed_test=1;
if passed_test;
run;
*Trial 1: 2.87 2.87 3;
*Trial 2: 2.79 2.79 3;
*Trial 3: 2.81 2.81 3;
*Subquerry approach;
proc sql buffersize=1e7 noprint;
select *
from practice_dates
where id in ( select distinct id
from practice_dates
where passed = 'Y' )
;
quit;
* Trial 1: 3.38 3.37 4;
* Trial 2: 3.38 3.37 4;
* Trial 3: 3.38 3.37 4;
*Data step with two file set approach;
data _null_;
set practice_dates( in=first where=(passed='Y') )
practice_dates( in=second);
by id;
if first.id then flag + first - flag;
if flag and second;
run;
*Trial 1: 3.51 3.51 5;
*Trial 2: 3.51 3.50 5;
*Trial 3: 3.51 3.51 5;
*Data step with merge approach;
data _null_;
merge
practice_dates(keep = id passed rename=(passed=p) where=(p='Y') in=y)
practice_dates;
by id;
if y;
run;
*Trial 1: 5.23 5.23 6 (Tie);
*Trial 2: 5.23 5.23 6;
*Trial 3: 5.23 5.23 6 (Tie);
*Data step with do until approach;
data _null_ (where = ( new in (1,2) )) ;
do until (last.id) ;
set practice_dates ;
by id ;
select(passed);
when('Y') yes=1;
when('N') no=1;
otherwise;
end ;
end ;
if yes and no then new=1 ;
else if yes then new=2 ;
else if no then new=3 ;
do until (last.id) ;
set practice_dates ;
by id ;
end ;
run ;
*Trial 1: 5.23 5.23 6 (Tie);
*Trial 2: 5.25 5.25 7;
*Trial 3: 5.23 5.23 6 (Tie);
*Data step with self-interleave approach;
data _null_;
set practice_dates (in=first)
practice_dates (in=second);
by id;
if first.id then flag = 0;
if first and passed eq 'Y' then flag+1;
else if second and flag gt 0;
run;
*Trial 1: 5.71 5.71 8;
*Trial 2: 5.84 5.84 8;
*Trial 3: 5.71 5.71 8;
*Hash Approach;
data _null_;
if _n_ = 1 then do;
if 0 then set practice_dates;
declare hash h(ordered:'a');
h.definekey('id','count');
h.definedata('id','date','test','passed');
h.definedone();
declare hash hp(ordered:'a');
hp.definekey('id');
hp.definedata('count','flag');
hp.definedone();
end;
do until(eof);
set practice_dates end = eof;
flag = 0;
if hp.find() ne 0 then count = 0;
count ++ 1;
if passed = 'Y' then flag = 1;
hp.replace();
h.add();
end;
declare hiter hi('h');
rc = hi.first();
do while(rc = 0);
hp.find();
do _n_ = 1 to count while(rc = 0);
if flag;
rc = hi.next();
end;
end;
stop;
run;
*Trial 1: 23.29 23.29 9;
*Trial 2: 23.36 23.32 9;
*Trial 3: 23.34 23.34 9;
*Close SASFILE;
sasfile work.practice_dates.data close;
* Trial 1: N/A N/A N/A;
* Trial 2: N/A N/A N/A;
* Trial 3: N/A N/A N/A;
-----------
On Sun, 18 May 2008 13:00:55 -0400, Howard Schreier <hs AT dc-sug DOT org>
<schreier.junk.mail@GMAIL.COM> wrote:
>On Sat, 17 May 2008 19:21:58 -0400, Arthur Tabachneck
<art297@NETSCAPE.NET>
>wrote:
>
>>Muthia,
>>
>>I did compare your proposed solution against Ken's two offerings and,
>>surprisingly (on a 4-processor machine running SAS 9.1.3 sp4 at least),
>>the hash solution resulted in the worst times of the three methods.
>>Additionally, the hash solution took over two and a half times longer
than
>>the combined sort and data step method that I had originally suggested.
>>
>>Interestingly, when I post the results from such tests, I get an equal
>>number of thank you notes and complaints. In that regard, I should
>>mention that my tests were not systematically controlled, although I did
>>run 3 replications of each method.
>>
>>Roughly, on average, I got the following approximate results:
>>
>> Ken's Subquery solution: 24 seconds
>> Ken's Join solution: 4.5 seconds
>> Your hash solution: 27 seconds
>> My sort+data step solution: 10 seconds
>>
>>Like Ken, I too would be interested in seeing how the four methods
compare
>>on SAS 9.2.
>>
>>Art
>>
>>p.s. In case anyone with 9.2 cares to try it, here is the code I used:
>
>I suggest eliminating the disk I-O at the beginning and the end by (1)
>loading the test data into SASFILE and (2) counting the rows meeting the
>condition instead of storing or displaying them.
>
>>
>>*Create test data;
>>data practice_dates ;
>> do id=1 to 500000 ;
>> do date='01JAN1969'd to '10JAN1969'd ;
>> test++1 ;
>> if ranuni( 90210 ) le .02 then passed='Y' ;
>> else passed='N' ;
>> output ;
>> end ;
>> end ;
>> stop ;
>> format date mmddyy10.;
>>run ;
>
> sasfile practice_dates load;
>
>>
>>* Subquery solution ;
>>proc sql buffersize=1e7 _method ;
>> create table pd1 as
>> select *
>> from practice_dates
>> where id in ( select distinct id
>> from practice_dates
>> where passed = 'Y' )
>> ;
>>quit;
>
> proc sql buffersize=1e7 _method noprint;
> select count(*) into : count
> from practice_dates
> where id in ( select distinct id
> from practice_dates
> where passed = 'Y' )
> ;
> %put Count is &count;
> quit;
>
>(and corresponding changes to other solutions)
>
>>
>>* Join solution ;
>>proc sql buffersize=1e7 _method ;
>> create table pd2 as
>> select T1.*
>> from practice_dates as T1
>> , ( select distinct id
>> from practice_dates
>> where passed='Y' ) as T2
>> where T1.id=T2.id
>> ;
>> quit ;
>>
>>*hash solution ;
>>data need;
>>if _n_ = 1 then do;
>> if 0 then set practice_dates;
>> declare hash h(ordered:'a');
>> h.definekey('id','count');
>> h.definedata('id','date','test','passed');
>> h.definedone();
>> declare hash hp(ordered:'a');
>> hp.definekey('id');
>> hp.definedata('count','flag');
>> hp.definedone();
>>end;
>>do until(eof);
>> set practice_dates end = eof;
>> flag = 0;
>> if hp.find() ne 0 then count = 0;
>> count ++ 1;
>> if passed = 'Y' then flag = 1;
>> hp.replace();
>>
>> h.add();
>>end;
>>declare hiter hi('h');
>>rc = hi.first();
>>do while(rc = 0);
>> hp.find();
>> do _n_ = 1 to count while(rc = 0);
>> if flag then output;
>> rc = hi.next();
>> end;
>>end;
>>drop rc count flag;
>>stop;
>>run;
>>
>>*Sort+data step solution ;
>>proc sort data=practice_dates;
>> by id descending passed;
>>run;
>>
>>data want (drop=passed_test);
>> set practice_dates;
>> retain passed_test;
>> by id;
>> if first.id then do;
>> if passed eq 'Y' then passed_test=1;
>> else passed_test=0;
>> end;
>> if passed_test then output;
>>run;
>>-----------
>>On Sat, 17 May 2008 15:54:29 -0400, Muthia Kachirayan
>><muthia.kachirayan@GMAIL.COM> wrote:
>>
>>>On Sat, May 17, 2008 at 12:47 PM, Ken Borowiak
<EvilPettingZoo97@aol.com>
>>>wrote:
>>>
>>>> On Sat, 17 May 2008 11:36:14 -0400, Howard Schreier <hs AT dc-sug DOT
>>org>
>>>> <schreier.junk.mail@GMAIL.COM> wrote:
>>>>
>>>> >On Sat, 17 May 2008 10:24:52 -0400, Ken Borowiak <
>>>> EvilPettingZoo97@AOL.COM>
>>>> >wrote:
>>>> >
>>>> >>On Fri, 16 May 2008 14:08:15 -0400, Jack Clark
<JClark@CHPDM.UMBC.EDU>
>>>> wrote:
>>>> >>
>>>> >>>Here is a PROC SQL approach. The sort you have in your code is not
>>>> >>>necessary.
>>>> >>>
>>>> >>>proc sql;
>>>> >>> select *
>>>> >>> from practice_dates
>>>> >>> where id in (select distinct id
>>>> >>> from practice_dates
>>>> >>> where passed = 'Y')
>>>> >>> ;
>>>> >>>quit;
>>>> >>>
>>>> >>>
>>>> >>>Jack Clark
>>>> >>>Research Analyst
>>>> >>>Center for Health Program Development and Management
>>>> >>>University of Maryland, Baltimore County
>>>> >>>
>>>> >>>
>>>> >>>-----Original Message-----
>>>> >>>From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf
Of
>>>> >>>geraden72011
>>>> >>>Sent: Friday, May 16, 2008 1:54 PM
>>>> >>>To: SAS-L@LISTSERV.UGA.EDU
>>>> >>>Subject: question about creating a data subset
>>>> >>>
>>>> >>>I have the following example i created:
>>>> >>>
>>>> >>>data practice_dates;
>>>> >>>input id date mmddyy8. test passed $;
>>>> >>>format date mmddyy10.;
>>>> >>>datalines;
>>>> >>>1234 01021969 2345 Y
>>>> >>>1234 01301969 3456 N
>>>> >>>3157 02031969 2345 N
>>>> >>>3157 02201969 2897 N
>>>> >>>3157 04151969 2345 Y
>>>> >>>1011 02051969 2345 N
>>>> >>>1011 02211969 2345 N
>>>> >>>1011 05201969 2897 N
>>>> >>>2468 03211969 2234 Y
>>>> >>>2468 07151969 2255 Y
>>>> >>>;
>>>> >>>proc sort data=practice_dates;
>>>> >>>by id date;
>>>> >>>run;
>>>> >>>
>>>> >>>What I want to do is to create a new datafile that has all of the
>>>> >>>records for the ID's that have a passed = 'Y' for any of the tests.
>>>> >>>
>>>> >>>so for the above example my new datafile would have all of the
>>>> >>>following
>>>> >>>
>>>> >>>1234 01021969 2345 Y
>>>> >>>1234 01301969 3456 N
>>>> >>>3157 02031969 2345 N
>>>> >>>3157 02201969 2897 N
>>>> >>>3157 04151969 2345 Y
>>>> >>>2468 03211969 2234 Y
>>>> >>>2468 07151969 2255 Y
>>>> >>>
>>>> >>>any help is greatly appreciated.
>>>> >>
>>>> >>On my to-do list is to research the implementation of SQL subqueries
>>>> under
>>>> >>the hood and their performance vis-a-vis inner joins. My working
>>>> perception
>>>> >>is that subqueries are not as efficient as joins, particularly when
>>the
>>>> join
>>>> >>uses a hash strategy.
>>>> >>
>>>> >>Creating some sample data larger in scale than in the original post:
>>>> >>
>>>> >>data practice_dates ;
>>>> >> do id=1 to 500000 ;
>>>> >> do date='01JAN1969'd to '10JAN1969'd ;
>>>> >> test++1 ;
>>>> >> if ranuni( 90210 ) le .02 then passed='Y' ;
>>>> >> else passed='N' ;
>>>> >> output ;
>>>> >> end ;
>>>> >> end ;
>>>> >> stop ;
>>>> >> format date mmddyy10.;
>>>> >> run ;
>>>> >>
>>>> >>NOTE: The data set PRACTICE_DATES has 5000000 observations and 4
>>>> variables.
>>>> >>
>>>> >>* Subquery solution ;
>>>> >>proc sql buffersize=1e7 _method ;
>>>> >> create table pd1 as
>>>> >> select *
>>>> >> from practice_dates
>>>> >> where id in ( select distinct id
>>>> >> from practice_dates
>>>> >> where passed = 'Y' )
>>>> >> ;
>>>> >>quit;
>>>> >>
>>>> >>NOTE: SQL execution methods chosen are:
>>>> >>
>>>> >> sqxcrta
>>>> >> sqxfil
>>>> >> sqxsrc( PRACTICE_DATES )
>>>> >>
>>>> >>NOTE: SQL subquery execution methods chosen are:
>>>> >>
>>>> >> sqxsubq
>>>> >> sqxuniq
>>>> >> sqxsrc( PRACTICE_DATES )
>>>> >>NOTE: Table PD1 created, with 911810 rows and 4 columns.
>>>> >>
>>>> >>NOTE: PROCEDURE SQL used (Total process time):
>>>> >> real time 32.45 seconds
>>>> >>
>>>> >>
>>>> >>
>>>> >>* Join solution ;
>>>> >>proc sql buffersize=1e7 _method ;
>>>> >> create table pd2 as
>>>> >> select T1.*
>>>> >> from practice_dates as T1
>>>> >> , ( select distinct id
>>>> >> from practice_dates
>>>> >> where passed='Y' ) as T2
>>>> >> where T1.id=T2.id
>>>> >> ;
>>>> >> quit ;
>>>> >>
>>>> >>NOTE: SQL execution methods chosen are:
>>>> >>
>>>> >> sqxcrta
>>>> >> sqxjhsh <-- oh yeah... a hash join !!!
>>>> >> sqxsrc( PRACTICE_DATES(alias = T1) )
>>>> >> sqxuniq
>>>> >> sqxsrc( PRACTICE_DATES )
>>>> >>NOTE: Table PD2 created, with 911810 rows and 4 columns.
>>>> >>
>>>> >>268 quit ;
>>>> >>NOTE: PROCEDURE SQL used (Total process time):
>>>> >> real time 10.64 seconds
>>>> >>
>>>> >>
>>>> >>
>>>> >>In this example, the join solution was able to implement a hash join
>>>> >>strategy and kicks the subquery's behind.
>>>> >>
>>>> >>Of course, a hash solution to the problem can implemented in a DATA
>>step
>>>> >>without requiring a sort.
>>>> >>
>>>> >>Pax,
>>>> >>Ken Borowiak
>>>> >
>>>> >What version are you using? A couple of quick runs lead me to believe
>>that
>>>> >the advantage is much smaller with 9.2 than it is with 9.1.3.
>>>>
>>>>
>>>> Howard and All,
>>>>
>>>> This example was run with V9.1.3 SP4. I don't have my mits on 9.2 yet,
>>but
>>>> I
>>>> would be interested in seeing the results.
>>>>
>>>> pax,
>>>> ken
>>>>
>>>
>>>Here is a Hash version for an unsorted dataset. My system is small to
test
>>>the timing. I guess that Data step hash version will have a slender
>>>advantage over Proc SQL method.
>>>
>>>
>>>data need;
>>>if _n_ = 1 then do;
>>> if 0 then set practice_dates;
>>> declare hash h(ordered:'a');
>>> h.definekey('id','count');
>>> h.definedata('id','date','test','passed');
>>> h.definedone();
>>> declare hash hp(ordered:'a');
>>> hp.definekey('id');
>>> hp.definedata('count','flag');
>>> hp.definedone();
>>>end;
>>>do until(eof);
>>> set practice_dates end = eof;
>>> flag = 0;
>>> if hp.find() ne 0 then count = 0;
>>> count ++ 1;
>>> if passed = 'Y' then flag = 1;
>>> hp.replace();
>>>
>>> h.add();
>>>end;
>>>declare hiter hi('h');
>>>rc = hi.first();
>>>do while(rc = 0);
>>> hp.find();
>>> do _n_ = 1 to count while(rc = 0);
>>> if flag then output;
>>> rc = hi.next();
>>> end;
>>>end;
>>>drop rc count flag;
>>>stop;
>>>run;
>>>
>>>Muthia Kachirayan
|