LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (May 2008, week 3)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:   Sun, 18 May 2008 23:44:13 -0400
Reply-To:   Arthur Tabachneck <art297@NETSCAPE.NET>
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   Arthur Tabachneck <art297@NETSCAPE.NET>
Subject:   Re: question about creating a data subset
Comments:   To: "Howard Schreier <hs AT dc-sug DOT org>" <schreier.junk.mail@GMAIL.COM>

Howard,

Per your suggestion I did run such a comparison, although you might want to modify my test code in case I didn't sufficiently control for all extraneous sources.

The code and results are shown below.

While I fully realize that some of our list members don't appreciate such comparisons, I (and I'm sure some others) would be interested in knowing why the observed differences occurred and if SAS 9.2 produces different results.

Anyone care to offer explanations?

Art

* Real Time CPU Time Rank; * --------- -------- ----;

*Create test data; data practice_dates (drop=i); do i=1 to 500000 ; id=round(1000000*ranuni(0)); do date='01JAN1969'd to '10JAN1969'd ; test++1 ; if ranuni( 90210 ) le .02 then passed='Y' ; else passed='N' ; output ; end ; end ; stop ; format date mmddyy10.; run ; * Trial 1: 3.10 3.06 N/A; * Trial 2: 3.06 3.04 N/A; * Trial 3: 3.10 3.04 N/A;

*Sort test data; proc sort data=work.practice_dates; by id descending passed; run; * Trial 1: 7.73 9.21 N/A; * Trial 2: 8.37 8.79 N/A; * Trial 3: 6.85 8.39 N/A;

*Create SASFILE; sasfile work.practice_dates.data open; * Trial 1: N/A N/A N/A; * Trial 2: N/A N/A N/A; * Trial 3: N/A N/A N/A;

*Join Approach; proc sql buffersize=1e7 noprint ; select T1.* from practice_dates as T1 , ( select distinct id from practice_dates where passed='Y' ) as T2 where T1.id=T2.id ; quit ; *Trial 1: 0.71 0.71 1; *Trial 2: 0.71 0.71 1; *Trial 3: 0.73 0.71 1;

*Data step with double DOW approach; data _null_; do until (last.id); set practice_dates; by id; if passed="Y" then passedAny=1; end; do until (last.id); set practice_dates; by id; if passedAny; end; run; *Trial 1: 2.34 2.34 2; *Trial 2: 2.32 2.32 2; *Trial 3: 2.34 2.34 2;

*Data step approach; data _null_ (drop=passed_test); set practice_dates; retain passed_test; by id; if first.id and passed eq 'Y' then passed_test=1; if passed_test; run; *Trial 1: 2.87 2.87 3; *Trial 2: 2.79 2.79 3; *Trial 3: 2.81 2.81 3;

*Subquerry approach; proc sql buffersize=1e7 noprint; select * from practice_dates where id in ( select distinct id from practice_dates where passed = 'Y' ) ; quit; * Trial 1: 3.38 3.37 4; * Trial 2: 3.38 3.37 4; * Trial 3: 3.38 3.37 4;

*Data step with two file set approach; data _null_; set practice_dates( in=first where=(passed='Y') ) practice_dates( in=second); by id; if first.id then flag + first - flag; if flag and second; run; *Trial 1: 3.51 3.51 5; *Trial 2: 3.51 3.50 5; *Trial 3: 3.51 3.51 5;

*Data step with merge approach; data _null_; merge practice_dates(keep = id passed rename=(passed=p) where=(p='Y') in=y) practice_dates; by id; if y; run; *Trial 1: 5.23 5.23 6 (Tie); *Trial 2: 5.23 5.23 6; *Trial 3: 5.23 5.23 6 (Tie);

*Data step with do until approach; data _null_ (where = ( new in (1,2) )) ; do until (last.id) ; set practice_dates ; by id ; select(passed); when('Y') yes=1; when('N') no=1; otherwise; end ; end ; if yes and no then new=1 ; else if yes then new=2 ; else if no then new=3 ; do until (last.id) ; set practice_dates ; by id ; end ; run ; *Trial 1: 5.23 5.23 6 (Tie); *Trial 2: 5.25 5.25 7; *Trial 3: 5.23 5.23 6 (Tie);

*Data step with self-interleave approach; data _null_; set practice_dates (in=first) practice_dates (in=second); by id; if first.id then flag = 0; if first and passed eq 'Y' then flag+1; else if second and flag gt 0; run; *Trial 1: 5.71 5.71 8; *Trial 2: 5.84 5.84 8; *Trial 3: 5.71 5.71 8;

*Hash Approach; data _null_; if _n_ = 1 then do; if 0 then set practice_dates; declare hash h(ordered:'a'); h.definekey('id','count'); h.definedata('id','date','test','passed'); h.definedone(); declare hash hp(ordered:'a'); hp.definekey('id'); hp.definedata('count','flag'); hp.definedone(); end; do until(eof); set practice_dates end = eof; flag = 0; if hp.find() ne 0 then count = 0; count ++ 1; if passed = 'Y' then flag = 1; hp.replace(); h.add(); end; declare hiter hi('h'); rc = hi.first(); do while(rc = 0); hp.find(); do _n_ = 1 to count while(rc = 0); if flag; rc = hi.next(); end; end; stop; run; *Trial 1: 23.29 23.29 9; *Trial 2: 23.36 23.32 9; *Trial 3: 23.34 23.34 9;

*Close SASFILE; sasfile work.practice_dates.data close; * Trial 1: N/A N/A N/A; * Trial 2: N/A N/A N/A; * Trial 3: N/A N/A N/A;

----------- On Sun, 18 May 2008 13:00:55 -0400, Howard Schreier <hs AT dc-sug DOT org> <schreier.junk.mail@GMAIL.COM> wrote:

>On Sat, 17 May 2008 19:21:58 -0400, Arthur Tabachneck <art297@NETSCAPE.NET> >wrote: > >>Muthia, >> >>I did compare your proposed solution against Ken's two offerings and, >>surprisingly (on a 4-processor machine running SAS 9.1.3 sp4 at least), >>the hash solution resulted in the worst times of the three methods. >>Additionally, the hash solution took over two and a half times longer than >>the combined sort and data step method that I had originally suggested. >> >>Interestingly, when I post the results from such tests, I get an equal >>number of thank you notes and complaints. In that regard, I should >>mention that my tests were not systematically controlled, although I did >>run 3 replications of each method. >> >>Roughly, on average, I got the following approximate results: >> >> Ken's Subquery solution: 24 seconds >> Ken's Join solution: 4.5 seconds >> Your hash solution: 27 seconds >> My sort+data step solution: 10 seconds >> >>Like Ken, I too would be interested in seeing how the four methods compare >>on SAS 9.2. >> >>Art >> >>p.s. In case anyone with 9.2 cares to try it, here is the code I used: > >I suggest eliminating the disk I-O at the beginning and the end by (1) >loading the test data into SASFILE and (2) counting the rows meeting the >condition instead of storing or displaying them. > >> >>*Create test data; >>data practice_dates ; >> do id=1 to 500000 ; >> do date='01JAN1969'd to '10JAN1969'd ; >> test++1 ; >> if ranuni( 90210 ) le .02 then passed='Y' ; >> else passed='N' ; >> output ; >> end ; >> end ; >> stop ; >> format date mmddyy10.; >>run ; > > sasfile practice_dates load; > >> >>* Subquery solution ; >>proc sql buffersize=1e7 _method ; >> create table pd1 as >> select * >> from practice_dates >> where id in ( select distinct id >> from practice_dates >> where passed = 'Y' ) >> ; >>quit; > > proc sql buffersize=1e7 _method noprint; > select count(*) into : count > from practice_dates > where id in ( select distinct id > from practice_dates > where passed = 'Y' ) > ; > %put Count is &count; > quit; > >(and corresponding changes to other solutions) > >> >>* Join solution ; >>proc sql buffersize=1e7 _method ; >> create table pd2 as >> select T1.* >> from practice_dates as T1 >> , ( select distinct id >> from practice_dates >> where passed='Y' ) as T2 >> where T1.id=T2.id >> ; >> quit ; >> >>*hash solution ; >>data need; >>if _n_ = 1 then do; >> if 0 then set practice_dates; >> declare hash h(ordered:'a'); >> h.definekey('id','count'); >> h.definedata('id','date','test','passed'); >> h.definedone(); >> declare hash hp(ordered:'a'); >> hp.definekey('id'); >> hp.definedata('count','flag'); >> hp.definedone(); >>end; >>do until(eof); >> set practice_dates end = eof; >> flag = 0; >> if hp.find() ne 0 then count = 0; >> count ++ 1; >> if passed = 'Y' then flag = 1; >> hp.replace(); >> >> h.add(); >>end; >>declare hiter hi('h'); >>rc = hi.first(); >>do while(rc = 0); >> hp.find(); >> do _n_ = 1 to count while(rc = 0); >> if flag then output; >> rc = hi.next(); >> end; >>end; >>drop rc count flag; >>stop; >>run; >> >>*Sort+data step solution ; >>proc sort data=practice_dates; >> by id descending passed; >>run; >> >>data want (drop=passed_test); >> set practice_dates; >> retain passed_test; >> by id; >> if first.id then do; >> if passed eq 'Y' then passed_test=1; >> else passed_test=0; >> end; >> if passed_test then output; >>run; >>----------- >>On Sat, 17 May 2008 15:54:29 -0400, Muthia Kachirayan >><muthia.kachirayan@GMAIL.COM> wrote: >> >>>On Sat, May 17, 2008 at 12:47 PM, Ken Borowiak <EvilPettingZoo97@aol.com> >>>wrote: >>> >>>> On Sat, 17 May 2008 11:36:14 -0400, Howard Schreier <hs AT dc-sug DOT >>org> >>>> <schreier.junk.mail@GMAIL.COM> wrote: >>>> >>>> >On Sat, 17 May 2008 10:24:52 -0400, Ken Borowiak < >>>> EvilPettingZoo97@AOL.COM> >>>> >wrote: >>>> > >>>> >>On Fri, 16 May 2008 14:08:15 -0400, Jack Clark <JClark@CHPDM.UMBC.EDU> >>>> wrote: >>>> >> >>>> >>>Here is a PROC SQL approach. The sort you have in your code is not >>>> >>>necessary. >>>> >>> >>>> >>>proc sql; >>>> >>> select * >>>> >>> from practice_dates >>>> >>> where id in (select distinct id >>>> >>> from practice_dates >>>> >>> where passed = 'Y') >>>> >>> ; >>>> >>>quit; >>>> >>> >>>> >>> >>>> >>>Jack Clark >>>> >>>Research Analyst >>>> >>>Center for Health Program Development and Management >>>> >>>University of Maryland, Baltimore County >>>> >>> >>>> >>> >>>> >>>-----Original Message----- >>>> >>>From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of >>>> >>>geraden72011 >>>> >>>Sent: Friday, May 16, 2008 1:54 PM >>>> >>>To: SAS-L@LISTSERV.UGA.EDU >>>> >>>Subject: question about creating a data subset >>>> >>> >>>> >>>I have the following example i created: >>>> >>> >>>> >>>data practice_dates; >>>> >>>input id date mmddyy8. test passed $; >>>> >>>format date mmddyy10.; >>>> >>>datalines; >>>> >>>1234 01021969 2345 Y >>>> >>>1234 01301969 3456 N >>>> >>>3157 02031969 2345 N >>>> >>>3157 02201969 2897 N >>>> >>>3157 04151969 2345 Y >>>> >>>1011 02051969 2345 N >>>> >>>1011 02211969 2345 N >>>> >>>1011 05201969 2897 N >>>> >>>2468 03211969 2234 Y >>>> >>>2468 07151969 2255 Y >>>> >>>; >>>> >>>proc sort data=practice_dates; >>>> >>>by id date; >>>> >>>run; >>>> >>> >>>> >>>What I want to do is to create a new datafile that has all of the >>>> >>>records for the ID's that have a passed = 'Y' for any of the tests. >>>> >>> >>>> >>>so for the above example my new datafile would have all of the >>>> >>>following >>>> >>> >>>> >>>1234 01021969 2345 Y >>>> >>>1234 01301969 3456 N >>>> >>>3157 02031969 2345 N >>>> >>>3157 02201969 2897 N >>>> >>>3157 04151969 2345 Y >>>> >>>2468 03211969 2234 Y >>>> >>>2468 07151969 2255 Y >>>> >>> >>>> >>>any help is greatly appreciated. >>>> >> >>>> >>On my to-do list is to research the implementation of SQL subqueries >>>> under >>>> >>the hood and their performance vis-a-vis inner joins. My working >>>> perception >>>> >>is that subqueries are not as efficient as joins, particularly when >>the >>>> join >>>> >>uses a hash strategy. >>>> >> >>>> >>Creating some sample data larger in scale than in the original post: >>>> >> >>>> >>data practice_dates ; >>>> >> do id=1 to 500000 ; >>>> >> do date='01JAN1969'd to '10JAN1969'd ; >>>> >> test++1 ; >>>> >> if ranuni( 90210 ) le .02 then passed='Y' ; >>>> >> else passed='N' ; >>>> >> output ; >>>> >> end ; >>>> >> end ; >>>> >> stop ; >>>> >> format date mmddyy10.; >>>> >> run ; >>>> >> >>>> >>NOTE: The data set PRACTICE_DATES has 5000000 observations and 4 >>>> variables. >>>> >> >>>> >>* Subquery solution ; >>>> >>proc sql buffersize=1e7 _method ; >>>> >> create table pd1 as >>>> >> select * >>>> >> from practice_dates >>>> >> where id in ( select distinct id >>>> >> from practice_dates >>>> >> where passed = 'Y' ) >>>> >> ; >>>> >>quit; >>>> >> >>>> >>NOTE: SQL execution methods chosen are: >>>> >> >>>> >> sqxcrta >>>> >> sqxfil >>>> >> sqxsrc( PRACTICE_DATES ) >>>> >> >>>> >>NOTE: SQL subquery execution methods chosen are: >>>> >> >>>> >> sqxsubq >>>> >> sqxuniq >>>> >> sqxsrc( PRACTICE_DATES ) >>>> >>NOTE: Table PD1 created, with 911810 rows and 4 columns. >>>> >> >>>> >>NOTE: PROCEDURE SQL used (Total process time): >>>> >> real time 32.45 seconds >>>> >> >>>> >> >>>> >> >>>> >>* Join solution ; >>>> >>proc sql buffersize=1e7 _method ; >>>> >> create table pd2 as >>>> >> select T1.* >>>> >> from practice_dates as T1 >>>> >> , ( select distinct id >>>> >> from practice_dates >>>> >> where passed='Y' ) as T2 >>>> >> where T1.id=T2.id >>>> >> ; >>>> >> quit ; >>>> >> >>>> >>NOTE: SQL execution methods chosen are: >>>> >> >>>> >> sqxcrta >>>> >> sqxjhsh <-- oh yeah... a hash join !!! >>>> >> sqxsrc( PRACTICE_DATES(alias = T1) ) >>>> >> sqxuniq >>>> >> sqxsrc( PRACTICE_DATES ) >>>> >>NOTE: Table PD2 created, with 911810 rows and 4 columns. >>>> >> >>>> >>268 quit ; >>>> >>NOTE: PROCEDURE SQL used (Total process time): >>>> >> real time 10.64 seconds >>>> >> >>>> >> >>>> >> >>>> >>In this example, the join solution was able to implement a hash join >>>> >>strategy and kicks the subquery's behind. >>>> >> >>>> >>Of course, a hash solution to the problem can implemented in a DATA >>step >>>> >>without requiring a sort. >>>> >> >>>> >>Pax, >>>> >>Ken Borowiak >>>> > >>>> >What version are you using? A couple of quick runs lead me to believe >>that >>>> >the advantage is much smaller with 9.2 than it is with 9.1.3. >>>> >>>> >>>> Howard and All, >>>> >>>> This example was run with V9.1.3 SP4. I don't have my mits on 9.2 yet, >>but >>>> I >>>> would be interested in seeing the results. >>>> >>>> pax, >>>> ken >>>> >>> >>>Here is a Hash version for an unsorted dataset. My system is small to test >>>the timing. I guess that Data step hash version will have a slender >>>advantage over Proc SQL method. >>> >>> >>>data need; >>>if _n_ = 1 then do; >>> if 0 then set practice_dates; >>> declare hash h(ordered:'a'); >>> h.definekey('id','count'); >>> h.definedata('id','date','test','passed'); >>> h.definedone(); >>> declare hash hp(ordered:'a'); >>> hp.definekey('id'); >>> hp.definedata('count','flag'); >>> hp.definedone(); >>>end; >>>do until(eof); >>> set practice_dates end = eof; >>> flag = 0; >>> if hp.find() ne 0 then count = 0; >>> count ++ 1; >>> if passed = 'Y' then flag = 1; >>> hp.replace(); >>> >>> h.add(); >>>end; >>>declare hiter hi('h'); >>>rc = hi.first(); >>>do while(rc = 0); >>> hp.find(); >>> do _n_ = 1 to count while(rc = 0); >>> if flag then output; >>> rc = hi.next(); >>> end; >>>end; >>>drop rc count flag; >>>stop; >>>run; >>> >>>Muthia Kachirayan


Back to: Top of message | Previous page | Main SAS-L page