Date: Tue, 19 Feb 2008 20:20:18 -0500
Reply-To: Arthur Tabachneck <art297@NETSCAPE.NET>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Arthur Tabachneck <art297@NETSCAPE.NET>
Subject: Re: Why does retain work faster conditionally?
Content-Type: text/plain; charset=ISO-8859-1
First, thank you to everyone who responded and, hopefully, we won't spend
much more (if any) bandwidth on this matter.
For the newbees who may be following the thread, you may need a bit of
background. Rather than being a programmer, I'm what one might call one
of those pointed-haired (if I had any) Dilbert-type managers.
Additionally, rather than having a programming background, I'm a cross
between a Psychologist and a Statistician.
In short, PLEASE don't EVER look at my suggested code as being
representative of good programing skills, unless you happen to be either a
Psychologist, a statistician, a pointed-hair administrator, or a pointed-
hair administrator wannabe (wannabe for only lack of hair of course).
That said, Paul (and others), I wasn't testing SAS but, rather, the
practical application of SAS. Paul, you may be happy to know that, given
your test, your computer actually works faster than mine in that
particular situation (note-to-self: buy a faster server).
Pointy-haired-administrator-wannabees have to evaluate what we're given,
elephants and all, since that is what we have to administer.
As such, after everyone was off the server, I retested what I had posted
earlier this morning. On a 4-processor Windows 2003 server, I got the
same results as I did this morning. The following are the results I got
on eight separate tests:
cpu times
iteration type create_file test
1 no_if 0.61 1.39
1 with_if 0.54 0.96
1 sql 0.53 0.92
2 no_if 0.65 1.42
2 with_if 0.60 1.03
2 sql 0.60 0.98
3 no_if 0.67 1.40
3 with_if 0.64 1.06
3 sql 0.60 0.92
4 no_if 0.68 1.42
4 with_if 0.60 1.01
4 sql 0.51 0.90
5 no_if 0.65 1.43
5 with_if 0.60 1.07
5 sql 0.56 0.95
6 no_if 0.67 1.42
6 with_if 0.64 1.12
6 sql 0.61 1.03
7 no_if 0.65 1.39
7 with_if 0.62 1.12
7 sql 0.56 1.00
8 no_if 0.62 1.37
8 with_if 0.61 1.10
8 sql 0.64 0.98
I'll have to admit, SQL for this particular task, was always faster.
However, I still don't understand why adding the if condition on a 4-
processor server made it preferable to not including the if condition.
Hopefully, a birdie might be able to tell me, off-line, why. If one does,
you can be sure that I'll add one more post to this thread.
Art
----------
On Tue, 19 Feb 2008 15:52:14 +0000, Paul Dorfman <sashole@BELLSOUTH.NET>
wrote:
>Art,
>
>I suspect that this difference in the run times is dictated by the
external factors rather than the differences between the two DATA step
versions. I have eliminated the output data set HAVE to reduce I/O
background noise and repeated the test twice for consistency sake (under
Windows XPro on a T61 ThinkPad as so):
>
>514 data a ;
>515 retain lname 'Galt' fname 'John' ;
>516 do _n_ = 1 to 1e7 ;
>517 output ;
>518 end ;
>519 run ;
>NOTE: The data set WORK.A has 10000000 observations and 2 variables.
>NOTE: DATA statement used (Total process time):
> real time 6.56 seconds
> cpu time 2.57 seconds
>
>520 data _null_ ;
>521 retain fname;
>522 set a;
>523 run;
>NOTE: There were 10000000 observations read from the data set WORK.A.
>NOTE: DATA statement used (Total process time):
> real time 1.51 seconds
> cpu time 1.51 seconds
>
>524 data _null_ ;
>525 if _n_ eq 1 then do;
>526 retain fname;
>527 end;
>528 set a;
>529 run;
>NOTE: There were 10000000 observations read from the data set WORK.A.
>NOTE: DATA statement used (Total process time):
> real time 1.54 seconds
> cpu time 1.51 seconds
>
>530 data _null_ ;
>531 retain fname;
>532 set a;
>533 run;
>NOTE: There were 10000000 observations read from the data set WORK.A.
>NOTE: DATA statement used (Total process time):
> real time 1.48 seconds
> cpu time 1.48 seconds
>
>534 data _null_ ;
>535 if _n_ eq 1 then do;
>536 retain fname;
>537 end;
>538 set a;
>539 run;
>NOTE: There were 10000000 observations read from the data set WORK.A.
>NOTE: DATA statement used (Total process time):
> real time 1.54 seconds
> cpu time 1.54 seconds
>
>However, even though the steps compared as I expected (i.e. executing a
conditional statement 10 million times costs more than nothing) I would
not draw the definite conclusion based on this comparison because the
background input noise still mars the measurement.
>
>The analogy I usually use in this sort of situation is that it is
physically impossible to use a weigh station scale to weigh a fly by
subtracting the weight of an elephant with the fly on its behind measured
from the weight of the bare-ass elephant, for the difference will be
inevitably dwarfed by the measurement errors. To weigh the fly, one needs
to eliminate the elephant from the picture and weigh the fly (preferably
not airborne) itself using a precision scale.
>
>In this case, eliminating the elephant would mean:
>
>602 data _null_ ;
>603 lname = 'Galt' ;
>604 fname = 'John' ;
>605 do _n_ = 1 to 5e9 ;
>606 retain fname ;
>607 end ;
>608 run ;
>NOTE: DATA statement used (Total process time):
> real time 1:17.40
> cpu time 1:17.35
>
>609 data _null_ ;
>610 lname = 'Galt' ;
>611 fname = 'John' ;
>612 do _n_ = 1 to 5e9 ;
>613 if _n_ = 1 then do ;
>614 retain fname ;
>615 end ;
>616 end ;
>617 run ;
>NOTE: DATA statement used (Total process time):
> real time 1:21.50
> cpu time 1:21.23
>
>Note SAS kis so blazingly fast in the execution of the conditional
statement that I have been able to detect a measurable difference (and
that is after eliminating all I/O!) by iterating the loops over a billion
times. Iterating them 10 million times only has resulted in 0.15 seconds
for each step, the difference being beyond the accuracy.
>
>Of course, to my mind, all the measurements with RETAIN between IF and DO
are a funny exercise not unlike an experiment I would stage to prove to
myself that it is impossible to build a perpetuum mobile, because I know
from the onset that at the run time, SAS simply does not see RETAIN (all
its actions have been completed at the compile time beforehand). A good
hint at the RETAIN not having been intended to be run conditionally is
that the "instruction"
>
>if _n_ = 1 then do retain fname ;
>
>will not even compile -- a RETAIN statement must begin with the RETAIN
keyword right after the preceding semicolon. That is why it compiles
within the DO-END block, although at the run time SAS sees no difference
whatsoever between
>
>if _n_ = 1 then do ;
> retain fname ;
>end ;
>
>and
>
>if _n_ = 1 then do ;
>end ;
>
>Kind regards
>------------
>Paul Dorfman
>Jax, FL
>------------
>
>-------------- Original message ----------------------
>From: Arthur Tabachneck <art297@NETSCAPE.NET>
>>
>> One of our most respected list members wrote me off-line, asking why in
>> the world I would have suggested wrapping a retain statement within a
>> condition.
>>
>> That is, given the following data:
>>
>> data have;
>> input lname$ fname$;
>> do i=1 to 1000000;output;end;
>> cards;
>> lname1 fname1
>> lname2 fname2
>> ;
>>
>> why write:
>>
>> data want;
>> if _n_ eq 1 then do;
>> retain fname;
>> end;
>> set have;
>> run;
>>
>> instead of:
>> data want;
>> retain fname;
>> set a;
>> run;
>>
>> I know why I provided the solution, because it had better performance,
but
>> I could sure use some feedback explaining why that would be so.
>>
>> I initially wrote it correctly and, upon seeing that it worked slower
than
>> Jiann's SQL solution, tried to see if I could bypass reading the data
>> (i.e., when _n_ eq 0).
>>
>> After I soon realized that wouldn't be possible, I ran the step as
>> presented.
>>
>> Someone please explain to me why:
>>
>> 60 data want;
>> 61 if _n_ eq 1 then do;
>> 62 retain fname;
>> 63 end;
>> 64 set a;
>> 65 run;
>>
>> NOTE: There were 2000000 observations read from the data set WORK.A.
>> NOTE: The data set WORK.WANT has 2000000 observations and 3 variables.
>> NOTE: DATA statement used (Total process time):
>> real time 1.12 seconds
>> cpu time 1.12 seconds
>>
>> runs almost 50% faster than:
>> 56 data want;
>> 57 retain fname;
>> 58 set a;
>> 59 run;
>>
>> NOTE: There were 2000000 observations read from the data set WORK.A.
>> NOTE: The data set WORK.WANT has 2000000 observations and 3 variables.
>> NOTE: DATA statement used (Total process time):
>> real time 1.43 seconds
>> cpu time 1.43 seconds
>>
>> I ran the tests on a 4-processor Window's 2003 system with 12 gig of ram
>> and SAS 9.1.3. It was during a holiday, thus I was the only one using
the
>> computer and I re-ran the tests 3 times with the same results.
>>
>> Art
>> --------
>> On Mon, 18 Feb 2008 23:21:23 -0500, Arthur Tabachneck
>> <art297@NETSCAPE.NET> wrote:
>>
>> >Miguel,
>> >
>> >As Jiann indicated, you can do what you want with proc sql. However,
you
>> >can also accomplish the same thing in a data step. For example,
>> >
>> >data have;
>> > input lname$ fname$;
>> > do i=1 to 1000000;output;end;
>> > cards;
>> > lname1 fname1
>> > lname2 fname2
>> > ;
>> >
>> >data want;
>> > if _n_ eq 1 then do;
>> > retain fname;
>> > end;
>> > set have;
>> >run;
>> >
>> >HTH,
>> >Art
>> >---------
>> >On Tue, 19 Feb 2008 02:55:04 +0000, Miguel de la Hoz
<miguel_hoz@YAHOO.ES>
>> >wrote:
>> >
>> >>I am starting my problem with the following disposal of my dataset:
>> >
>> ># variable
>> >1 lname
>> >2 fname
>> >
>> >I am trying to export it to excel but it is keeping that order. I would
>> >like to be able to write
>> >
>> ># variable
>> >1 fname
>> >2 lname
>> >
>> >This is only an example my dataset contains around 20 fields.
>> >
>> >Thanks.
>> >
>> >MDH.
>> >
>> >
>> >
>> >______________________________________________
>> >¿Con Mascota por primera vez? Sé un mejor Amigo. Entra en Yahoo!
>> >Respuestas http://es.answers.yahoo.com/info/welcome
|