Date: Wed, 2 Apr 1997 10:21:11 EDT
Reply-To: whitloi1@WESTATPO.WESTAT.COM
Sender: "SAS(r) Discussion" <SAS-L@UGA.CC.UGA.EDU>
From: Ian Whitlock <whitloi1@WESTATPO.WESTAT.COM>
Subject: Re[2]: FW: FW: why did this code take 22 hours to run?
Subject: Re: FW: FW: why did this code take 22 hours to run?
Summary: SAS not humor.
Ian Whitlock <whitloi1@westat.com>
The following is the twisted tale started by having Rich Livornese
<RICH@BOALAN.ATTMAIL.COM> ask why a SET - BY on an indexed data set
took 22 hours?
> 1726 DATA HHC.ERCPI;
> 1727 SET HHC.ER;
> 1728 BY CPI_KEY;
> 1729 IF FIRST.CPI_KEY;
> 1730 KEEP CPI_KEY;
> 1731 RUN;
>
> NOTE: The data set HHC.ERCPI has 1345222 observations and 1 variables.
> NOTE: The DATA statement used 22 hours 21 minutes 15.77 seconds.
Laurie Flemming suggested using a view <lfleming@actrix.gen.nz> and
sorting the view. Then Clinton S Rickards <RickardsCS@AETNA.COM>
responded changing Laurie's name and saying that a sort with the KEEP=
option was good enough. I took the opportunity (Don Stanley mentioned
Laurie in San Diego) to point out to Clinton various
misunderstandings. In particular that the Laurie's view was a good
idea because the KEEP= option on input to PROC SORT is actually
processed on the sorted file.
This does not mean that one does not have to keep the sort variable. As
shown by Rich's test
> 68 data aa;
> 69 array a1 (10) x1-x10;
> 70 do i=10 to 1 by -1;
> 71 do j=1 to 10;
> 72 a1(j)=i*10;
> 73 end;
> 74 output;
> 75 end;
> 76
> 77
>
> NOTE: The data set WORK.AA has 10 observations and 12 variables.
> NOTE: The DATA statement used 0.2 seconds.
>
>
> 78 proc sort data=aa(keep=x1-x5)
> 79 out=bb;
> 80 by i;
> ERROR: Variable I not found.
> 81
To get back to the point, here is a log (SAS 6.11 under Windows
3.11) showing how a view can save time over using the KEEP= option on a
big sort.
35 data w1 w2 ( keep = i ) ;
36 retain a1 - a20 'xxxxxxxxxxxxxxxxxxxxxxxxxxx' ;
37 do i = 20000 to 1 by - 1 ; output ; end ;
38 run ;
NOTE: The data set WORK.W1 has 20000 observations and 21 variables.
NOTE: The data set WORK.W2 has 20000 observations and 1 variables.
NOTE: The DATA statement used 11.52 seconds.
39
40 proc sort data = w1 ( keep = i ) out = t ;
41 by i ;
42 run ;
NOTE: The data set WORK.T has 20000 observations and 1 variables.
NOTE: The PROCEDURE SORT used 42.52 seconds.
43
44 proc sort data = w2 out = t ;
45 by i ;
46 run ;
NOTE: The data set WORK.T has 20000 observations and 1 variables.
NOTE: The PROCEDURE SORT used 1.69 seconds.
47
48 data w3 / view = w3 ;
49 set w1 ( keep = i ) ;
50 run ;
NOTE: DATA STEP view saved on file WORK.W3.
NOTE: The original source statements cannot be retrieved from a stored DATA STEP
view nor will
a stored DATA STEP view run under a different release of the SAS system or
under a
different operating system.
Please be sure to save the source statements for this DATA STEP view.
NOTE: The DATA statement used 0.66 seconds.
51
52 proc sort data = w3 out = t ;
53 by i ;
54 run ;
NOTE: The view WORK.W3.VIEW used 7.84 seconds.
NOTE: The data set WORK.T has 20000 observations and 1 variables.
NOTE: The PROCEDURE SORT used 8.4 seconds.
The KEEP= method took 42.52 seconds and the view took 0.66 + 8.4
seconds. That is a reduction in time by almost a factor of 5. Of
course the sort of a narrow file wins with 1.69 seconds. The question
is what does it cost to create the narrow file.
In a SUGI 22 Coder's Corner paper John Whittington points out that the
overhead cost of using a view is sufficiently large that it is often
just as fast to make a narrow data set before the sort as it is to use
a view. In any case either is very much more efficient than using
indexing or sorting with the KEEP= option.
Finally in connection with this question, Judy Loren mentioned a horror
story with sorting a view under MVS. Was the OUT= parameter used?
(Under Windows one simply gets a message that the data set can not be
created since a view with that name already exists.) Has the problem
been fixed in more recent releases? How about some testing from some
MVS people?
Ian Whitlock