Date: Sun, 20 Apr 1997 20:33:21 EDT
Reply-To: whitloi1@WESTATPO.WESTAT.COM
Sender: "SAS(r) Discussion" <SAS-L@UGA.CC.UGA.EDU>
From: Ian Whitlock <whitloi1@WESTATPO.WESTAT.COM>
Subject: Re[2]: why did this code take 22 hours to run?
Subject: Re: why did this code take 22 hours to run?
Summary: A complete indexed read of a large file can be painfully
slow.
Respondent: Ian Whitlock <whitloi1@westat.com>
Various solutions to Tod Mijanovich's <mijnvcht@is3.nyu.edu> problem
of a simple data step that took 22 hours were offered by various
people, but I do not remember seeing an explanation of why the given
code took so long. It does deserve an explanation.
Apparently (copied from an incorrect anwer) Tod wrote:
> But I'm still stumped by the excruciating slowness of my original
> program. Again, I've run complicated data steps on datasets that were
> both wider (more vars) and longer (more recs) that ran significantly
> faster. I don't want to use the word "bug", but how about "anomaly"?
> I'd be curious to know if anyone can replicate my experience: a data
> step taking a very long time that simply cuts a couple-million-record
> dataset in about half by weeding out records with non-unique keys.
> The reason this might be important to try to replicate is that there
> may be something goofy going on in data step code that usually gets
> masked by the operation of more complicated data step code, but that's
> responsible for taking up more time than the data step needs to take.
Remember the key step here was a simple subsetting step involving a SET
statement with a BY statement where the file was indexed instead of
sorted. To look at the situation I considered the following macro
which generates test data and then subsets it. For N=3 the main data
set size is 1,000, 10,000, and 100,000 observations. (On my entry
level pentium 100,000 using SAS 6.11 under Windows 3.1 is the upper
level of my patience quota.) Essentially the only factors of interest
are the size of the main data set (long and narrow), the simplicity of
the subsetting DATA step, and the fact that the file is indexed.
%macro rep ( n = 3 ) ;
%do i = 1 %to &n ;
%let num = 100 * 10 ** &i ;
data w ( index = ( xy = ( x y ) ) ) ;
do i = 1 to &num ;
x = ranuni ( 86745209 ) ;
y = ranuni ( 86745209 ) ;
output ; output ;
end ;
run ;
data w2 ;
set w ;
by x y ;
if first.y ;
run ;
%end ;
%mend rep ;
%rep(n = 3)
Here are the results.
NOTE: The data set WORK.W has 2000 observations and 3 variables.
NOTE: The DATA statement used 2.08 seconds.
NOTE: The data set WORK.W2 has 1000 observations and 3 variables.
NOTE: The DATA statement used 1.32 seconds.
NOTE: The data set WORK.W has 20000 observations and 3 variables.
NOTE: The DATA statement used 5.33 seconds.
NOTE: The data set WORK.W2 has 10000 observations and 3 variables.
NOTE: The DATA statement used 10.86 seconds.
NOTE: The data set WORK.W has 200000 observations and 3 variables.
NOTE: Timeout waiting for application - possible infinite loop
NOTE: The DATA statement used 1 minute 2.0 seconds.
NOTE: The data set WORK.W2 has 100000 observations and 3 variables.
NOTE: The DATA statement used 32 minutes 1.23 seconds.
Look at the ratio of time to subset to the time to create the data
since it measures the effort of subsetting relative to the effort of
creating. Roughly
1:2 for 1,000 obs
2:1 for 10,000 obs increased by factor of 4
32/1 for 100,000 obs increased by a factor of 16
It is clear that reading large indexed files using a BY statement can
take an awful lot of time. Why?
In the first case I assume a large portion of the index file and
the corresponding records could be held in memory. Consequently
IO was relatively unimportant because adjacent groups were likely to
be in the same physical block. Hence the subsetting time was dominated
by the fact that half the records were eliminated.
In the second case this was less true. Now SAS had to read a new
physical block to get many of the next by groups. So the ratio climbs.
In the third case almost every next by group is in a different physical
block hence the IO starts getting astronomical.
Why was sorting so much more efficient? Here adjacent by groups are
most likely in the same physical block so the IO is stable compared to
the size of the file. In other words subsetting a file twice as large
is twice as hard because there is twice as much IO, instead of many
times as hard because the IO has gone astronomical.
I see it, this as a classic example of how indexing can hurt you. What
is the characteristic of the problem? Indexing is efficient for
locating a few records out of many, but it is very inefficient for
getting all the records. Moreover, the inefficiency grows
exponentially with the size of the file. This is not an anomaly, it is
a logical consequence of the nature of indexing.
If you must do an indexed read on a large file, then make the buffers
as large as possible.
Ian Whitlock