LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (April 1997, week 3)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Sun, 20 Apr 1997 20:33:21 EDT
Reply-To:     whitloi1@WESTATPO.WESTAT.COM
Sender:       "SAS(r) Discussion" <SAS-L@UGA.CC.UGA.EDU>
From:         Ian Whitlock <whitloi1@WESTATPO.WESTAT.COM>
Subject:      Re[2]: why did this code take 22 hours to run?

Subject: Re: why did this code take 22 hours to run? Summary: A complete indexed read of a large file can be painfully slow. Respondent: Ian Whitlock <whitloi1@westat.com>

Various solutions to Tod Mijanovich's <mijnvcht@is3.nyu.edu> problem of a simple data step that took 22 hours were offered by various people, but I do not remember seeing an explanation of why the given code took so long. It does deserve an explanation.

Apparently (copied from an incorrect anwer) Tod wrote:

> But I'm still stumped by the excruciating slowness of my original > program. Again, I've run complicated data steps on datasets that were > both wider (more vars) and longer (more recs) that ran significantly > faster. I don't want to use the word "bug", but how about "anomaly"? > I'd be curious to know if anyone can replicate my experience: a data > step taking a very long time that simply cuts a couple-million-record > dataset in about half by weeding out records with non-unique keys. > The reason this might be important to try to replicate is that there > may be something goofy going on in data step code that usually gets > masked by the operation of more complicated data step code, but that's > responsible for taking up more time than the data step needs to take.

Remember the key step here was a simple subsetting step involving a SET statement with a BY statement where the file was indexed instead of sorted. To look at the situation I considered the following macro which generates test data and then subsets it. For N=3 the main data set size is 1,000, 10,000, and 100,000 observations. (On my entry level pentium 100,000 using SAS 6.11 under Windows 3.1 is the upper level of my patience quota.) Essentially the only factors of interest are the size of the main data set (long and narrow), the simplicity of the subsetting DATA step, and the fact that the file is indexed.

%macro rep ( n = 3 ) ; %do i = 1 %to &n ; %let num = 100 * 10 ** &i ; data w ( index = ( xy = ( x y ) ) ) ; do i = 1 to &num ; x = ranuni ( 86745209 ) ; y = ranuni ( 86745209 ) ; output ; output ; end ; run ;

data w2 ; set w ; by x y ; if first.y ; run ; %end ; %mend rep ;

%rep(n = 3)

Here are the results.

NOTE: The data set WORK.W has 2000 observations and 3 variables. NOTE: The DATA statement used 2.08 seconds.

NOTE: The data set WORK.W2 has 1000 observations and 3 variables. NOTE: The DATA statement used 1.32 seconds.

NOTE: The data set WORK.W has 20000 observations and 3 variables. NOTE: The DATA statement used 5.33 seconds.

NOTE: The data set WORK.W2 has 10000 observations and 3 variables. NOTE: The DATA statement used 10.86 seconds.

NOTE: The data set WORK.W has 200000 observations and 3 variables. NOTE: Timeout waiting for application - possible infinite loop NOTE: The DATA statement used 1 minute 2.0 seconds.

NOTE: The data set WORK.W2 has 100000 observations and 3 variables. NOTE: The DATA statement used 32 minutes 1.23 seconds.

Look at the ratio of time to subset to the time to create the data since it measures the effort of subsetting relative to the effort of creating. Roughly

1:2 for 1,000 obs 2:1 for 10,000 obs increased by factor of 4 32/1 for 100,000 obs increased by a factor of 16

It is clear that reading large indexed files using a BY statement can take an awful lot of time. Why?

In the first case I assume a large portion of the index file and the corresponding records could be held in memory. Consequently IO was relatively unimportant because adjacent groups were likely to be in the same physical block. Hence the subsetting time was dominated by the fact that half the records were eliminated.

In the second case this was less true. Now SAS had to read a new physical block to get many of the next by groups. So the ratio climbs.

In the third case almost every next by group is in a different physical block hence the IO starts getting astronomical.

Why was sorting so much more efficient? Here adjacent by groups are most likely in the same physical block so the IO is stable compared to the size of the file. In other words subsetting a file twice as large is twice as hard because there is twice as much IO, instead of many times as hard because the IO has gone astronomical.

I see it, this as a classic example of how indexing can hurt you. What is the characteristic of the problem? Indexing is efficient for locating a few records out of many, but it is very inefficient for getting all the records. Moreover, the inefficiency grows exponentially with the size of the file. This is not an anomaly, it is a logical consequence of the nature of indexing.

If you must do an indexed read on a large file, then make the buffers as large as possible.

Ian Whitlock


Back to: Top of message | Previous page | Main SAS-L page