Date: Tue, 27 Oct 1998 16:03:44 -0500
Reply-To: "Dorfman, Paul" <pdorfma@UCS.ATT.COM>
Sender: "SAS(r) Discussion" <SAS-L@UGA.CC.UGA.EDU>
From: "Dorfman, Paul" <pdorfma@UCS.ATT.COM>
Subject: Summary: Sorting
Content-Type: text/plain; charset="iso-8859-1"
Raymond King <raymondk@MOCR.OAPI.COM> has initiated an interesting thread
involving several respondents. Instead of replying separately, I decided to
bring my $.02 of contribution under "one roof". Raymond, in part, wrote:
Does anyone know what algorithm is used as the default with PROC SORT. The
recent discussion of SORTing efficiency jump-started my curiosity. Paul
Dorfman's 'Insertion Sort' solution to a recent question was well received
and proved to be "three times as fast" as other methods. Yet, Insertion
Sort is on the Order of N**2 in complexity, and there are more efficient
sorting algorithms. It would be interesting to see how much better things
can get using an
algorithm that's on the Order of N complexity.
There are two unrelated issues: 1) What is the underlying PROC SORT
algorithm? and 2) what scheme could have been used in that particular
program instead of insertion sort to speed things up?
I do not know the exact answer to the first question but know exactly that
it is not insertion sort. It would be beneficial to PROC SORT performance if
there were two-three algorithms that PROC SORT could intelligently select
from depending on input. I doubt this is the case. More probably, one good
all-around scheme is used. Quicksort, Heapsort, some variety of merge sort
may all be good candidates.
The answer to the second question: With 10 elements to be sorted, it did not
really matter what algorithm to choose from, so I chose the shortest.
Remember, the task was to sort an array of 10 character variables in every
observation, and that is why, not because of some intrinsic advantages of
straight insertion, the code ran "three times as fast" as other methods. In
that case, even if the sorting step in the DATA-step-PROC-SORT-DATA-step
method took zero seconds to run, it would not make any difference due to the
nature of the task. All we had to do was to sort many small object already
residing in memory; the worst sorting routine will do that faster than the
best sorting routine accepting all this data as one huge randomly unordered
"Good" sorts and "bad" sorts differ not as much in how fast they sort small
inputs but in how well they handle large ones. That is where the N**2 law
begins to take its full toll. For large randomly ordered input, insertion
sort is ineffective. Yet there are certain inputs it sorts exceptionally
well because of its very nature, i.e. the inputs that are either completely
ordered or somewhat pre-ordered. For this reason, many much more efficient
and complex sorting algorithms, for instance, quicksort, use straight
insertion as their last stage, when the size of the largest disordered
partition becomes sufficiently small.
In his reply to Raymond, John Whittington <medisci@POWERNET.COM> wrote:
Raymond - from the vague dusty depths of my memory, I thought there was a
theoretical minimum of around NlogN for a sort of a random set of data?
'Order N' would only enable one to look at each element once, and it's
difficult to see how any sort could be undertaken on that basis.
Generally speaking, John is quite right. Yet there is a case when a totally
random input can be sorted in the linear time with the most primitive
algorithm imaginable: distribution counting sort, which is applicable (and
is the fastest amongst all existing sorting schemes) when key values fall
into a limited range. It is linear exactly for the reason John indicated: it
sorts its object completely by passing it just once. Typical examples
include sorting a digital string, one- or two-byte character array, or
integer array with limited set of values. Two or three days ago, Peter
Crawford and I applied this method (implicitly) to sort a macrovariable.
Michael Hines <mshines@PURDUE.EDU> replied to the original Raymond's
question by writing, in part:
SAS User's Guide: Basics: Version 5 Edition, pp 1033 states "The way PROC
SORT operates depends on the operating system and the sort utility being
used (Note 1). For many PROC SORT applications, the sort utility used
makes no difference. However, if you need to know more about the sort
utility or utilities at your installation, or where to find documentation
on sort utilities, check with your installation's technical staff.". On
our mainframe we have SYNCSORT installed, and have this set up as the
default option to be used. So the answer to the method used is in SYNCSORT
documentation, not SAS documentation. If you look at the references by OS
at the end of the chapter, there are a
variety of methods and options used. I'll not list all of them here.
Suffice to say "it depends on the OS, SAS uses the native sort utility on
This text creates an impression that SAS uses host sort at all times. It is
not true. By default, when the option SORTPGM=BEST is in effect, SAS chooses
either its native sort coded in the underlying software or the host sort
residing in the system. Or, it can be forced to use either one by assigning
SAS or HOST to the option. Performance differences between the two are
marginal, and for a good reason. Host sorts, be it ICEMAN, DFSORT or
SyncSort, cannot sort SAS files in their native format. Therefore, whenever
HOST is selected by SAS of by the programmer, the SAS file is converted
behind the scenes into a format the host can sort, that is, a flat file, the
latter is sorted, and then converted back into a SAS dataset. So, even
though host sorts vastly out-perform the native PROC SORT algorithm given
equal amount of data, their performance advantage is offset by the huge
additional I/O burden. This behavior is radically different from that of
PROC SYNCSORT, a completely separate product which is by no means a host
sort in the above sense, as it sorts SAS files in their native format. In
its manufacturer's own words, "PROC SYNCSORT achieves such gains by
providing a direct interface between SAS programs and SyncSort MVS. This
frees SyncSort to use the high-performance techniques--sophisticated access
methods, path length minimization, I/O optimization...". And yes, as opposed
to host sorts, PROC SYNCSORT results in truly incredible performance gains
compared to the native PROC SORT algorithm.
Later on, John Whittington added to his previous posting:
Whoops - I forget to include the most important part of my answer in my
previous message! I may be wrong, but I had always been led to believe that
PROC SORT uses one of the 'fastest in the average case' algorithms, probably
a Quick Sort. It is certainly 'fast' and seemingly quite well optimised.
People who 'try to be clever' in attempts to speed up sorting (e.g. by
splitting large data sets into bits, sorting them separately and then
merging) invariably end up with worse performance that PROC SORT - and I'm
pretty sure that I've never seen any sorting code implemented in SAS DATA
step code which has come close to PROC SORT. Indeed, it's actually quite
difficult to implement something like a Quick Sort in SAS code, because of
the problems of recursive programming in SAS.
Whatever algorithm is used by PROC SORT, I definitely disagree with the
statement that it is "certainly fast and seemingly well optimized". Judging
from the indisputable FACT that on the same machine, sorting the same SAS
file, PROC SYNCSORT gets the job done 2 to 5 times faster using 80% less
I/O, there seems to be some room where PROC SORT could be improved. Another
point against its being well optimized is an apparent bug related to its
behavior with KEEP= or DROP= options. Instead of accepting into sort only
those variables that are to be kept, PROC SORT sorts all input data and
drops unnecessary variables at the end of the process. It was convincingly
demonstrated by Michael A. Raithel at SESUG'98 in his paper "What Sort of
Input Should We Input to a Sort?". He showed, in particular, that code like
DATA DS_VIEW (KEEP=A B) / VIEW=DS_VIEW; SET DS_IN; RUN;
PROC SORT DATA=DS_VIEW OUT=DS_OUT; BY A B; RUN;
runs much faster than
PROC SORT DATA=DS_IN (KEEP=A B) OUT=DS_OUT; RUN;
and Ian Whitlock was quick to point out where precisely the root of the evil
lies. Of course, folks in SI are fully aware of all these circumstances, but
must think that it is unimportant.
On the other hand, I definitely agree that it is impossible to speed up
sorting by "splitting large data sets into bits, sorting them separately and
then merging" since the only advantage that can be thus gained is additional
EXCPs. This issue, however is vaguely related to the question whether a
custom sorting routine implemented in the SAS DATA step code can "come close
to PROC SORT" in performance. The latter depends on the task in hand and the
nature of data. Whilst trying to code a DATA step sort equivalent to PROC
SORT in all generality of the latter is hardly rational, there is a
plenitude of instances where it is not difficult and sufficiently justified
by improved performance. Here is one example taken from SAS-L. A SAS dataset
contains 300,000 rows and 3 variables ID, JUNK1, and JUNK2. In all, there
are 500,000 IDs in the system. We need to create a driver file of accounts
by sorting the input by ID, deleting the duplicates and dropping the junk.
Let us create an honest-to-goodness, totally random input and accomplish the
task using both PROC SORT and a DATA step version of distribution sort
(running 6.12/NT/P-333/128 M/6.12):
24 DATA IN (KEEP=ID JUNK1 JUNK2);
25 DO I=1 TO 300000;
26 ID = CEIL(RANUNI(1)*500000);
27 JUNK1 = RANUNI(2);
28 JUNK2 = RANUNI(3);
NOTE: The data set WORK.IN has 300000 observations and 3 variables.
NOTE: The DATA statement used 2.41 seconds.
33 PROC SORT DATA=IN (KEEP=ID) OUT=SORTED NODUPKEY;
34 BY ID;
NOTE: 74620 observations with duplicate key values were deleted.
NOTE: The data set WORK.SORTED has 225380 observations and 1 variables.
NOTE: The PROCEDURE SORT used 4.13 seconds.
37 DATA SORTED (DROP=I);
38 ARRAY Z(500000) _TEMPORARY_;
39 DO UNTIL (END);
40 SET IN (KEEP=ID) END=END;
41 Z(ID) + 1;
43 DO I=1 TO DIM(Z);
44 IF Z(I) NOTIN(.,0) THEN OUTPUT;
NOTE: The data set WORK.SORTED has 225380 observations and 1 variables.
NOTE: The DATA statement used 1.73 seconds.
It appears that in certain situations, DATA step code can come fairly close
to PROC SORT even in its own domain, i.e. sorting SAS datasets. In the case
an array needs to be sorted in the middle of the SAS DATA step, it is hardly
ever convenient to unload the data to a disk file, sort it using PROC SORT,
and read everything back again. If the array is small, straight insertion
will do just fine. If, regardless of its size, array's key values fall into
a narrow range, distribution counting will do it simpler and faster than
anything else. Finally, any array can be sorted reasonably fast by
quicksort. Contrary to common illusion, this algorithm does not contain
anything intrinsically recursive; merely in languages whose compiler is
capable of maintaining recursion stack it is more convenient to program this
way. In fact, the scheme of quicksort given by Donald E. Knuth is purely
iterative and relies on an explicit stack to keep track of the partitioning
process. This is exactly how it can be coded in the SAS DATA step, and I
have an iron-clad proof of it in the form of a decently performing program.
I hope to present this along with some satellite information in Miami. See
Paul M. Dorfman
Citibank UCS Decision Support