Date: Tue, 27 Oct 1998 16:03:44 -0500
Reply-To: "Dorfman, Paul"
Sender: "SAS(r) Discussion"
From: "Dorfman, Paul"
Subject: Summary: Sorting
Content-Type: text/plain; charset="iso-8859-1"

Dear SAS-L,

Raymond King has initiated an interesting thread involving several respondents. Instead of replying separately, I decided to bring my $.02 of contribution under "one roof".

Raymond, in part, wrote:

=========================
Does anyone know what algorithm is used as the default with PROC SORT? The recent discussion of SORTing efficiency jump-started my curiosity. Paul Dorfman's 'Insertion Sort' solution to a recent question was well received and proved to be "three times as fast" as other methods. Yet, Insertion Sort is on the order of N**2 in complexity, and there are more efficient sorting algorithms. It would be interesting to see how much better things can get using an algorithm that is on the order of N in complexity.
=========================

There are two unrelated issues here: 1) what is the underlying PROC SORT algorithm? and 2) what scheme could have been used in that particular program instead of insertion sort to speed things up?

I do not know the exact answer to the first question, but I know for certain that it is not insertion sort. It would benefit PROC SORT performance if there were two or three algorithms that PROC SORT could intelligently select from depending on the input, but I doubt this is the case. More probably, one good all-around scheme is used. Quicksort, heapsort, or some variety of merge sort would all be good candidates.

The answer to the second question: with only 10 elements to be sorted, it did not really matter which algorithm to choose, so I chose the shortest. Remember, the task was to sort an array of 10 character variables in every observation; that is why the code ran "three times as fast" as other methods, not because of some intrinsic advantage of straight insertion.
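For readers who want the algorithm itself rather than the SAS code under discussion, here is a minimal, language-neutral sketch of straight insertion (written in Python purely for illustration; the original program was a DATA step):

```python
def insertion_sort(a):
    """Straight insertion: take each element in turn and slide it left
    past every larger element until it is in place.  Worst case is
    O(n**2) comparisons, but on a tiny array -- or one that is already
    nearly ordered -- almost no sliding happens, which is why it was a
    reasonable choice for 10 elements."""
    for i in range(1, len(a)):
        key = a[i]
        j = i - 1
        while j >= 0 and a[j] > key:  # shift larger elements right
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key                # drop the element into its slot
    return a
```

On an already-sorted input the inner `while` loop never executes, so the whole sort degenerates to a single O(n) scan, which anticipates the point about pre-ordered inputs below.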
In that case, even if the sorting step in the DATA-step-PROC-SORT-DATA-step method took zero seconds to run, it would not have made any difference, due to the nature of the task. All we had to do was sort many small objects already residing in memory; the worst sorting routine will do that faster than the best sorting routine accepting all this data as one huge, randomly unordered disk file. "Good" sorts and "bad" sorts differ not so much in how fast they sort small inputs as in how well they handle large ones. That is where the N**2 law begins to take its full toll: for large, randomly ordered input, insertion sort is ineffective. Yet there are certain inputs it sorts exceptionally well because of its very nature, namely inputs that are either completely ordered or somewhat pre-ordered. For this reason, many far more efficient and complex sorting algorithms, quicksort for instance, use straight insertion as their last stage, when the size of the largest disordered partition becomes sufficiently small.

In his reply to Raymond, John Whittington wrote:

========================
Raymond - from the vague dusty depths of my memory, I thought there was a theoretical minimum of around NlogN for a sort of a random set of data? 'Order N' would only enable one to look at each element once, and it's difficult to see how any sort could be undertaken on that basis.
========================

Generally speaking, John is quite right. Yet there is a case when a totally random input can be sorted in linear time with the most primitive algorithm imaginable: distribution counting sort, which is applicable (and is the fastest among all existing sorting schemes) when the key values fall into a limited range. It is linear exactly for the reason John indicated: it sorts its object completely by passing over it just once. Typical examples include sorting a digit string, a one- or two-byte character array, or an integer array with a limited set of values.
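The mechanics of distribution counting can be sketched in a few lines (again in Python for illustration only; the DATA step version appears later in this post). One pass tallies the keys, and one pass over the fixed key range emits them in order:

```python
def counting_sort(keys, max_key):
    """Distribution counting sort for integer keys in 1..max_key.
    A single pass over the data tallies each key value; a pass over
    the (limited, fixed) range then writes the keys out in order.
    Total work is O(n + max_key) -- linear when the range is small."""
    counts = [0] * (max_key + 1)
    for k in keys:                    # the one and only pass over the data
        counts[k] += 1
    out = []
    for k in range(max_key + 1):      # pass over the range, not the data
        out.extend([k] * counts[k])
    return out
```

Note that no key is ever compared with another, which is how the scheme slips under the NlogN comparison-sort bound John mentions: the bound applies only to sorts that work by comparisons.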
Two or three days ago, Peter Crawford and I applied this method (implicitly) to sort a macro variable.

Michael Hines replied to Raymond's original question by writing, in part:

=================================
SAS User's Guide: Basics, Version 5 Edition, p. 1033 states: "The way PROC SORT operates depends on the operating system and the sort utility being used (Note 1). For many PROC SORT applications, the sort utility used makes no difference. However, if you need to know more about the sort utility or utilities at your installation, or where to find documentation on sort utilities, check with your installation's technical staff." On our mainframe we have SYNCSORT installed, and have this set up as the default option to be used. So the answer to the method used is in the SYNCSORT documentation, not the SAS documentation. If you look at the references by OS at the end of the chapter, there are a variety of methods and options used. I'll not list all of them here. Suffice it to say "it depends on the OS; SAS uses the native sort utility on the OS".
==================================

This text creates the impression that SAS uses the host sort at all times. That is not true. By default, when the option SORTPGM=BEST is in effect, SAS chooses either its native sort coded in the underlying software or the host sort residing in the system. Or it can be forced to use either one by assigning SAS or HOST to the option. Performance differences between the two are marginal, and for a good reason: host sorts, be it ICEMAN, DFSORT, or SyncSort, cannot sort SAS files in their native format. Therefore, whenever HOST is selected by SAS or by the programmer, the SAS file is converted behind the scenes into a format the host can sort, that is, a flat file; the latter is sorted and then converted back into a SAS data set. So even though host sorts vastly outperform the native PROC SORT algorithm given an equal amount of data, their performance advantage is offset by the huge additional I/O burden.
This behavior is radically different from that of PROC SYNCSORT, a completely separate product which is by no means a host sort in the above sense, as it sorts SAS files in their native format. In its manufacturer's own words, "PROC SYNCSORT achieves such gains by providing a direct interface between SAS programs and SyncSort MVS. This frees SyncSort to use the high-performance techniques--sophisticated access methods, path length minimization, I/O optimization...". And yes, as opposed to host sorts, PROC SYNCSORT results in truly incredible performance gains compared to the native PROC SORT algorithm.

Later on, John Whittington added to his previous posting:

=============================
Whoops - I forgot to include the most important part of my answer in my previous message! I may be wrong, but I had always been led to believe that PROC SORT uses one of the 'fastest in the average case' algorithms, probably a Quick Sort. It is certainly 'fast' and seemingly quite well optimised. People who 'try to be clever' in attempts to speed up sorting (e.g. by splitting large data sets into bits, sorting them separately and then merging) invariably end up with worse performance than PROC SORT - and I'm pretty sure that I've never seen any sorting code implemented in SAS DATA step code which has come close to PROC SORT. Indeed, it's actually quite difficult to implement something like a Quick Sort in SAS code, because of the problems of recursive programming in SAS.
=============================

Whatever algorithm is used by PROC SORT, I definitely disagree with the statement that it is "certainly fast and seemingly well optimised". Judging from the indisputable FACT that on the same machine, sorting the same SAS file, PROC SYNCSORT gets the job done 2 to 5 times faster using 80% less I/O, there seems to be some room where PROC SORT could be improved. Another point against its being well optimized is an apparent bug related to its behavior with the KEEP= and DROP= options.
Instead of accepting into the sort only those variables that are to be kept, PROC SORT sorts all input data and drops the unnecessary variables at the end of the process. This was convincingly demonstrated by Michael A. Raithel at SESUG '98 in his paper "What Sort of Input Should We Input to a Sort?". He showed, in particular, that code like

   DATA DS_VIEW (KEEP=A B) / VIEW=DS_VIEW;
      SET DS_IN;
   RUN;
   PROC SORT DATA=DS_VIEW OUT=DS_OUT;
      BY A B;
   RUN;

runs much faster than

   PROC SORT DATA=DS_IN (KEEP=A B) OUT=DS_OUT;
      BY A B;
   RUN;

and Ian Whitlock was quick to point out where precisely the root of the evil lies. Of course, the folks in SI are fully aware of all these circumstances, but must think it unimportant.

On the other hand, I definitely agree that it is impossible to speed up sorting by "splitting large data sets into bits, sorting them separately and then merging", since the only advantage that can thus be gained is additional EXCPs. This issue, however, is only vaguely related to the question of whether a custom sorting routine implemented in SAS DATA step code can "come close to PROC SORT" in performance. The latter depends on the task at hand and the nature of the data. Whilst trying to code a DATA step sort equivalent to PROC SORT in all the generality of the latter is hardly rational, there is a plenitude of instances where it is not difficult and is sufficiently justified by improved performance. Here is one example taken from SAS-L. A SAS data set contains 300,000 rows and 3 variables: ID, JUNK1, and JUNK2. In all, there are 500,000 IDs in the system. We need to create a driver file of accounts by sorting the input by ID, deleting the duplicates, and dropping the junk.
Let us create an honest-to-goodness, totally random input and accomplish the task using both PROC SORT and a DATA step version of distribution sort (running under 6.12/NT/P-333/128 M):

   24   DATA IN (KEEP=ID JUNK1 JUNK2);
   25      DO I=1 TO 300000;
   26         ID = CEIL(RANUNI(1)*500000);
   27         JUNK1 = RANUNI(2);
   28         JUNK2 = RANUNI(3);
   29         OUTPUT;
   30      END;
   31   RUN;
   NOTE: The data set WORK.IN has 300000 observations and 3 variables.
   NOTE: The DATA statement used 2.41 seconds.
   32
   33   PROC SORT DATA=IN (KEEP=ID) OUT=SORTED NODUPKEY;
   34      BY ID;
   35   RUN;
   NOTE: 74620 observations with duplicate key values were deleted.
   NOTE: The data set WORK.SORTED has 225380 observations and 1 variables.
   NOTE: The PROCEDURE SORT used 4.13 seconds.
   36
   37   DATA SORTED (DROP=I);
   38      ARRAY Z(500000) _TEMPORARY_;
   39      DO UNTIL (END);
   40         SET IN (KEEP=ID) END=END;
   41         Z(ID) + 1;
   42      END;
   43      DO I=1 TO DIM(Z);
   44         IF Z(I) NOT IN (., 0) THEN DO; ID = I; OUTPUT; END;
   45      END;
   46   RUN;
   NOTE: The data set WORK.SORTED has 225380 observations and 1 variables.
   NOTE: The DATA statement used 1.73 seconds.

It appears that in certain situations, DATA step code can come fairly close to PROC SORT even in its own domain, i.e. sorting SAS data sets.

In the case where an array needs to be sorted in the middle of a SAS DATA step, it is hardly ever convenient to unload the data to a disk file, sort it using PROC SORT, and read everything back in again. If the array is small, straight insertion will do just fine. If, regardless of its size, the array's key values fall into a narrow range, distribution counting will do it more simply and faster than anything else. Finally, any array can be sorted reasonably fast by quicksort. Contrary to a common illusion, this algorithm does not contain anything intrinsically recursive; it is merely more convenient to program it that way in languages whose compilers are capable of maintaining a recursion stack. In fact, the scheme of quicksort given by Donald E. Knuth is purely iterative and relies on an explicit stack to keep track of the partitioning process.
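To make the explicit-stack idea concrete, here is a sketch of an iterative quicksort in the spirit of Knuth's scheme (in Python for illustration; this is not the SAS program mentioned in this post, and the partitioning details are one of several standard choices):

```python
def quicksort(a):
    """Iterative quicksort: an explicit stack of (lo, hi) subarray
    bounds replaces the recursion, so the same control flow can be
    reproduced in languages without recursive calls."""
    stack = [(0, len(a) - 1)]
    while stack:
        lo, hi = stack.pop()
        if lo >= hi:                  # partition of size 0 or 1: done
            continue
        pivot = a[(lo + hi) // 2]
        i, j = lo, hi
        while i <= j:                 # Hoare-style partition
            while a[i] < pivot:
                i += 1
            while a[j] > pivot:
                j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i += 1
                j -= 1
        stack.append((lo, j))         # push both partitions for later
        stack.append((i, hi))
    return a
```

A production version would push the larger partition first (bounding the stack depth at about log2 N entries) and switch to straight insertion once partitions become small, exactly the hybrid described earlier in this post.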
This is exactly how it can be coded in the SAS DATA step, and I have iron-clad proof of it in the form of a decently performing program. I hope to present it along with some satellite information in Miami. See you there!

Kind regards,
Paul.

+++++++++++++++++++++++++++++++++
Paul M. Dorfman
Citibank UCS Decision Support
Jacksonville, FL
+++++++++++++++++++++++++++++++++
