External sorting is a class of sorting algorithms that can handle massive amounts of data. External sorting is required when the data being sorted do not fit into the main memory of a computing device (usually RAM) and instead they must reside in the slower external memory, usually a disk drive. Thus, external sorting algorithms are external memory algorithms and thus applicable in the external memory model of computation.
External sorting algorithms generally fall into two types, distribution sorting, which resembles quicksort, and external merge sort, which resembles merge sort. The latter typically uses a hybrid sort-merge strategy. In the sorting phase, chunks of data small enough to fit in main memory are read, sorted, and written out to a temporary file. In the merge phase, the sorted subfiles are combined into a single larger file.
External sorting algorithms can be analyzed in the external memory model. In this model, a cache or internal memory of size M and an unbounded external memory are divided into blocks of size B, and the running time of an algorithm is determined by the number of memory transfers between internal and external memory. Like their cache-oblivious counterparts, asymptotically optimal external sorting algorithms achieve a running time (in Big O notation) of .
The algorithm first sorts M items at a time and puts the sorted lists back into external memory. It then recursively does a -way merge on those sorted lists. To do this merge, B elements from each sorted list are loaded into internal memory, and the minimum is repeatedly outputted.
For example, for sorting 900 megabytes of data using only 100 megabytes of RAM:
Historically, instead of a sort, sometimes a replacement-selection algorithm was used to perform the initial distribution, to produce on average half as many output chunks of double the length.
The previous example is a two-pass sort: first sort, then merge. The sort ends with a single k-way merge, rather than a series of two-way merge passes as in a typical in-memory merge sort. This is because each merge pass reads and writes every value from and to disk, so reducing the number of passes more than compensates for the additional cost of a k-way merge.
The limitation to single-pass merging is that as the number of chunks increases, memory will be divided into more buffers, so each buffer is smaller. Eventually, the reads become so small that more time is spent on disk seeks than data transfer. A typical magnetic hard disk drive might have a 10 ms access time and 100 MB/s data transfer rate, so each seek takes as much time as transferring 1 MB of data.
Thus, for sorting, say, 50 GB in 100 MB of RAM, using a single 500-way merge pass isn't efficient: we can only read 100 MB / 501 ≈ 200 KB from each chunk at once, so 5/6 of the disk's time is spent seeking. Using two merge passes solves the problem. Then the sorting process might look like this:
Although this requires an additional pass over the data, each read is now 4 MB long, so only 1/5 of the disk's time is spent seeking. The improvement in data transfer efficiency during the merge passes (16.6% to 80% is almost a 5× improvement) more than makes up for the doubled number of merge passes.
Variations include using an intermediate medium like solid-state disk for some stages, even if there isn't enough of it to hold the full dataset. Repeating the example above with 20 GB of temporary storage on SSD, the first pass could merge 200×100 MB sorted chunks read from that temporary space to write 20GB sorted chunks to HDD. The 200-way merge with 500kb buffers is reasonably efficient due to the SSD's lower random-read latency (on the order of 0.1ms), the first pass can benefit from an SSD's higher read/write bandwidth, and the second pass has fewer larger chunks to merge and therefore fewer HDD seeks. Given the relatively low cost of SSD space, it can be an economical tool for sorting large inputs with very limited memory.
Like in-memory sorts, efficient external sorts require O(n log n) time: linear increases in data size require logarithmic increases in the number of passes, and each pass takes a linear number of reads and writes. Using the large memory sizes provided by modern computers the logarithmic factor grows very slowly. Under reasonable assumptions at least 500 GB of data stored on a hard drive can be sorted using 1 GB of main memory before a third pass becomes advantageous, and many times that much data can be sorted before a fourth pass becomes useful. Low-seek-time media like solid-state drives (SSDs) also increase the amount that can be sorted before additional passes improve performance.
Main memory size is important. Doubling memory dedicated to sorting halves the number of chunks and the number of reads per chunk, reducing the number of seeks required by about three-quarters. The ratio of RAM to disk storage on servers often makes it convenient to do huge sorts on a cluster of machines rather than on one machine with multiple passes.
External distribution sort is analogous to quicksort. The algorithm finds approximately pivots and uses them to divide the N elements into approximately equally sized subarrays, each of whose elements are all smaller than the next, and then recurse until the sizes of the subarrays are less than the block size. When the subarrays are less than the block size, sorting can be done quickly because all reads and writes are done in the cache, and in the external memory model requires operations.
However, finding exactly pivots would not be fast enough to make the external distribution sort asymptotically optimal. Instead, we find slightly fewer pivots. To find these pivots, the algorithm splits the N input elements into chunks, and takes every elements, and recursively uses the median of medians algorithm to find pivots.
The Sort Benchmark, created by computer scientist Jim Gray, compares external sorting algorithms implemented using finely tuned hardware and software. Winning implementations use several techniques: