Technology and Business: Under the hoods of sorting

We all use library routines for sorting data. I suspect for more than 99% of applications, there is no good reason to justify any custom implementation. I did some readings to refresh some important points for the popular sorting algorithms.

Popular sorting algorithms
The most popular sorting algorithms used now are still the same as 20 years ago. Merge sort, quick sort, heap sort and insertion sort have been used for a very long time. What has changed drastically during last 20 years is the amount of physical memory we have in computers (we are spoiled now). The abundance of memory makes these internal sorting algorithms even more popular. During the times when amount of memory was limited, we had to study external sorting algorithms which rely on external I/O for large data sets.

Time performance
It is important to understand the average and worse case performance of an algorithm. For example, merge sort, quick sort and heap sort all have average O(n log n) runtime. However, performance of quick sort degrades to O(n²) for a sorted list, with the 1st element selected as pivot.. There are workaround solutions to reduce the chance of hitting the worse case scenario.

If the time consumed in sorting is relatively small (e.g. 5% of total), it might not be worthwhile for us to tune the performance. We should review the other 95% first. Only if there was no room for improvement (highly unlikely), then we should determine if it would still worth our money to tune the remaining 5%.

We should also be aware if stable sort is important or not. Some implementations trade off performance without making stable sort a requirement. Merge sort is a stable sort. Quick sort and heap sort are not.

Memory usage
Nowadays, in most situations, memory will not be the limiting factor. However, it is still important to know the memory requirement. For example, heap sort requires O(1), quick sort requires O(log n) and merge sort requires O(n). If memory is the primary factor, your decision might change, with all three algorithms having similar average time complexity.

Language support
Java uses merge sort, quick sort or insertion sort as default for different data types and array size. Perl moved from quick sort to merge sort (Perl 5.8) recently.

Most of us also do a lot of sorting in relational database using the SQL ORDER BY construct. The obvious question would be how the database sort compares to the in-memory sorts. We can easily do some research on the internet and verify by running on production level hardware for benchmarks. The available application server hardware to perform in-memory sorts could be very different from the database servers..

Takeaways
There is always benefit in understanding some details of a library function with respect to what we are trying to solve. This will help us understand better of what we are doing.

Sorting (or other data transformation and enrichment) using parallel CPUs and computers have gained popularity in recent years. There are a lot of parallel processing algorithms and tools available. MapReduce and Hadoop are two examples.

Please let me know if you have any thoughts or different views on this.

Technology and Business

Friday, May 15, 2009

Under the hoods of sorting

No comments:

Post a Comment

Blog Archive

About Me

Twitter Updates

Twitter Updates

Followers