orlp / pdqsort Goto Github PK

Pattern-defeating quicksort.

License: zlib License

Python 11.10% C++ 88.90%

pdqsort's Introduction

pdqsort

Pattern-defeating quicksort (pdqsort) is a novel sorting algorithm that combines the fast average case of randomized quicksort with the fast worst case of heapsort, while achieving linear time on inputs with certain patterns. pdqsort is an extension and improvement of David Mussers introsort. All code is available for free under the zlib license.

Best        Average     Worst       Memory      Stable      Deterministic
n           n log n     n log n     log n       No          Yes

Usage

pdqsort is a drop-in replacement for std::sort. Just replace a call to std::sort with pdqsort to start using pattern-defeating quicksort. If your comparison function is branchless, you can call pdqsort_branchless for a potential big speedup. If you are using C++11, the type you're sorting is arithmetic and your comparison function is not given or is std::less/std::greater, pdqsort automatically delegates to pdqsort_branchless.

Benchmark

A comparison of pdqsort and GCC's std::sort and std::stable_sort with various input distributions:

Compiled with -std=c++11 -O2 -m64 -march=native.

Visualization

A visualization of pattern-defeating quicksort sorting a ~200 element array with some duplicates. Generated using Timo Bingmann's The Sound of Sorting program, a tool that has been invaluable during the development of pdqsort. For the purposes of this visualization the cutoff point for insertion sort was lowered to 8 elements.

The best case

pdqsort is designed to run in linear time for a couple of best-case patterns. Linear time is achieved for inputs that are in strictly ascending or descending order, only contain equal elements, or are strictly in ascending order followed by one out-of-place element. There are two separate mechanisms at play to achieve this.

For equal elements a smart partitioning scheme is used that always puts equal elements in the partition containing elements greater than the pivot. When a new pivot is chosen it's compared to the greatest element in the partition before it. If they compare equal we can derive that there are no elements smaller than the chosen pivot. When this happens we switch strategy for this partition, and filter out all elements equal to the pivot.

To get linear time for the other patterns we check after every partition if any swaps were made. If no swaps were made and the partition was decently balanced we will optimistically attempt to use insertion sort. This insertion sort aborts if more than a constant amount of moves are required to sort.

The average case

On average case data where no patterns are detected pdqsort is effectively a quicksort that uses median-of-3 pivot selection, switching to insertion sort if the number of elements to be (recursively) sorted is small. The overhead associated with detecting the patterns for the best case is so small it lies within the error of measurement.

pdqsort gets a great speedup over the traditional way of implementing quicksort when sorting large arrays (1000+ elements). This is due to a new technique described in "BlockQuicksort: How Branch Mispredictions don't affect Quicksort" by Stefan Edelkamp and Armin Weiss. In short, we bypass the branch predictor by using small buffers (entirely in L1 cache) of the indices of elements that need to be swapped. We fill these buffers in a branch-free way that's quite elegant (in pseudocode):

buffer_num = 0; buffer_max_size = 64;
for (int i = 0; i < buffer_max_size; ++i) {
    // With branch:
    if (elements[i] < pivot) { buffer[buffer_num] = i; buffer_num++; }
    // Without:
    buffer[buffer_num] = i; buffer_num += (elements[i] < pivot);
}

This is only a speedup if the comparison function itself is branchless, however. By default pdqsort will detect this if you're using C++11 or higher, the type you're sorting is arithmetic (e.g. int), and you're using either std::less or std::greater. You can explicitly request branchless partitioning by calling pdqsort_branchless instead of pdqsort.

The worst case

Quicksort naturally performs bad on inputs that form patterns, due to it being a partition-based sort. Choosing a bad pivot will result in many comparisons that give little to no progress in the sorting process. If the pattern does not get broken up, this can happen many times in a row. Worse, real world data is filled with these patterns.

Traditionally the solution to this is to randomize the pivot selection of quicksort. While this technically still allows for a quadratic worst case, the chances of it happening are astronomically small. Later, in introsort, pivot selection is kept deterministic, instead switching to the guaranteed O(n log n) heapsort if the recursion depth becomes too big. In pdqsort we adopt a hybrid approach, (deterministically) shuffling some elements to break up patterns when we encounter a "bad" partition. If we encounter too many "bad" partitions we switch to heapsort.

Bad partitions

A bad partition occurs when the position of the pivot after partitioning is under 12.5% (1/8th) percentile or over 87,5% percentile - the partition is highly unbalanced. When this happens we will shuffle four elements at fixed locations for both partitions. This effectively breaks up many patterns. If we encounter more than log(n) bad partitions we will switch to heapsort.

The 1/8th percentile is not chosen arbitrarily. An upper bound of quicksorts worst case runtime can be approximated within a constant factor by the following recurrence:

T(n, p) = n + T(p(n-1), p) + T((1-p)(n-1), p)

Where n is the number of elements, and p is the percentile of the pivot after partitioning. T(n, 1/2) is the best case for quicksort. On modern systems heapsort is profiled to be approximately 1.8 to 2 times as slow as quicksort. Choosing p such that T(n, 1/2) / T(n, p) ~= 1.9 as n gets big will ensure that we will only switch to heapsort if it would speed up the sorting. p = 1/8 is a reasonably close value and is cheap to compute on every platform using a bitshift.

pdqsort's People

Contributors

Stargazers

Watchers

Forkers

tml sehe devo2012 weissan meai1 trianglespct se77en puranjaigarg996 ericchen2027 ttldtor starbuck10 wlmgithub theactualwalko yury-pachin ryanguest jasongithub vdt dodng dnbaker hhy5277 palcode lenaic idianale shabesoglu mojang mldbai weimingtom tartavull oyefremov nangal bmharper opsocket optimalbucket vookimedlo aditya7fb mottmatt cc5nth zhuomingliang stewartcanva feicheng-oi dendisuhubdy haikusw 5l1v3r1 michaelaxtmann shards-lang davepuchyr clayne alfex4936 ips4o jsilvao crabbly org-mars olegnn jrade gsjaardema justanotherdot kindofblue denis2glez ducrp chandanksingh matsjonsson ajunlonglive liujp junior-2016 asdlei99 firedtoad zhaofuheng edoardopirovano leoyy sdicke dut3062796s b-xiang wedusk101 phial3 felix0080 zruibin yujiang0030 super-rain serein404 zjs604381586 xiaoyi1972 aaaaaaron yanjunhaha feiyunwill cleohari i10416 drruisseau kermado thamwangjun leoska mdejong goldsteinn beamiter zard-c ymatrix-data refresh-bio-sandbox allaboutevemirolive icantjuddle anabii99

pdqsort's Issues

pdqsort is slower than qsort when compare function is slow

I tried to use pdqsort as an implementation of Array#sort method for the programming language Ruby. Unfortunately, it was about twice slower than the current implementation (based on glibc's qsort(3)) against random input.

It seems that pdqsort tends to call compare function more often than qsort(3). Here is a self-contained demonstration:

$ g++ -O3 -o cmp-test cmp-test.cpp

$ ./cmp-test qsort
1536197

$ ./cmp-test pdqsort
1861162

$ ./cmp-test std::sort
1978708

#include <vector>
#include "pdqsort.h"

#define N 100000

int counter;

int cmp(const void *a, const void *b) {
    long c = *(long*)a - *(long*)b;
    counter++;
    return c < 0 ? -1 : c > 0 ? 1 : 0;
}

int main(int argc, char *argv[]) {
    std::vector<long> vec;

    // reproducible random input
    long x = 1;
    for (int i = 0; i < N; i++) {
        vec.push_back(x);
        x = (48271 * x) & 0xffffffff;
    }

    counter = 0;

    std::string mode(argv[1]);

    if (mode == "qsort") {
        qsort(vec.data(), vec.size(), sizeof(long), cmp);
    }
    else if (mode == "pdqsort") {
        pdqsort(vec.begin(), vec.end(), [] (const auto &a, const auto &b) {
                counter++;
                return a < b;
        });
    }
    else if (mode == "std::sort") {
        std::sort(vec.begin(), vec.end(), [] (const auto &a, const auto &b) {
                counter++;
                return a < b;
        });
    }

    // test
    //for (int i = 0; i < N; i++) {
    //    printf("%ld\n", vec[i]);
    //}

    printf("%d\n", counter);
}

Because Ruby's compare function is slow (it invokes Ruby-level method for each comparison), qsort(3) implementation is faster.

Is this a known characteristic of pdqsort? I know that there is no perfect sorting algorithm in every respect, but I'd like to report this just for the possibility that it may be an unknown room for improvement. If it is known, I'm sorry, feel free to close this issue. If there is a way to reduce the number of comparison at the expense of some speed, I'd be glad to hear about it.

Support for C++17 parallel execution?

As a drop-in replacement for std::sort, do you have plans to add support for parallel execution in pdqsort?

As I'm sure you know, sort can now do this,

std::sort(std::execution::par, lines.begin(), lines.end());

Branchless Lomuto

Hi Orson,

My name is Ellie.

I am a computer science graduate saving to return to school, and I have some extra time on my hands. I am primarily interested in researching AI but also dabble in sorting. I make money working on embedded systems, where memory overhead discourages most library authors from using quicksort. I am attempting to write a generic library sort in C for these systems (with no more than 30*8 bytes of static RAM for the interval stack, preferably less).

During my embedded sorting journey, I've written many branchful pdqsort variants. I've even sped things up slightly with 2-3 pivot partitioning. Still, I couldn't compete with state-of-the-art partitioning schemes like block partitioning. I couldn't use them either, as they seemed to require significant (albeit constant) extra memory.

A few nights ago, I was thrilled to read your blog post and Lukas Bergdoll's write-up on branchless Lomuto partitioning, and I needed to try it for myself. Of course, nothing could have prepared me for its actual performance (on arrays of integers, at least). It is safe to say it blew me away.

I found that a pdqsort variant with branchless Lomuto can outperform the current pdqsort on several patterns in your benchmark, most notably shuffled data.

As you know, Lomuto partitioning does suffer on descending data. However, I also found that swap-rotating data when it is likely descending seems to remedy this without significantly affecting performance on the other patterns.

I think it would be very cool if you could incorporate your Lomuto partitioning into your pdqsort (for integer types).

Here is my little experiment (There may be semantic errors. I haven't slept in a bit lol.)

Cheers,

Ellie

pdqselect?

I'm using pqdsort in my ranges library (slightly modified to support projections, proxy iterators and constexpr evaluation). I think it's fantastic, and it's proven almost always faster (and never slower) than both libc++ and libstdc++ std::sort() in every benchmark I've tried.

Have you given any thought to writing a companion "pdqselect" algorithm to implement std::nth_element()? As far as I know all the mainstream C++ implementations of nth_element use introselect, but it seems that (in principle at least) this algorithm could benefit from the same optimisations that pdqsort uses relative to introsort.

(A quick Google turned up this Rust crate, but it doesn't seem to be part of the Rust standard library yet.)

pdqsort interface cmake

Hi,
I'd like to attempt to sue pdqsort through some package manager for C++ like CPM.

To that end, I'd need this repo to incorporate some minimal CMakeLists.txt file that would roughly look like:

project(pdqsort CXX)

set(CMAKE_CXX_STANDARD 11)

add_library(pdqsort INTERFACE)
target_include_directories(pdqsort INTERFACE .)

This would be sufficient for CPM to figure out what compiler flags to pass to other cmake based projects that want to use pdqsort.

Is that OK from your side @orlp or would you rather not include anything like that in your repo?

Integer overflows in partition_right_branchless

Looks like there are multiple possible integer overflows in partition_right_branchless.

The first kind of overflow happens on the case when the sequence is already correctly partitioned. In this case first >= last and therefore there is an overflow in the loop condition: while (last - first > 2 * block_size) { This could be fixed by this tiny patch (I have not changed the indentation to show the idea):

--- a/pdqsort.h
+++ b/pdqsort.h
@@ -224,7 +224,6 @@ namespace pdqsort_detail {
         if (!already_partitioned) {
             std::iter_swap(first, last);
             ++first;
-        }

         // The following branchless partitioning is derived from "BlockQuicksort: How Branch
         // Mispredictions don’t affect Quicksort" by Stefan Edelkamp and Armin Weiss.
@@ -325,6 +324,7 @@ namespace pdqsort_detail {
             while (num_r--) std::iter_swap(last - offsets_r[num_r], first), ++first;
             last = first;
         }
+        }

         // Put the pivot in the right place.
         Iter pivot_pos = first - 1;

However, the use of int variables there is also a bit suspicious. I have not checked all possible code paths to ensure that they cannot overflow in some cases though

error handling

Fantastic work. Not an issue more of a question.
standard C++ lib is not "exception free", but I will ask anyway.
Can we use this lib in "exception free" programs? Using let's say MSVC.

Benchmark suggestion - pdqsort vs Quicksort 'Magnetica'

Hi Orson,
just was asked how my Quicksort 'Magnetica' fares against pdqsort:
Gaming32/ArrayV#73 (comment)

Please consider running them in some (VISUAL) benchmark. Love seeing those bars moved around.
Here it is, in qsm.h:
https://www.qb64.org/forum/index.php?topic=3518.msg138505#msg138505

Compile warnings MSVC142/143

When sorting 64 bit ints I got some warnings on windows/MSVC142
pdqsort\pdqsort.h(278,26): warning C4244: 'initializing': conversion from '__int64' to 'int', possible loss of data ...
pdqsort\pdqsort.h(143,1): warning C4244: '+=': conversion from '__int64' to 'int', possible loss of data

I can resolve them locally by replacing int with int64

Including to Boost.Sort

Hello.

I have seen, that you wanted to include pdfqsort algorithm to Boost.Sort. What the status of integrating to Boost.Sort now? Can i help you in any way?

pdqsort requires Iter to be default constructible

Currently pdqsort require Iter to be default constructible due to the following code in pdqsort_loop:

std::pair<Iter, bool> part_result;
if (Branchless) part_result = partition_right_branchless(begin, end, comp);
else part_result = partition_right(begin, end, comp);

Changing this to

std::pair<Iter, bool> part_result =
  (Branchless ?
    partition_right_branchless(begin, end, comp) : partition_right(begin, end, comp));

fixes the problem.

Check for i586 instead of i386

When defining rdtsc, you ought to be checking for atleast i586 instead of i386.