skarupke / ska_sort


233.0 27.0 24.0 14 KB

C++ 100.00%


ska_sort's Issues

Applying the "vacancy" half-swap trick to ska sort

As far as I can see, during the phase where each element gets swapped into its final bucket (one of 256 for the byte currently being compared), moving an element takes (more or less) two swaps: one out of its initial position and one into its final position. Since a swap takes three assignments and moves two elements, that works out to about three assignments per element.

Andrei Alexandrescu introduced the "vacancy" trick for Hoare's partition, which basically turns swaps into "half-swaps", reducing three assignments to two (or one and a half per element to one).

This doesn't directly translate to radix sort, but after a bit of tinkering I came up with a way: basically, use two temporaries and ping-pong between them:

- Initialize the first swap by moving the first element to the second temporary.
- On every odd swap, the element moving to its final position is already stored in the second temporary. Assign the element being "swapped out" to the first temporary, then assign the element in the second temporary to its final position.
- On every even swap, the element moving to its final position is already stored in the first temporary. Assign the element being "swapped out" to the second temporary, then assign the element in the first temporary to its final position.

I don't know if this wording is easy to follow, but in essence it reduces the three assignments required to move an element to two, which, given the cost of memory accesses, could be quite significant.
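The ping-pong scheme can be sketched as one in-place 256-bucket pass in American-flag-sort style. This is a hypothetical `ping_pong_bucket_pass`, not ska_sort's actual loop; it assumes `T` is default-constructible and movable and that `digit(x)` returns a bucket index in [0, 256). Each element travels along a permutation cycle, and every step costs two moves (displaced element into the free temporary, carried element into its slot) instead of the three of a `std::swap`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: one in-place 256-bucket pass using two temporaries
// that alternate roles ("second" starts with the first lifted element).
template <typename T, typename DigitFn>
void ping_pong_bucket_pass(std::vector<T>& a, DigitFn digit)
{
    std::array<std::size_t, 256> next{}, end{};
    for (const T& x : a) ++end[digit(x)];  // histogram
    std::size_t offset = 0;
    for (std::size_t b = 0; b < 256; ++b) {
        next[b] = offset;                  // first unplaced slot of bucket b
        offset += end[b];
        end[b] = offset;                   // one past the end of bucket b
    }

    for (std::size_t b = 0; b < 256; ++b) {
        while (next[b] < end[b]) {
            const std::size_t hole = next[b];  // slot vacated at cycle start
            T second = std::move(a[hole]);     // initial element -> second temp
            T first{};
            bool carried_in_second = true;
            std::size_t d = digit(second);
            while (d != b) {
                const std::size_t pos = next[d]++;
                assert(pos < end[d]);          // bucket d still has room
                if (carried_in_second) {       // odd step
                    first = std::move(a[pos]); // displaced element -> first
                    a[pos] = std::move(second);
                    d = digit(first);
                } else {                       // even step
                    second = std::move(a[pos]);
                    a[pos] = std::move(first);
                    d = digit(second);
                }
                carried_in_second = !carried_in_second;
            }
            // The carried element belongs to bucket b: it fills the hole.
            a[hole] = carried_in_second ? std::move(second) : std::move(first);
            ++next[b];
        }
    }
}
```

After the pass, elements are grouped by digit in increasing bucket order; recursing into each bucket on the next byte would give an MSD sort.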

Framework for custom "item comparators"

I propose the following framework for custom "item comparators", with the built-in functionality implemented on top of it.

The low-level API lets the user choose how many buckets are used on each pass:

template <class LowLevelCustomComparator>
class LowLevelSorter { /* ... */ };  // Here we should perform all the real work!

template <typename T>
struct LowLevelCustomComparator
{
    // Total number of radix-sort passes over the data, or 0 for a variable number
    static int passes;

    // How many buckets to use on this pass
    template <int pass>  static int buckets();

    // Compute the bucket for an item on this pass
    template <int pass>  static int bucket (T item);

    // Bucket used for items that don't need any more sorting, e.g. 0 for strings
    template <int pass>  static int final_bucket();
};

The high-level API only specifies how many bits are in the sort key. The library provides an implementation of this API that calls into the low-level API, using the optimal settings (for this particular machine) for each pass:

template <class HighLevelCustomComparator>
class HighLevelSorter { /* ... */ };

template <typename T>
struct HighLevelCustomComparator
{
    // Total number of bits in the key, or 0 for a variable number
    static int bits;

    // Compute the bucket for an item on the pass covering bits first_bit..last_bit
    template<int first_bit, int last_bit>  static int bucket (T item);

    // Bucket used for items that don't need any more sorting, e.g. 0 for strings
    template<int first_bit, int last_bit>  static int final_bucket();
};

Finally, the highest-level code should implement sorting of standard C++ types by constructing a HighLevelCustomComparator from the type, and should give users tools to construct a comparator, e.g. for selected fields of a struct.
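To make the proposed low-level interface concrete, here is a minimal sketch of a comparator for `std::uint32_t` keys (four passes of 256 buckets, most significant byte first) together with a hypothetical driver that calls it with one stable counting-sort pass per digit, least significant first (LSD). `U32Comparator`, `lsd_pass`, and `lsd_sort` are illustrative names, not an existing API; `passes` is made `static constexpr` so the driver can expand the passes at compile time, and `final_bucket` is omitted because it only matters for variable-length keys such as strings.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical comparator following the proposed low-level interface.
struct U32Comparator {
    static constexpr int passes = 4;
    template <int pass> static constexpr int buckets() { return 256; }
    template <int pass> static int bucket(std::uint32_t item) {
        // Pass 0 looks at the most significant byte, pass 3 at the least.
        return static_cast<int>((item >> (24 - 8 * pass)) & 0xFFu);
    }
};

// One stable counting-sort pass driven by Cmp's bucket<pass>().
template <class Cmp, int pass, typename T>
void lsd_pass(std::vector<T>& data, std::vector<T>& scratch) {
    std::vector<std::size_t> offsets(
        static_cast<std::size_t>(Cmp::template buckets<pass>()) + 1, 0);
    for (const T& x : data) ++offsets[Cmp::template bucket<pass>(x) + 1];
    for (std::size_t b = 1; b < offsets.size(); ++b) offsets[b] += offsets[b - 1];
    assert(offsets.back() == data.size());
    for (const T& x : data) scratch[offsets[Cmp::template bucket<pass>(x)]++] = x;
    data.swap(scratch);
}

template <class Cmp, typename T, int... Pass>
void lsd_sort_impl(std::vector<T>& data, std::integer_sequence<int, Pass...>) {
    std::vector<T> scratch(data.size());
    // Run pass (passes - 1) first and pass 0 last, i.e. LSD order.
    (lsd_pass<Cmp, Cmp::passes - 1 - Pass>(data, scratch), ...);
}

template <class Cmp, typename T>
void lsd_sort(std::vector<T>& data) {
    lsd_sort_impl<Cmp>(data, std::make_integer_sequence<int, Cmp::passes>{});
}
```

Usage: `lsd_sort<U32Comparator>(keys);` for a `std::vector<std::uint32_t> keys`. An MSD driver would read the same comparator in pass order instead, which is where `final_bucket` would come in.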

ska_sort should reject non-IEEE 754 floating points

If I'm not mistaken, the bit tricks used to reinterpret floating-point numbers as integers are specific to the IEEE 754 representation, which the C++ standard doesn't guarantee. Instead of blindly accepting every standard floating-point type, you should guard the sort at compile time with std::numeric_limits<T>::is_iec559 so that it rejects floating-point types that aren't IEEE 754 compliant.

Additionally, you're using type punning through a union between an integral type and a floating-point type, which is undefined behaviour in C++. A standard-compliant solution would be to use std::memcpy instead.
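A minimal sketch of both suggestions for `float`, using a common order-preserving key transform (not necessarily the exact one ska_sort uses): a `static_assert` on `is_iec559` rejects non-IEEE types at compile time, and `std::memcpy` replaces the union (C++20's `std::bit_cast` would also work). The name `float_to_radix_key` is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>

// Maps a float to a uint32_t whose unsigned order matches the float order
// for non-NaN values (-0.0f maps just below +0.0f).
inline std::uint32_t float_to_radix_key(float f)
{
    static_assert(std::numeric_limits<float>::is_iec559,
                  "bit tricks assume IEEE 754 binary32 floats");
    static_assert(sizeof(float) == sizeof(std::uint32_t), "unexpected float size");
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);  // well-defined, unlike union punning
    // Negative values: flip all bits so larger magnitudes sort lower.
    // Non-negative values: set the sign bit so they sort above all negatives.
    return (bits & 0x80000000u) ? ~bits : (bits | 0x80000000u);
}
```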

Please add a readme.md with a link to a description of the sort

Idea: ska_sort based `nth_element`

So I came across a series of blog posts about finding the median value of an array (which is really just a special case of the selection problem). The last post describes a very simple and very fast technique: use counting sort, but after the counting pass, don't even bother swapping. Just look at which bucket falls in the middle of the array; that must be the median. His implementation of this "counting median" is pretty fast:

Then he laments:

Of course, we get a speed-up because we exploit a special case, where the number of different values is small compared to the length of the list. But hey, why not?

Oh... but is it really such a limited special case? ;)
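For reference, the counting-median idea can be sketched for byte-sized values. This is a hypothetical `counting_select` (not the blog author's code and not part of ska_sort): it builds a 256-entry histogram and walks it until the requested rank is covered, so no element is moved. For the median, pass `n = values.size() / 2`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the element of 0-based rank n without rearranging `values`.
inline std::uint8_t counting_select(const std::vector<std::uint8_t>& values, std::size_t n)
{
    assert(n < values.size());
    std::array<std::size_t, 256> count{};
    for (std::uint8_t v : values) ++count[v];
    std::size_t seen = 0;
    for (std::size_t b = 0; b < 256; ++b) {
        seen += count[b];
        if (n < seen) return static_cast<std::uint8_t>(b);  // rank n is in bucket b
    }
    return 0;  // unreachable when n < values.size()
}
```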

I bet you have already guessed what I have in mind: take ska_sort, but only recurse into the bucket whose range contains the median (passing along an offset for where the middle falls in the subrange). If you do that, it actually meets all the requirements of std::nth_element:

Rearranges the elements in the range [first,last), in such a way that the element at the nth position is the element that would be in that position in a sorted sequence.

The other elements are left without any specific order, except that none of the elements preceding nth are greater than it, and none of the elements following it are less.

With or without that last optimization, this should be blazing fast, even for long radix keys: for uniformly distributed values, the number of elements can be expected to shrink by a factor of 256 each time the function recurses. The worst-case input would be one where the bucket size shrinks by a minimal amount (I guess by just one element); that would degenerate into ska_sort, so it should still perform pretty well.

Since most implementations of nth_element use quickselect, a "radix select" like this should probably beat it easily. Even better, it would require relatively few changes to the existing ska_sort code!

I also thought of an even simpler and likely faster method: instead of swapping everything, only swap the values in the median range. To make this easier, swap them to the front. In pseudo-code:

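// digit(x): the byte of x examined at this level (hypothetical helper)
// median_bucket: the bucket whose range contains the median
std::size_t j = 0;  // next front slot for a median-range value
// Median-range values already at the front are simply swapped with themselves.
for(std::size_t i = 0; i < array_size && j < median_range_size; ++i){
  if(digit(array[i]) == median_bucket){
    std::swap(array[j++], array[i]);
  }
}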

Then we recurse over the median range, with an offset to indicate where the median falls in the subrange. As with the counting median approach above, we can skip the swapping at the deepest level of recursion and only do the counting part.

This should be even faster than the previous version, because it reduces the number of swaps to the bare minimum. The obvious downside is that it completely scrambles the order of the array.
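The "swap only the median range to the front" variant can be sketched as a radix select on 32-bit unsigned keys. This is a hypothetical `radix_select` (not part of ska_sort), assuming 8-bit digits taken from the most significant byte down: each level counts digits in the active prefix, picks the bucket containing rank `n`, moves only that bucket's elements to the front, and continues on that prefix with the rank adjusted. At the last level the histogram alone fixes the value, so no swapping is done there. It returns the value rather than arranging the array the way std::nth_element does.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Returns the value that would sit at index n if `a` were sorted.
// Scrambles the order of `a`. Requires n < a.size().
inline std::uint32_t radix_select(std::vector<std::uint32_t>& a, std::size_t n)
{
    assert(n < a.size());
    std::size_t size = a.size();  // active prefix is [0, size)
    std::uint32_t result = 0;     // digits fixed so far
    for (int shift = 24; shift >= 0; shift -= 8) {
        std::array<std::size_t, 256> count{};
        for (std::size_t i = 0; i < size; ++i) ++count[(a[i] >> shift) & 0xFFu];
        std::size_t b = 0;  // bucket containing rank n
        while (n >= count[b]) { n -= count[b]; ++b; }
        result |= static_cast<std::uint32_t>(b) << shift;
        if (shift == 0) break;  // deepest level: counting is enough
        // Move only bucket b's elements to the front, as in the loop above.
        std::size_t j = 0;
        for (std::size_t i = 0; i < size && j < count[b]; ++i) {
            if (((a[i] >> shift) & 0xFFu) == b) std::swap(a[j++], a[i]);
        }
        size = count[b];
    }
    return result;
}
```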

Make LSD radix sort faster with software write-combining

Naive code: https://github.com/Bulat-Ziganshin/MT-LZ/blob/master/RadixSort.cpp#L38
Speed - 19 MKeys/s: https://github.com/Bulat-Ziganshin/MT-LZ/blob/master/results.txt#L377

Optimized code: https://github.com/Bulat-Ziganshin/MT-LZ/blob/master/RadixSort.cpp#L51
Speed - 94 MKeys/s: https://github.com/Bulat-Ziganshin/MT-LZ/blob/master/results.txt#L329

Speed measured on 100M uniformly distributed 32-bit keys, sorted by 4 passes of a 256-bucket sort, using a single core of an i7-4770: https://github.com/Bulat-Ziganshin/MT-LZ/blob/master/TestRadixSort.cpp

The optimized code gathers data into 64-byte packets in a temporary buffer and then writes each whole packet into its output bucket. This improves TLB efficiency and avoids hardware write-combining, making the code five times faster (for my code on my computer)!
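A minimal sketch of the software write-combining idea for one LSD pass, not a copy of the linked RadixSort.cpp. `lsd_pass_swwc` is a hypothetical name, and it simplifies under stated assumptions: 32-bit keys, 256 buckets, 16 keys per 64-byte staging line, output positions not aligned to cache-line boundaries, and plain `std::memcpy` rather than non-temporal stores. It shows the structure, not the measured speedup.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr std::size_t kKeysPerLine = 64 / sizeof(std::uint32_t);  // 16 keys
struct alignas(64) StagingLine { std::uint32_t key[kKeysPerLine]; };

// One stable 256-bucket pass on the byte at `shift`, staging keys per bucket
// and copying a whole 64-byte line to the output once it fills up.
inline void lsd_pass_swwc(const std::vector<std::uint32_t>& in,
                          std::vector<std::uint32_t>& out, int shift)
{
    assert(out.size() == in.size());
    std::array<std::size_t, 256> pos{};
    for (std::uint32_t k : in) ++pos[(k >> shift) & 0xFFu];
    std::size_t sum = 0;
    for (std::size_t& p : pos) { std::size_t c = p; p = sum; sum += c; }

    std::vector<StagingLine> buf(256);
    std::array<std::size_t, 256> fill{};
    for (std::uint32_t k : in) {
        const std::size_t b = (k >> shift) & 0xFFu;
        buf[b].key[fill[b]++] = k;
        if (fill[b] == kKeysPerLine) {  // line full: write it out in one go
            std::memcpy(out.data() + pos[b], buf[b].key, sizeof(StagingLine));
            pos[b] += kKeysPerLine;
            fill[b] = 0;
        }
    }
    for (std::size_t b = 0; b < 256; ++b) {  // flush partially filled lines
        if (fill[b] != 0) {
            std::memcpy(out.data() + pos[b], buf[b].key, fill[b] * sizeof(std::uint32_t));
            pos[b] += fill[b];
        }
    }
}
```

A full sort runs four such passes (shift 0, 8, 16, 24), swapping the input and output vectors after each one.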

Implement almost in-place LSD radix sort

https://cs.stackexchange.com/questions/93563/fast-stable-almost-in-place-radix-and-merge-sorts

vs2015: class template has already been defined

I tried to use this library on a custom type, but it always fails to compile with the error below.

Here is my radix sort key function:

uint64_t to_radix_sort_key(const Command & c){
    return c.Key().FullKey();
}

And the call to ska_sort:

ska_sort(commands.begin(), commands.end());

And the full error:

ska_sort.hpp(1017): error C2953: 'detail::FallbackSubKey<T,std::enable_if<!std::is_same<void,unknown-type>::value,void>::type>': class template has already been defined
ska_sort.hpp(846): note: see declaration of 'detail::FallbackSubKey<T,std::enable_if<!std::is_same<void,unknown-type>::value,void>::type>'

If I remove one of these overloads, it does compile, but I see no speedup over std::sort. In fact, they appear to have the same runtime, so I suspect it is falling back to std::sort.

EDIT: I just tried sorting other data, such as a std::vector<uint32_t>, which I would not expect to need a custom sort key, and it produces the exact same error.

I'm rather baffled as to how anyone is using this library. Is this a VS2015-specific error?

I tried benchmarking ska_sort with various types (u32, float, u64) and I am not seeing any speedup: it always runs in the same time as std::sort. I believe there may be some issue with how this library works in VS2015 (possibly caused by the fact that I had to comment out one of those overloads to get anything to compile).

Warnings In code

VS2015 reports three warnings.

They are listed here:

Warning C4127 conditional expression is constant ska_sort.hpp 1119
Warning C4457 declaration of 'end' hides function parameter ska_sort.hpp 1123
Warning C4127 conditional expression is constant ska_sort.hpp 1179

Contributing to Boost.Sort

Hello. Nice work!

Do you want to contribute these algorithms to Boost.Sort? I can help with it as much as possible.
