
Diverting Fast Radix, an incredibly fast algorithm for sorting fixed-digit data.

License: MIT License

C++ 95.40% Shell 0.01% Makefile 0.78% C 3.81%

dfr's Introduction

ThielSort

A competitive implementation of DFR, written by Larry Thiel. To build it, clone this repo, then run:

cd ThielSort/dfr
make CannedExtras=July18SimdExtras timing

Then run it with

./dfrOpt -n 1000000000 -s -r Uniform
./dfrOpt -n 1000000000 -s -r Normal 1 9223370000000000000 2305840000000000000
./dfrOpt -n 1000000000 -s -r Normal 1 4294970000 1073740000

These run it on one billion values for Uniform data, a wide normal distribution, and a narrow normal distribution, respectively. The "1" tells it to use a random seed (you could put it after Uniform too); alternatively, you can supply an explicit seed for the random generation.

dfr


So, you want to sort things faster because you've realized that, under the hood, whatever you're doing is spending way too much time sorting! If your data is even remotely fixed-digit, what we've got here should do the trick. I'll try to make it more user-friendly over time. Sorry that the codebase is absolutely awful right now... but I've been talking a big game for a few years now, and I think I gotta put my money where my mouth is.

Seeing it run in a simple case

git clone https://github.com/ramou/dfr.git
cd dfr
make timing
./perform 1000000

This'll give a ton of debug data about where diversion happened, which passes took how long, and how long it took the standard sort to do the same job. If you're super nice, you'll run this on standalone machines for input sizes from 10^2 through 10^9 and mail me the results, along with /proc/cpuinfo, lscpu, or something cool like that. What's neat is that I've seen some crazy variation from architecture to architecture, which opens the door to the makefile doing a make install that squeezes out some crazy extra speed (but only in special cases).

So, how fast is it?

Check out the times we've recorded. The chart is on a log scale, so there's also a link to the raw data if you want the exact numbers.

How to use it

Until I do a proper make install (#4), just make sure you include the fr.hpp file, which contains all the real stuff.

template <typename INT, typename ELEM>
void dfr(ELEM *source, auto length)

Just call dfr as a templated function with the type INT (which must be the first field) and the type of the overall in-memory object. I currently don't support passing lambdas to get at the fixed-digit key, but #1 is about adding that eventually.

If you're sorting an array of uint64_t called values that is of length length, you would just call

dfr<uint64_t, uint64_t>(values, length)

If you were sorting a bunch of records of type ELEM whose keys were uint64_t (with the key being the first field of ELEM) and whose payload was some other thing, you would call

dfr<uint64_t, ELEM>(values, length)

If things go badly or you'd like me to make it more convenient for you in some way, I'm probably open to making concessions just to get this adopted, so feel free to throw in your two cents.

There are some constants that can yield some improvement via tweaking in the code. #5 is about me determining that during make and #4 is about setting up make install so this is even more practical.

Other folks in the game you should check out

There are two codebases I consider important to be aware of, both MSD Radix Sorts (I'm in the LSD camp), but both critically relevant. My current codebase is competitive with RADULS2 in the 10s-100s of millions to billions range, so I'll need to squeeze out a few more technical improvements if I want to beat them without non-temporal writes. For smaller inputs (and particularly non-uniform inputs) I start coming out ahead (and will be getting even better). When I say I run faster, I mean A LOT faster. Basically, in the cases where you don't get much from the non-temporal writes, I'm massacring them, because LSD Radix Sorts are better, and now I've proven they can divert too (there is an actual proof; I wish I wrote faster/better)! Ska Sort is never competitive with Diverting Fast Radix, but that's not why it's important to recognize. They do a lot of great stuff that I can learn from. Nobody wants faster unusable code, and studying Ska Sort will make my code more usable.

Raduls

@marekkokot was friendly, fast and informative in getting back to me and gave me a real competitor to chase. I've specifically avoided non-temporal writes to show that my stuff is competitive without shitting up my code, but he's right to use it, and some day I'll convince a Master's student or someone to add that so I don't need to get my hands dirty... if my dad doesn't get to it first. But hey, I'm also open to collaborating to get it done to appropriately give credit where it is due... I also think you can apply some Fast-Radix-y things to speed up RADULS, but that's just a peripheral thought right now.

RADULS supports multithreading, because as an MSD Radix Sort it can; DFR does not. I think the general-purpose advantage of this is limited because of the multithreading overhead. If you need to sort a million lists, sort each with a single thread and don't worry about unnecessary coordination. If you need to sort one single list really fast, then multithreading becomes way more important.

https://github.com/refresh-bio/RADULS

Ska Sort

@skarupke has a really neat implementation of a MSD Radix Sort. I think more people should give it the time. My code is easier to use than that RADULS code, but effort has been made to make Ska Sort readily usable in a broad array of cases and I aspire to systematically add the same flexibility to the use of Diverting Fast Radix. A lot of thought has been put into details that I personally haven't cared about, but which I recognize to be really relevant... I hope also to unload much of this on a Master's student as well :D

https://github.com/skarupke/ska_sort

If you use my code...

Ok, you don't have to. I still gotta set the license up properly, but I'm pretty down with MIT. However, if you're using it in a publication, you can reference what I did with a very old version that isn't even half as fast (literally isn't even half as fast): https://dl.acm.org/citation.cfm?id=2938554

@inproceedings{Thiel:2016:IGL:2938503.2938554,
 author = {Thiel, Stuart and Butler, Greg and Thiel, Larry},
 title = {Improving GraphChi for Large Graph Processing: Fast Radix Sort in Pre-Processing},
 booktitle = {Proceedings of the 20th International Database Engineering \& Applications Symposium},
 series = {IDEAS '16},
 year = {2016},
 isbn = {978-1-4503-4118-9},
 location = {Montreal, QC, Canada},
 pages = {135--141},
 numpages = {7},
 url = {http://doi.acm.org/10.1145/2938503.2938554},
 doi = {10.1145/2938503.2938554},
 acmid = {2938554},
 publisher = {ACM},
 address = {New York, NY, USA},
 keywords = {algorithm, analytics, big data, graph processing, radix sort},
} 

Yes, I'm working with my dad. Yes, it's super cool!

dfr's People

Contributors: ramou, larry-thiel

dfr's Issues

Running make should adjust constants to the current system

There are a bunch of constants: when to divert to where, and such. These things can be trivially determined in our makefile and set to whatever is appropriate for the current system. We can go even further than that, but at a minimum we should be considering it.

Add proper sampling for estimating counts

We know how to do it. We've diced up the problem, done a bunch of the proofs and we've even written out how we'll do it. Let's do it and squeeze out that extra boost in performance when someone's sorting data that isn't uniformly distributed. It ain't that hard and we're already paying the cost for start/end counts because we knew we'd do this eventually.

That said, when I do this is it worth having an explicit uniform distribution version that doesn't do the start/end to squeeze out that tiny improvement from replacing a memory lookup with some basic arithmetic? I'll decide when I actually do this ticket.

Support passing lambdas to dfr for pulling out the fixed-digit key.

What I'll do is use up yet more memory to convert the array of pointers to an array of ELEM whose first part is the fixed-digit key and the second part is the original pointer. I'll then dfr that like normal, and finish by copying the pointers back in the sorted order.

This is an effort to make it easier for people who want to do that without just making them change how they store their data, which is a bit uncool for me to do.

Diversion to Ska Sort instead of std::sort

Looking at the performance data, I realize that I could squeeze out some good performance by using Ska Sort for buckets that are too big to get my insertion sort, but not big enough that it would be profitable to LSD the rest of it. Ska Sort shines at those ranges.

Have makefile figure out whether we're Big Endian or Little Endian

We do a thing that is a nice improvement that I'll attribute to Nils Pipenbrinck, who I haven't found on GitHub, but who is on twitter (torusle). He did it nicely and sold me on the idea, but it makes use of reinterpret_cast and you really need to know the endian nature or it'll be all sorts of awful.

I've also heard it suggested that this check be performed on the data at run-time. I don't know what to make of that, but I may make a way to call it with the endian nature explicit and let the make set the default to what it finds native for unsigned ints.

Bug in Insertion Sort causing intermittent failure to sort

It looks like when I added an optimization some time back, I made a horrible assumption. I believe that if that insertion sort implementation runs on a sequence that has a smallest value after DT (but before length), then it'll get inserted before the beginning of the list and we're hosed.

Nothing like a year+ perspective to make flaws obvious.

./perform 10000 does nothing

Hello
I tried to run your code.
I cloned it, then ran make perform, then ./perform 10000, and nothing happens.


Output of lscpu

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
CPU(s):                          4
On-line CPU(s) list:             0-3
Thread(s) per core:              1
Core(s) per socket:              4
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           60
Model name:                      Intel(R) Core(TM) i5-4690K CPU @ 3.50GHz
Stepping:                        3
CPU MHz:                         3780.758
CPU max MHz:                     3900.0000
CPU min MHz:                     800.0000
BogoMIPS:                        6983.91
Virtualization:                  VT-x
L1d cache:                       32K
L1i cache:                       32K
L2 cache:                        256K
L3 cache:                        6144K
NUMA node0 CPU(s):               0-3
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts md_clear flush_l1d
