
Diverting Fast Radix, an incredibly fast algorithm for sorting fixed-digit data.

License: MIT License

C++ 95.40% Shell 0.01% Makefile 0.78% C 3.81%

dfr's Introduction

ThielSort

A competitive implementation of DFR, written by Larry Thiel. To build it, clone this repo, then run:

cd ThielSort/dfr
make CannedExtras=July18SimdExtras timing

Then run it with

./dfrOpt -n 1000000000 -s -r Uniform
./dfrOpt -n 1000000000 -s -r Normal 1 9223370000000000000 2305840000000000000
./dfrOpt -n 1000000000 -s -r Normal 1 4294970000 1073740000

These run it on one billion values for Uniform data, a wide normal distribution, and a narrow normal distribution, respectively. The "1" tells it to use a random seed (you could put it after Uniform too); alternatively, you can supply an explicit seed for the random generation.

dfr


So, you want to sort things faster because you've realized that, under the hood, whatever you're doing is spending way too much time sorting! If your data is even remotely fixed-digit, what we've got here should do the trick. I'll try to make it more user-friendly over time. Sorry that the codebase is absolutely awful right now... but I've been talking a big game for a few years now, and I think I gotta put my money where my mouth is.

Seeing it run in a simple case

git clone https://github.com/ramou/dfr.git
cd dfr
make timing
./perform 1000000

This'll give a ton of debug data about where diversion happened, which passes took how long, and how long it took the standard sort to do the same job. If you're super nice, you'll run this on standalone machines for input sizes from 10^2 through 10^9 and mail me the results, along with /proc/cpuinfo, lscpu, or something cool like that. What's neat is that I've seen some crazy variation from architecture to architecture, which opens the door to the makefile doing a make install that squeezes out some crazy extra speed (but only in special cases).

So, how fast is it?

Check out the times we've recorded. The chart is on a log scale, so there's also a link to the raw data if you want the exact numbers.

How to use it

Until I do a proper make install (#4), just make sure you include the fr.hpp file, which contains all the real stuff.

template <typename INT, typename ELEM>
void dfr(ELEM *source, auto length)

Just call dfr as a templated function with the type INT (which must be the first field) and the type of the overall in-memory object. I currently don't support passing lambdas to get at the fixed-digit key, but #1 is about adding that eventually.

If you're sorting an array of uint64_t called values that is of length length, you would just call

dfr<uint64_t, uint64_t>(values, length)

If you were sorting a bunch of records of type ELEM whose keys were uint64_t (with the key being the first field of ELEM) and whose payload was some other thing, you would call

dfr<uint64_t, ELEM>(values, length)

If things go badly or you'd like me to make it more convenient for you in some way, I'm probably open to making concessions just to get this adopted, so feel free to throw in your two cents.

There are some constants that can yield some improvement via tweaking in the code. #5 is about me determining that during make and #4 is about setting up make install so this is even more practical.

Other folks in the game you should check out

There are two codebases I consider important to be aware of, both MSD Radix Sorts (I'm in the LSD camp), but both critically relevant. My current codebase is competitive with RADULS2 in the 10s-100s of millions to billions range, so I'll need to squeeze out a few more technical improvements if I want to beat them without non-temporal writes. For smaller inputs (and particularly non-uniform inputs) I start coming out ahead (and will be getting even better). When I say I run faster, I mean A LOT faster. Basically, in the cases where you don't get much from the non-temporal writes, I'm massacring them, because LSD Radix Sorts are better, and now I've proven they can divert too (there is an actual proof; I wish I wrote faster/better)! Ska Sort is never competitive with Diverting Fast Radix, but that's not why it's important to recognize. They do a lot of great stuff that I can learn from. Nobody wants faster unusable code, and studying Ska Sort will make my code more usable.

Raduls

@marekkokot was friendly, fast and informative in getting back to me and gave me a real competitor to chase. I've specifically avoided non-temporal writes to show that my stuff is competitive without shitting up my code, but he's right to use it, and some day I'll convince a Master's student or someone to add that so I don't need to get my hands dirty... if my dad doesn't get to it first. But hey, I'm also open to collaborating to get it done to appropriately give credit where it is due... I also think you can apply some Fast-Radix-y things to speed up RADULS, but that's just a peripheral thought right now.

RADULS supports multithreading, because as an MSD Radix Sort it can; DFR does not. I think the general-purpose advantage of this is limited because of the multithreading overhead. If you need to sort a million lists, sort each with a single thread and don't worry about unnecessary coordination. If you need to sort one single list really fast, then multithreading becomes way more important.

https://github.com/refresh-bio/RADULS

Ska Sort

@skarupke has a really neat implementation of a MSD Radix Sort. I think more people should give it the time. My code is easier to use than that RADULS code, but effort has been made to make Ska Sort readily usable in a broad array of cases and I aspire to systematically add the same flexibility to the use of Diverting Fast Radix. A lot of thought has been put into details that I personally haven't cared about, but which I recognize to be really relevant... I hope also to unload much of this on a Master's student as well :D

https://github.com/skarupke/ska_sort

If you use my code...

Ok, you don't have to. I still gotta set the license up properly, but I'm pretty down with MIT. However, if you're using it in a publication, you can reference what I did with a very old version that isn't even half as fast (literally isn't even half as fast): https://dl.acm.org/citation.cfm?id=2938554

@inproceedings{Thiel:2016:IGL:2938503.2938554,
 author = {Thiel, Stuart and Butler, Greg and Thiel, Larry},
 title = {Improving GraphChi for Large Graph Processing: Fast Radix Sort in Pre-Processing},
 booktitle = {Proceedings of the 20th International Database Engineering \& Applications Symposium},
 series = {IDEAS '16},
 year = {2016},
 isbn = {978-1-4503-4118-9},
 location = {Montreal, QC, Canada},
 pages = {135--141},
 numpages = {7},
 url = {http://doi.acm.org/10.1145/2938503.2938554},
 doi = {10.1145/2938503.2938554},
 acmid = {2938554},
 publisher = {ACM},
 address = {New York, NY, USA},
 keywords = {algorithm, analytics, big data, graph processing, radix sort},
} 

Yes, I'm working with my dad. Yes, it's super cool!

dfr's People

Contributors: ramou, larry-thiel

dfr's Issues

Running make should adjust constants to the current system

There are a bunch of constants: when to divert to where, and such. These things can be trivially determined in our makefile and set to whatever is appropriate for the current system. We can go even further than that, but at a minimum we should be considering it.

Add proper sampling for estimating counts

We know how to do it. We've diced up the problem, done a bunch of the proofs and we've even written out how we'll do it. Let's do it and squeeze out that extra boost in performance when someone's sorting data that isn't uniformly distributed. It ain't that hard and we're already paying the cost for start/end counts because we knew we'd do this eventually.

That said, when I do this is it worth having an explicit uniform distribution version that doesn't do the start/end to squeeze out that tiny improvement from replacing a memory lookup with some basic arithmetic? I'll decide when I actually do this ticket.

Support passing lambdas to dfr for pulling out the fixed-digit key.

What I'll do is use up yet more memory to convert the array of pointers to an array of ELEM whose first part is the fixed-digit key and the second part is the original pointer. I'll then dfr that like normal, and finish by copying the pointers back in the sorted order.

This is an effort to make it easier for people who want to do that without just making them change how they store their data, which is a bit uncool for me to do.

Diversion to Ska Sort instead of std::sort

Looking at the performance data, I realize that I could squeeze out some good performance by using Ska Sort for buckets that are too big to get my insertion sort, but not big enough that it would be profitable to LSD the rest of it. Ska Sort shines at those ranges.

Have makefile figure out whether we're Big Endian or Little Endian

We do a thing that is a nice improvement that I'll attribute to Nils Pipenbrinck, who I haven't found on GitHub, but who is on twitter (torusle). He did it nicely and sold me on the idea, but it makes use of reinterpret_cast and you really need to know the endian nature or it'll be all sorts of awful.

I've also heard it suggested that this check be performed on the data at run-time. I don't know what to make of that, but I may make a way to call it with the endian nature explicit and let the make set the default to what it finds native for unsigned ints.

Bug in Insertion Sort causing intermittent failure to sort

It looks like when I added an optimization some time back, I made a horrible assumption. I believe that if that insertion sort implementation runs on a sequence that has a smallest value after DT (but before length), then it'll get inserted before the beginning of the list and we're hosed.

Nothing like a year+ perspective to make flaws obvious.

./perform 10000 does nothing

Hello
I tried to run your code.
I cloned it, then ran make perform, then ./perform 10000, and nothing happens.


Output of lscpu

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
CPU(s):                          4
On-line CPU(s) list:             0-3
Thread(s) per core:              1
Core(s) per socket:              4
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           60
Model name:                      Intel(R) Core(TM) i5-4690K CPU @ 3.50GHz
Stepping:                        3
CPU MHz:                         3780.758
CPU max MHz:                     3900.0000
CPU min MHz:                     800.0000
BogoMIPS:                        6983.91
Virtualization:                  VT-x
L1d cache:                       32K
L1i cache:                       32K
L2 cache:                        256K
L3 cache:                        6144K
NUMA node0 CPU(s):               0-3
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts md_clear flush_l1d
