Hashing. It's one of the fundamentals of Computer Science, but how do you compare the performance of a given hash function to another?
Here, you will find my attempts at solving that problem for multiple personal projects. I'm not (currently) in the business of defining new hashing algorithms, so this suite focuses on hash functions built by people smarter than me, and attempts to portray their behavior and performance characteristics.
Above all, this test suite strives to be flexible, enabling new test corpora to be integrated with minimal effort.
I hope you find this useful.
In order to gauge the fitness of a particular hash function, we can check simple uniformity, or how evenly the algorithm distributes output values. A hash function that always returns the value 42 is decidedly non-uniform.
Simple uniformity must be differentiated from random uniformity. For simple uniformity, we don't care how random the distribution appears, we just care that it is even. Random uniformity has important implications for cryptographically-secure hashing functions, but for simple hash tables (my original problem space) it may not be worth the effort.
Hashing ~235k English words over 64 bins (calculated as `H(s) % 64`), we can look at the scatter plot of bin clustering to get a gut feel for how uniform each hash is:
Yeesh. xor looks terrible, but that's not unexpected. The rest all seem to cluster around the middle (which is good!), so let's look at a consolidated boxplot to get a feel for quartile distribution:
As before, xor is off the charts, and makes it difficult to see the relative differences between the other, more serious hashing algorithms. If we drop the worst three (murmur3, spooky2, and xor), the picture gets a little clearer:
We also need to evaluate each hash algorithm according to how long it takes to perform hashing operations. To do that, we average the execution times of fifty runs of the hash function against each string in our training corpus. This set of per-operation durations is graphed below as a boxplot:
The y-axis is measured in nanoseconds.
Both SDBM and FNV1a have pretty dismal performance (75th percentile is no better than ~200ns/op and can get as high as ~300ns/op) compared to all the others. If we drop the worst four (fnv1a, sdbm, lookup3, and jenkins1), we can see more nuance between the top performers:
Incidentally, xor has the best timing performance. Given that it's a single assembly instruction per byte, that's not too surprising.
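For reference, the xor "hash" is just a byte-wise fold, something like this sketch; one XOR per input byte is why it times so well and bins so poorly:

```c
#include <stdint.h>

/* Fold every byte of the string into one byte with XOR.
 * Output range is only 0..255, so collisions are rampant. */
uint8_t xor_hash(const char *s) {
    uint8_t h = 0;
    for (; *s; s++)
        h ^= (uint8_t)*s;
    return h;
}
```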
Before I started playing with hash functions just for the hell of it, I found myself in search of a good hashing function for bolo-ng and TSDP qualified names.
Qualified Names look like this:
host=web01.example.com, metric=cpu, value=0.001, ts=1474055418
That is, a set of `key=value` pairs, separated by commas. This is definitely not English text, so let's re-run the fitness tests and timing benchmarks against a corpus of dummy qualified names.
(Side note: I wrote a small Perl script, madlibs, that let me easily generate hundreds of thousands of test names. :+1:)
Hashing 100k Qualified Names over 64 bins (again calculated as `H(s) % 64`), we can look at the scatter plot of bin clustering to get a gut feel of uniformity given our new inputs:
Not much has changed; xor is still terrible. Let's look at the box plots:
We knew xor was bad, but now we can also see that murmur2 is worse than the others, so let's drop those two out and hopefully get a better picture of the top performers:
With the longer strings in the Qualified Names corpus, let's see what happens with the performance. Again, we're going to average the execution times of fifty runs of the hash function against each string in the corpus. This set of per-operation durations is graphed below as a boxplot:
The y-axis is measured in nanoseconds.
It seems quite clear that we have three "bands" of performance; murmur3 and xor are the fastest, followed by djb2, k&r, sdbm and spooky2 in the mid-range, with fnv1a, jenkins1 and lookup3 bringing up the rear as the slowest.
Let's focus on the middle tier:
djb2 is the clear winner with the least variation and the quickest median per-hash execution speed at a little over 200ns.
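djb2 is also small enough to quote in full; a sketch of the classic Bernstein formulation (seed 5381, then `h = h * 33 + c` per byte):

```c
#include <stdint.h>

/* djb2 (Dan Bernstein): h starts at 5381; each byte folds in
 * as h = h * 33 + c. Multiply-by-33 compiles to shift-and-add. */
uint64_t djb2(const char *s) {
    uint64_t h = 5381;
    for (; *s; s++)
        h = h * 33 + (unsigned char)*s;
    return h;
}
```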
But why settle for the middle-tier when we have two candidates for super-ultra-fast:
xor sure is fast. But so is murmur3, clocking in at a little under 40ns, 5x faster than djb2.
For my purposes, djb2 and murmur3 are the two best options. Two best options? Isn't that a cop-out? Not really. I'll probably be going with djb2 when I need uniformity, and murmur3 for raw speed of hashing.