QC Hash

Extremely fast unordered map and set library for C++20

Implementation Specialization
RawMap & RawSet
TODO

Implementation Specialization

How do you write a faster hash map than Google?

Specialize.

Hash map optimization is all about memory layout and data. The size and alignment of elements, the meta data kept, how it's all layed out in memory - all critical factors when it comes to fast and efficient design.

For this reason, one single hash map implementation will never be the most optimal for every kind of data. The ideal optimizations for string keys will not be the same as for integers, which in turn will not be the same as for huge compound types. This is evident in the stratification of existing hash map libraries, such as Facebook's Folly, which is comprised of three different implementations suited for different needs, plus one "fast" variant that just serves as a compile-time switch for the others.

We have chosen to take a similar approach. Two or three different implementations tailored to a certain family of data, plus a "generic" wrapper that picks one at compile time based on key and value types.

For the time being, only one implementation is complete, RawMap/RawSet, which specializes in small keys that are uniquely representable, a concept described below. In time, at least one additional implementation will be added that is specialized for strings and any other keys not already covered. At that time the generic wrapper will also be added.

RawMap & RawSet

The goal of qc::hash::RawMap and qc::hash::RawSet is to provide the fastest possible hash map/set implementation for keys of uniquely representable type, especially those that fit within a word.

What Is a "Uniquely Representable" Type?

A uniquely representable type has exactly one unique binary representation for each possible value of the type. The standard type trait std::has_unique_object_representations nearly defines this property, but the type need not be trivially copyable.

Common types that are uniquely representable

Signed integers
Unsigned integers
Enums
Pointers
std::unique_ptr
std::shared_ptr

Additionally, any compound type comprised exclusively of uniquely representable constituents is itself uniquely representable, e.g. std::pair<int, int>.

Common types that are NOT uniquely representable

std::string and std::string_view (value stored indirectly in heap, not in object binary)
Floating point numbers (+0 and -0 have different binary representations, also the mess that is NaN)
Any type that has virtual functions or base classes (v-table pointer may differ by object)

Which Key and Value Types are Fastest?

For keys, essentially any type that can be reinterpreted as an unsigned integer will perform best.

More exactly, any key type K that satisfies these three criteria:

sizeof(K) <= sizeof(uintmax_t)
sizeof(K) <= alignof(K)
sizeof(K) is a power of two

For values, small to medium sized types are ideal, let's say less than 64 bytes roughly.

The larger each element is, the more memory most of the map/set functions need to cover, increasing cache pressure and reducing overall speed.

Features

Single header file

Setup is super easy, simply copy and include qc-hash.hpp

Standards compliance

qc::hash::RawMap mirrors std::unordered_map
qc::hash::RawSet mirrors std::unordered_set
There are a few exceptions:
- No reference stability. Elements may be moved in memory on any rehashing operation
- No key_equal type. Due to how this implementation works, it is necessary for direct binary equality to be equivalent to key equality, thus user-provided equality functions are not allowed
- The erase(iterator) method does not return an iterator to the next element. This is a significant optimization in the general case of the next iterator not being needed. If the next iterator is needed, it can be obtained simply by incrementing the iterator to the erased element
- No bucket interface as this implementation is not bucket based

RawMap and RawSet share the same code

RawSet is simply an alias of RawMap with value type void
Compile-time checks and switches facilitate the minor differences with no run time cost

Heterogeneous lookup

Elements may be accessed using any key type compatable with the stored key type
For example, a set of std::unique_ptr<int> may be accessed using int *
The heterogeneity mechanism may be specialized for user defined types

Written in modern C++20

Takes full advantage of features such as perfect forwarding, constexpr if's, concepts, and more
Simpler template and compile-time control code for enhanced readability
Concepts and type contraints improve compiler error messages

Design

Open addressing

Elements are stored in a single, contiguous backing array
Allows lookups to happen directly, without indirection
Typically means only a single cache miss for "cold" data
Very fast for small elements, but performance does eventually drop off with increasing element size
Invalidates reference stability, see Standards Compliance

Linear probing

If a key is not found at first lookup, progresses forward through the slots until the key is found or a vacant slot is hit
Grave tokens are inserted when elements are erased, making erasure a O(1) operation

Circuity

The backing array is logically circular
When inserting an element, if it would "fall off" the end, we simply wrap-around to the beginning instead of growing
Allows for better saturation, even up to 100% if desired

Power of two capacity

Capacity starts as a power of two, and doubles when the map/set grows, thus always remaining a power of two
Allows the slot index of a key's hash to be found with a simple bitwise-and instead of an expensive integer modulo, a massive speedup
This increases the likelihood to suffer from collisions if the keys have power-of-two patterns. See the Identity hashing section for more info

No per-element meta data

Significant memory savings and performance gains
Sets this implementation apart from others
See the Meta-less Data section for an explanation

Additional Optimizations

Lazy allocation

New, empty maps/sets do not allocate memory
Backing memory is not allocated until first insertion

Load factor of 0.5

The map/set will grow upon exceeding 50% capacity
This significatly decreases the chance of collision and speeds up operations at the cost of greater memory usage

Identity hashing

The default hasher, qc::hash::IdentityHash, simply returns the lowest size_t's worth of the key
This is extremely fast for keys with decent low-order entropy
Pointers are right-shifted by the log2 of the pointee type's alignment to drop trailing zero bits
If low-order entropy is a concern, such as with keys expressing power-of-two patterns, an alteranative hasher is available. qc::hash::FastHash is very minimal, providing sufficient entropy while being as fast as possible
The user may provide their own hasher if desired

Meta-less Data

One of the biggest differentiators between hash map implementations is how they store element meta data. Some store it alongside each element, others store it in a separate array. Some store precomputed hashes, others store just a couple bits. Regardless of the strategy, meta data is necessary for a performant, fully featured hash map, right?

Wrong.

This implementation has no per-element meta data. How this works is explained below, but first, what are the repercussions?

Pros

Lowest memory usage - the only memory allocated is what is needed to store the elements at the desired load capacity
Memory locality - element operations only touch a single place in memory
Amazing insert, access, and erase performance due to better cache utilization and simpler program logic

Cons

Worse iteration performance than some other strategies
Only supports uniquely representable keys

How it works

Of all possible keys, two are considered "special". The first is the key with the binary representation 0b111...111, which is considered the "vacant" key. The second is the key with the binary representation 0b111...110, which is considered the "grave" key. If a key has any other binary representation it is considered a "normal" key.

The backing slots are split into three parts:

"Normal" slots: the largest part, these hold typical elements, grow when the map/set expands, and generally function as expected. Vacant slots contain the "vacant" special key, and grave slots contain the "grave" special key
"Special" slots: two slots that exist solely to store the two special elements should they be present. The first contains the "vacant" special key if it is vacant and the "grave" special key if it is present. The second contains the "grave" special key if it is vacant and the "vacant" special key if it is present
"Terminal" slots: two slots with key values of 0b000...000 which mark the end of the map/set

Together, the slots look something like: |N|N|N|...|N|N|N|G|E|0|0|, where N is a normal slot, G is the special slot for the grave key, E is the special slot for the vacant key, and 0 is a terminal slot

When an operation is performed with a normal key, only the normal slots are used, and things work as one would expect. When an operation is performed with a special key, only the corresponding special slot is used.

The two terminal slots remain constant over the lifetime of the map/set and exist solely to allow incrementing iterators to determine if and where they are at the end of the slots.

Benchmarks

--> Check out this very detailed spreadsheet with charts benchmarking all aspects of the implementation! <--

Benchmarks were made against the standard library implementation (MSVC) and five significant third-party libraries.

I highly recommend this writeup by Martin Ankerl on hash map benchmarks for a larger selection of libraries and much more in-depth analysis. The five comparison libraries I chose were the fastest on his list that didn't have ridiculous build requirements.

The following table shows approximately how many times faster qc::hash::RawSet is than each given library. Benchmark using 10,000,000 uint64_t elements on an x64 intel processor. See the above link for more details.

Library	Insertion	Access	Iteration	Erasure	Overall
MSVC 16.0 `std::unordered_set`	6.0x	2.9x	2.2x	6.7x	4.9x
Google Abseil's `absl::flat_hash_set`	3.0x	2.0x	0.4x	6.0x	3.2x
Martinus' `robin_hood::unordered_set`	1.7x	2.0x	0.4x	1.6x	1.5x
Skarupke's `ska::flat_hash_set`	1.7x	2.3x	0.9x	2.1x	1.8x
Tessil's `tsl::robin_set`	1.7x	2.3x	1.1x	2.6x	2.0x
Tessil's `tsl::sparse_hash_set`	5.9x	3.8x	0.4x	5.3x	4.4x

Note how this implementation is relatively slow at iteration. This is an unfortunate, if not unexpected, drawback. However the massive increase in insert, access, and erase speeds more than makes up for it.

Here is a small selection of the most notable charts from the spreadsheet:

TODO

Alternative implementation for strings and other larger/complex types
Allow user to request load capacity above 50%

daskie / qc-hash Goto Github PK

qc-hash's Introduction