
Comments (19)

greg7mdp commented on May 12, 2024

Hi @fowles , thanks for the comment! I really enjoyed your CPPCON presentation on the swisstable, and indeed it was while thinking about it that I came up with the idea of the parallel hashmap. So thanks for that too!

I do indeed think your suggestion is an excellent one. Originally I was wary of using the high bits to select the subtable, because some users with integer keys may use a trivial pass-through hash function, in which case the high bits have a good chance of being all zeros (indeed, that is what the default boost hash does).

However, phmap currently adds entropy to all the bits of the hash value systematically, by multiplying that hash value with a large prime number (see phmap_mix). This has multiple advantages, especially if the hash function is not great. For example, when hashing pointers it is not necessary to shift them right a few bits to eliminate the bits that are always zero.

Because of this, I think we can be sure that the high order bits will have good entropy, and using these instead of the lower bits makes total sense. I'll test the change and provide an update here.

from parallel-hashmap.

greg7mdp commented on May 12, 2024

@fowles, did some experimentation. I couldn't hardcode h >> 56, since phmap can also be built in 32-bit mode, where sizeof(size_t) == 4.

However I tried something equivalent with my benchmark program (inserting unique random 32 bit integers) with the following definition:

static size_t subidx(size_t hashval) {
    constexpr uint32_t shift = sizeof(size_t) * 8 - 2 * N;
    return (hashval >> shift) & mask;
}

and it was actually a bit slower than with the previous indexing, ((h ^ (h >> 4)) & 0x0F), and about the same as the regular flat_hash_map:

[benchmark results image]

Then, I tried different shifts, and I did find a significant improvement with the following:

static size_t subidx(size_t hashval) {
    return ((hashval >> 8) ^ (hashval >> 16)) & mask;
}

[benchmark results image]

I am puzzled as to why:

  1. the base flat_hash_map is slightly slower than the parallel_flat_hash_map with the old subidx computation, even though the parallel version does more work.
  2. and, even more puzzling, how the new subidx computation can provide such a large improvement with keys that are already fairly random. Come to think of it, maybe the keys in my benchmark are not as random as I think; I use a generator of unique numbers.

In any case, I'll look into it some more, but at this point I do believe I'll change the subidx computation to use higher bits of the key as you suggested.

PS: unique number generator used in the benchmark


// --------------------------------------------------------------------------
//  from: https://github.com/preshing/RandomSequence
// --------------------------------------------------------------------------
class RSU
{
private:
    unsigned int m_index;
    unsigned int m_intermediateOffset;

    static unsigned int permuteQPR(unsigned int x)
    {
        static const unsigned int prime = 4294967291u;
        if (x >= prime)
            return x;  // The 5 integers out of range are mapped to themselves.
        unsigned int residue = ((unsigned long long) x * x) % prime;
        return (x <= prime / 2) ? residue : prime - residue;
    }

public:
    RSU(unsigned int seedBase, unsigned int seedOffset)
    {
        m_index = permuteQPR(permuteQPR(seedBase) + 0x682f0161);
        m_intermediateOffset = permuteQPR(permuteQPR(seedOffset) + 0x46790905);
    }

    unsigned int next()
    {
        return permuteQPR((permuteQPR(m_index++) + m_intermediateOffset) ^ 0x5bf03635);
    }
};


fowles commented on May 12, 2024

"the new subidx computation can possibly provide such a large improvement with already pretty random keys"

Even given random keys, you are effectively partitioning the hashes into 16 buckets. If you imagine using h1 & 0x0F for your secondary selection, you are clearly giving each partition a very biased view of the hash space. The fact that you are doing something a bit more complex doesn't alter the fact that the partition is a reduction in entropy (it is just harder to write the closed form of what it is). The idea is that you want to use either

  • a different hash function to compute the bucket
  • a portion of the existing hash that you are not already using (hence functionally a different hash function).
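A tiny illustration of this point, with the 16-way split and both selector functions assumed for concreteness (neither is phmap's actual code): if the subtable index reuses bits the table itself probes with, each subtable sees hashes that all agree on those bits.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

// For hashes 0..2^20-1, route each hash to one of 16 subtables, then ask:
// how many distinct low nibbles (the bits an in-table probe might reuse)
// does subtable 0 actually see?
int distinct_low_nibbles_in_subtable0(bool use_disjoint_bits) {
    std::bitset<16> seen;
    for (uint32_t h = 0; h < (1u << 20); ++h) {
        uint32_t idx = use_disjoint_bits ? (h >> 16) & 0x0F   // bits 16..19
                                         : h & 0x0F;          // bits 0..3
        if (idx == 0)
            seen.set(h & 0x0F);  // record which low nibbles reached subtable 0
    }
    return static_cast<int>(seen.count());
}
```

Selecting on the low bits leaves subtable 0 with exactly one possible low nibble; selecting on bits the probe does not use leaves all 16 available, i.e. no entropy is lost from the bits the table consumes.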

We have found these sorts of bugs very hard to track down; they are the main reason we added live table sampling to swisstable. https://github.com/abseil/abseil-cpp/blob/master/absl/container/internal/hashtablez_sampler.h

My 2019 cppcon talk goes into some of this in more detail, but I promise there are more obscure cases with large performance issues that just didn't condense well into a talk.


greg7mdp commented on May 12, 2024

I'll check out your 2019 cppcon talk.
In the spirit of using even more of the high bits, I tried:

((hashval >> 8) ^ (hashval >> 16) ^ (hashval >> 24)) & mask;

instead of:

((hashval >> 8) ^ (hashval >> 16)) & mask;

and the results were exactly the same. So, on the principle that this adds more entropy to the subtable choice, I'll keep the new 3-shift version.

Thanks again for the suggestion. Because of the hash mixing I do, I think it is unlikely you will find obscure cases with large performance issues in phmap, as long as the hashed values are mostly unique. Can you prove me wrong? :-)


fowles commented on May 12, 2024

Yeah, try the following:

phset<int64> foo = // fill this with N things
std::vector<int64> bad_order(foo.begin(), foo.end());
std::vector<int64> good_order(foo.begin(), foo.end());
std::shuffle(good_order.begin(), good_order.end(), std::mt19937{std::random_device{}()});
bad_order.resize(N / 4);
good_order.resize(N / 4);
phset<int64> bad(bad_order.begin(), bad_order.end());
phset<int64> good(good_order.begin(), good_order.end());

I am fairly confident that good will outperform bad for insertions and finds by a significant margin.
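For reference, a compilable rendition of this experiment might look like the following, with std::unordered_set standing in for phset and N chosen arbitrarily (both are assumptions; the ordering effect being tested is specific to swisstable-style tables like phmap's):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <random>
#include <unordered_set>
#include <utility>
#include <vector>

// Build two key sequences from the same source set: one in the set's own
// iteration order ("bad"), one shuffled ("good"), keep the first quarter of
// each, and construct fresh sets from those prefixes.
std::pair<std::unordered_set<int64_t>, std::unordered_set<int64_t>>
build_good_and_bad(size_t n) {
    std::unordered_set<int64_t> foo;
    std::mt19937_64 rng(42);  // arbitrary seed
    while (foo.size() < n)
        foo.insert(static_cast<int64_t>(rng()));

    std::vector<int64_t> bad_order(foo.begin(), foo.end());
    std::vector<int64_t> good_order(bad_order);
    std::shuffle(good_order.begin(), good_order.end(), std::mt19937{123});

    bad_order.resize(n / 4);
    good_order.resize(n / 4);

    return {std::unordered_set<int64_t>(good_order.begin(), good_order.end()),
            std::unordered_set<int64_t>(bad_order.begin(), bad_order.end())};
}
```

Timing the two constructions (e.g. with std::chrono) is where the predicted difference would show up.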


greg7mdp commented on May 12, 2024

I'll give it a try, but I'd be surprised.


greg7mdp commented on May 12, 2024

Closing the issue, please reopen if needed.


fowles commented on May 12, 2024

I was kinda curious what the results from my last comment were...


greg7mdp commented on May 12, 2024

Ha, I didn't try it yet. I'm mildly curious as well. I'll try to look into it later today.


greg7mdp commented on May 12, 2024

@fowles you were right, the shuffled list of values is significantly faster to insert than the non-shuffled one, probably because the values remaining after the vector resize have more entropy (i.e., fewer identical high-order bits).

[benchmark results image]

Interestingly, the parallel flat_set is significantly faster than the non-parallel version for the non-shuffled keys.
I checked in the test program (matt.cc) if you want to look at it.


greg7mdp commented on May 12, 2024

@fowles When increasing the number of submaps, the parallel hashmap is almost as fast for both sets of keys:
[benchmark results image]


greg7mdp commented on May 12, 2024

@fowles I am still surprised by the result you correctly predicted. I understand that without the shuffle the inserted values have less entropy, but they are still all distinct, and in my flat_hash_map I always do a mixing step: I multiply the provided hash value by a large random prime using umul128, then add the lower and upper 64 bits of the result to produce the hash value I actually use. I would have thought this would increase the entropy as much as shuffling the list does. Clearly I was wrong. I am also not quite understanding why the parallel hash sets help so much.
BTW I just watched your cppcon 2019 video... it is great!
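The mixing step described above can be sketched roughly as follows. This is a hedged approximation: the multiplier constant is an assumption (not necessarily the prime phmap uses), and unsigned __int128 (a GCC/Clang extension) stands in for umul128:

```cpp
#include <cassert>
#include <cstdint>

// Multiply the incoming 64-bit hash by a large odd constant, keeping the
// full 128-bit product, then add the low and high 64-bit halves together
// so every input bit influences the result.
inline uint64_t mix_hash(uint64_t h) {
    const unsigned __int128 kMul = 0xde5fb9d2630458e9ull;  // assumed constant
    unsigned __int128 m = static_cast<unsigned __int128>(h) * kMul;
    return static_cast<uint64_t>(m) + static_cast<uint64_t>(m >> 64);
}
```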


greg7mdp commented on May 12, 2024

Actually, the difference does not come from the entropy difference in the keys, but from the fact that in the slower case the keys are inserted in sorted order. I don't know why inserting sorted keys would be slower. I will check in an updated test program that re-randomizes the keys after the resize and shows similar times.


fowles commented on May 12, 2024

Recall that we iterate post-mixing and that your tables don't have per-instance randomization. As such, all of the keys will have low entropy post-mix, which is why you see this performance difference. I am unclear on why the parallel variant is better, but I suspect it has something to do with taking all the data from each subtable in turn, so you still get mixed data within each subtable; as a result the parallel table behaves like a smaller non-parallel table.


greg7mdp commented on May 12, 2024

I don't quite follow. Also, as I mentioned in my last comment, the speed difference did not come from selecting the keys with lower values, but from the insertion being done in sorted order. My latest test does a post-resize shuffle, and there is barely any speed difference anymore.


fowles commented on May 12, 2024

I have no theory as to why re-randomizing returns it to the old performance. That is a super interesting finding...


greg7mdp commented on May 12, 2024

I have more puzzling findings.
First, when not shuffled, the keys in the vector were not sorted, but were in exactly the order in which a flat_hash_set iterates them.
So insertion into a flat_hash_set is slower when the keys arrive in the order they will finally occupy.
Of course there are many internal resizes, so I tried using reserve(order.size()). With the reserve, insertion of the non-shuffled keys is super fast (understandably, since we access memory in ascending address order). However, insertion of the shuffled keys is then slower than without the reserve! What the heck!

I also tried with sorted keys, inserting the randomly generated keys into a phmap::btree_set<uint64_t>. In that case, whether we shuffle or not doesn't seem to make a significant difference.


greg7mdp commented on May 12, 2024

@fowles I looked into it some more and I understand what is happening. It indeed points to a potential flaw of the raw_hash_map design, related to its power-of-two array sizes.

  1. the time spent during insertion is mostly spent resizing the table and reinserting the existing values; adding a reserve() call makes the slowdown disappear.
  2. when a reserve() call is added (no resizing during insertion), inserting the non-shuffled values (which are exactly in the storage order of the source flat_hash_map) is significantly faster, presumably because the table is populated in ascending memory order.
  3. without the reserve() call, there is an issue with the resize, described below.

When we resize, at an occupancy of around 87%, we allocate an array twice the size and reinsert the existing values. If the new array is divided into 4 successive quadrants, the reinserted values go into the 1st and 3rd quadrants (because we now consider one extra bit of H1: if that bit is 0 the value goes to the 1st quadrant, and if it is 1 the value goes to the 3rd quadrant).

So now we have a hash map with ~87% occupancy in the 1st and 3rd quadrants, and ~0% occupancy in the 2nd and 4th quadrants. This in itself might be a performance issue.

But it is compounded when the values are inserted in the 'final' flat_hash_map order, because the new values will again have H1 values that direct them to the beginnings of the 1st and 3rd quadrants, which are already 87% full, leading to many collisions before an empty slot is found.


greg7mdp commented on May 12, 2024

Hmm, I think I was tired when I wrote the quadrant explanation above. I still need to investigate what exactly happens.

