Comments (14)

cipriancraciun commented on April 27, 2024

(The statements below assume that you use only a single 64-bit hash function.)

Granted, with an AES-based hashing algorithm, the chance that exactly two entries collide is negligible: about 1 in 2^64 for an ideal 64-bit hash (the "birthday effect" only matters once there are many keys).

However, quoting from the Wikipedia article on the birthday problem (the probability table section), it would take only about 200 million different keys to reach a 0.1% chance of a collision (these keys don't necessarily have to be stored all at the same time).
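
The figure quoted above can be sanity-checked with a short Go sketch. It assumes an ideal hash and uses the usual birthday-bound approximation n ≈ sqrt(2 · 2^bits · ln(1/(1−p))); the function name `keysForCollision` is hypothetical and only serves this illustration. For 64 bits and p = 0.1% it gives roughly 1.9e8 keys, in line with the "200 million" above.

```go
package main

import (
	"fmt"
	"math"
)

// keysForCollision returns the approximate number of distinct keys that must
// be hashed before the probability of at least one collision reaches p,
// assuming an ideal hash with the given number of output bits.
// Approximation: n ≈ sqrt(2 * 2^bits * ln(1/(1-p))).
func keysForCollision(bits int, p float64) float64 {
	space := math.Ldexp(1, bits) // 2^bits as a float64
	// -Log1p(-p) equals ln(1/(1-p)) and stays accurate for very small p.
	return math.Sqrt(2 * space * -math.Log1p(-p))
}

func main() {
	for _, p := range []float64{1e-6, 1e-3, 0.5} {
		fmt.Printf("64-bit hash, p=%g: about %.3g keys\n", p, keysForCollision(64, p))
	}
}
```

The p = 0.5 row shows the familiar rule of thumb: a 64-bit hash reaches even odds of a collision after a few billion keys, not after 2^64.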


However, by adding a second 64-bit hash function, you are effectively using a "compound" 128-bit hash, which means that from a practical point of view you can just do the following:

  • instead of using AES to generate the hash, use SHA (one of the functions in the SHA family that is hardware accelerated and provides at least 192 bits);
  • use the first 64 bits of the hash, as you are doing now, to establish a "slot";
  • store the remaining 128 bits beside the actual data;

I think this will provide a much better collision "margin", and will increase the current storage requirements by only 16 bytes per entry.
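
A minimal Go sketch of the split described above. It assumes SHA-256 from the standard library as the hash (the comment does not name a specific SHA function), and `splitDigest` is a hypothetical helper: the first 8 bytes of the digest become the slot, and the next 16 bytes are the check value that would be stored beside the data.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// splitDigest is a hypothetical helper for illustration only.
// It hashes the key with SHA-256 (an assumption) and splits the digest into
// a 64-bit slot and a 128-bit check value.
func splitDigest(key string) (slot uint64, check [16]byte) {
	sum := sha256.Sum256([]byte(key))       // 32-byte digest
	slot = binary.BigEndian.Uint64(sum[:8]) // first 64 bits select the slot
	copy(check[:], sum[8:24])               // next 128 bits are kept with the value
	return slot, check
}

func main() {
	slot, check := splitDigest("user:42")
	fmt.Printf("slot=%016x check=%x\n", slot, check)
}
```

On a lookup, the cache would recompute both parts and treat the entry as a hit only if the stored check value matches.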

from ristretto.

dgryski commented on April 27, 2024

For small keys, minio's implementation has a lot of overhead. From the README: "Note that, because of the scheduling overhead, for small messages (< 1 MB) you will be better off using the regular SHA256 hashing." I have found this to be the case in my benchmarks as well.

minhaj-shakeel commented on April 27, 2024

GitHub issues have been deprecated.
This issue has been moved to discuss. You can follow the conversation there and also subscribe to updates by changing your notification preferences.

andersfylling commented on April 27, 2024

This also means that the metrics from the benchmarks might be affected by collisions, which produce false positives and can inflate the reported hit ratio.

manishrjain commented on April 27, 2024

Yes. We decided to use uint64, knowing that it leaves us open to collisions, to avoid paying a significant memory overhead for large keys.

The idea is that, if this becomes a problem, we'll add another hashing technique (or allow a way to do so in general), which we can use to ascertain whether the key is correct. If there's a collision, we immediately evict the key from the cache. Also, we're just going to assume that the chance of a second hash (from a different algorithm) also colliding is too low to be worth handling.

manishrjain commented on April 27, 2024

AESHash is supported by the Go runtime, and it hashes a 64-byte key in 5ns on my laptop. If we can find a SHA implementation that gives us this kind of performance, we can quickly switch.

cipriancraciun commented on April 27, 2024

[...] we can quickly switch.

But are you considering doing this "switch"? (I.e., introducing a way to check that Get returns the data for the "actual" key, where "actual" means with a high degree of probability.)

Regarding the 192-bit hash function, one could always use the same aeshash function with three different "keys" ("seeds"), which in practice would increase the hash time by a factor of 3.

manishrjain commented on April 27, 2024

I think doing it twice should be enough (128 bits), using the second one for detecting conflict.
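
A minimal sketch of that idea, assuming a toy map-backed cache and the standard library's `hash/maphash` with two independent seeds standing in for the two hash functions. The type and function names are hypothetical, and this is not ristretto's actual implementation. The first hash picks the slot; the second is stored with the value and compared on Get, and a mismatch evicts the entry, as described earlier in the thread.

```go
package main

import (
	"fmt"
	"hash/maphash"
)

// entry keeps the value together with a second, independent hash of the key.
type entry struct {
	check uint64 // conflict hash, compared on every Get
	value string
}

// verifiedCache is a hypothetical toy cache used only for illustration.
type verifiedCache struct {
	slotSeed  maphash.Seed // seed for the hash that selects the slot
	checkSeed maphash.Seed // seed for the hash that detects conflicts
	items     map[uint64]entry
}

func newVerifiedCache() *verifiedCache {
	return &verifiedCache{
		slotSeed:  maphash.MakeSeed(),
		checkSeed: maphash.MakeSeed(),
		items:     make(map[uint64]entry),
	}
}

// hashWith hashes key with the given seed and returns a 64-bit value.
func hashWith(seed maphash.Seed, key string) uint64 {
	var h maphash.Hash
	h.SetSeed(seed)
	h.WriteString(key)
	return h.Sum64()
}

// Set stores value under the slot hash, remembering the conflict hash.
func (c *verifiedCache) Set(key, value string) {
	c.items[hashWith(c.slotSeed, key)] = entry{
		check: hashWith(c.checkSeed, key),
		value: value,
	}
}

// Get returns the value only if both hashes match; on a slot collision
// (same slot, different conflict hash) the entry is evicted.
func (c *verifiedCache) Get(key string) (string, bool) {
	slot := hashWith(c.slotSeed, key)
	e, ok := c.items[slot]
	if !ok {
		return "", false
	}
	if e.check != hashWith(c.checkSeed, key) {
		delete(c.items, slot)
		return "", false
	}
	return e.value, true
}

func main() {
	c := newVerifiedCache()
	c.Set("alpha", "1")
	v, ok := c.Get("alpha")
	fmt.Println(v, ok)
}
```

The extra cost per entry in this sketch is the 8-byte conflict hash plus one additional hash computation on each Set and Get.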

cipriancraciun commented on April 27, 2024

I'm no mathematician or cryptographer, but to me it looks like this should do the trick, given the following two conditions:

  • use the mentioned 128-bit hash (split into the two 64-bit hashes);
  • provided that one doesn't keep the values for "too long";

I would define "too long" as follows:

  • given a probability of 1e-18 (which Wikipedia states is the uncorrectable bit error rate for HDDs),
  • one needs 2.6e10 different keys to reach a collision (with the previously chosen probability),
  • which, if generated at a rate of ~10K per second,
  • should take around 30 days;

I.e. my conclusion (to be on the safe side) is that one shouldn't keep cached data for more than a week, and it should definitely be flushed entirely once a month.
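
The arithmetic behind this estimate can be checked with a short Go sketch. It assumes an ideal 128-bit hash and the usual birthday-bound approximation n ≈ sqrt(2 · 2^bits · ln(1/(1−p))); the function names are hypothetical. With p = 1e-18 it gives about 2.6e10 keys, which at 10,000 new keys per second works out to roughly 30 days.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// keysForCollision approximates how many distinct keys are needed before the
// collision probability of an ideal hash with the given width reaches p.
func keysForCollision(bits int, p float64) float64 {
	space := math.Ldexp(1, bits) // 2^bits as a float64
	// -Log1p(-p) equals ln(1/(1-p)) without losing precision for tiny p.
	return math.Sqrt(2 * space * -math.Log1p(-p))
}

// timeToReach returns how long it takes to generate n distinct keys at the
// given rate (keys per second).
func timeToReach(n, keysPerSecond float64) time.Duration {
	return time.Duration(n / keysPerSecond * float64(time.Second))
}

func main() {
	n := keysForCollision(128, 1e-18)
	d := timeToReach(n, 10000)
	fmt.Printf("keys: %.2g, time: %.1f days\n", n, d.Hours()/24)
}
```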

6ecuk commented on April 27, 2024

AESHash is supported by the Go runtime, and it hashes a 64-byte key in 5ns on my laptop. If we can find a SHA implementation that gives us this kind of performance, we can quickly switch.

@manishrjain
Hi, maybe take a look at https://github.com/minio/sha256-simd

manishrjain commented on April 27, 2024

Yeah, minio sha256 looks useful.

karlmcguire commented on April 27, 2024

Fixed in #88.

templexxx commented on April 27, 2024

Yes. We decided to use uint64, knowing that it leaves us open to collisions, to avoid paying a significant memory overhead for large keys.

The idea is that, if this becomes a problem, we'll add another hashing technique (or allow a way to do so in general), which we can use to ascertain whether the key is correct. If there's a collision, we immediately evict the key from the cache. Also, we're just going to assume that the chance of a second hash (from a different algorithm) also colliding is too low to be worth handling.

I think what you want is Cuckoo hashing.

martinmr commented on April 27, 2024

@karlmcguire Hey. Is this fixed? Your comment says this was fixed by #88 but you closed and reopened this immediately. If it's not fixed, what else should be done to close this ticket?
