Comments (14)
(The statements below assume that you use only a single 64-bit hash function.)
Granted, with an AES-based hashing algorithm the chance of a collision between exactly two entries is 1 in 2^64 (the "birthday effect" only comes into play as the number of keys grows).
However, quoting the probability table in the Wikipedia article on the birthday problem, it would take only about 200 million different keys to reach a 0.1% chance of a collision (these keys don't necessarily all have to be stored at the same time).
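As a rough check of that figure, here is a minimal sketch in Go of the standard birthday approximation p ≈ 1 - exp(-n² / 2^(bits+1)). The function name collisionProbability is my own, and the formula assumes uniformly distributed hashes. For 200 million keys and a 64-bit hash it gives roughly 0.1%.

```go
package main

import (
	"fmt"
	"math"
)

// collisionProbability approximates the chance that at least two of n
// uniformly distributed hashes of the given bit width collide, using the
// birthday approximation p ≈ 1 - exp(-n² / 2^(bits+1)).
func collisionProbability(n float64, bits int) float64 {
	// Expm1 keeps precision when the exponent is very close to zero.
	return -math.Expm1(-(n * n) / math.Ldexp(1, bits+1))
}

func main() {
	// 200 million distinct keys hashed to 64 bits.
	fmt.Printf("%.4f%%\n", 100*collisionProbability(2e8, 64))
}
```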
However, by adding a second 64-bit hash function you are effectively using a "compound" 128-bit hash, which means that from a practical point of view you can just do the following:
- instead of using AES to generate the hash, use SHA (one of the functions in the SHA family that is hardware accelerated and provides at least 192 bits);
- use the first 64 bits of the hash, as you are doing now, to establish a "slot";
- store the remaining 128 bits beside the actual data;
I think this will provide a much better collision "margin", and will increase the current storage requirements by only 16 bytes.
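A minimal sketch of the split described above, assuming SHA-256 from the Go standard library as the SHA variant. The helper name splitHash and the big-endian interpretation of the first 8 bytes are my own choices for illustration, not anything ristretto does.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// splitHash is a hypothetical helper: it hashes key with SHA-256 and returns
// the first 64 bits as the slot plus the next 128 bits as a check value that
// would be stored beside the cached data (16 extra bytes per entry).
func splitHash(key []byte) (slot uint64, check [16]byte) {
	sum := sha256.Sum256(key)
	slot = binary.BigEndian.Uint64(sum[:8])
	copy(check[:], sum[8:24])
	return slot, check
}

func main() {
	slot, check := splitHash([]byte("user:42"))
	fmt.Printf("slot=%016x check=%x\n", slot, check)
}
```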
from ristretto.
For small keys, minio's implementation has a lot of overhead. From the readme: "Note that, because of the scheduling overhead, for small messages (< 1 MB) you will be better off using the regular SHA256 hashing". I have found this to be the case in my benchmarks as well.
from ristretto.
Github issues have been deprecated.
This issue has been moved to discuss. You can follow the conversation there and also subscribe to updates by changing your notification preferences.
from ristretto.
This also means the metrics from the benchmarks might be skewed by collisions, which show up as false positives and inflate the reported hit ratio.
from ristretto.
Yes. We decided to use uint64, knowing that it leaves us open to collisions, to avoid paying a significant memory overhead with large keys.
The idea is that, if this becomes a problem, we'll add another hashing technique (or allow a way to do so in general), which we can use to ascertain whether the key is correct or not. If there's a collision, we immediately evict the key from the cache. Also, we're just going to assume that the chance of a second hash (from a different algorithm) also colliding is too low to be worth handling.
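One way to read that idea, as a sketch only: keep the second hash next to the value and compare it on every lookup. The types and method names below (entry, store, Set, Get) are hypothetical and do not reflect ristretto's actual internals.

```go
package main

import "fmt"

// entry stores the secondary hash next to the value so a lookup can be verified.
type entry struct {
	conflict uint64
	value    string
}

// store is a hypothetical map-backed cache keyed by the primary 64-bit hash.
type store map[uint64]entry

// Set writes the value, replacing whatever held the slot before, including
// an entry that belonged to a different, colliding key.
func (s store) Set(keyHash, conflict uint64, value string) {
	s[keyHash] = entry{conflict: conflict, value: value}
}

// Get returns the value only if the secondary hash matches. On a mismatch the
// slot belongs to a colliding key, so it is evicted and reported as a miss.
func (s store) Get(keyHash, conflict uint64) (string, bool) {
	e, ok := s[keyHash]
	if !ok {
		return "", false
	}
	if e.conflict != conflict {
		delete(s, keyHash)
		return "", false
	}
	return e.value, true
}

func main() {
	s := store{}
	s.Set(1, 100, "a")
	_, ok := s.Get(1, 200) // same primary hash, different secondary hash
	fmt.Println(ok)        // prints "false": the collision is detected
}
```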
from ristretto.
AESHash is supported by the Go runtime, and it hashes a 64-byte key in 5ns on my laptop. If we can find a SHA implementation that gives us this kind of performance, we can quickly switch.
from ristretto.
[...] we can quickly switch.
But are you considering doing this "switch"? (I.e. introducing a way to check that Get returns the data for the "actual" key, where "actual" means with a high degree of probability.)
Regarding the 192-bit hash function, one could always use the same aeshash function with three different "keys" ("seeds"), which in practice would increase the hashing time by a factor of 3.
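The seeded approach can be sketched with the standard library's hash/maphash package, which exposes a seedable 64-bit hash (as far as I know it is backed by the runtime's own hash, but treat that as an assumption). The sketch uses two seeds for brevity; a third would work the same way. The helper name hash128 is hypothetical, and maphash.String needs Go 1.19 or later.

```go
package main

import (
	"fmt"
	"hash/maphash"
)

// Two independent seeds, created once per process. maphash seeds are random,
// so the resulting hashes are only stable within a single run.
var (
	seedA = maphash.MakeSeed()
	seedB = maphash.MakeSeed()
)

// hash128 is a hypothetical helper returning two 64-bit hashes of key:
// one to locate the entry and one to verify it.
func hash128(key string) (keyHash, conflict uint64) {
	return maphash.String(seedA, key), maphash.String(seedB, key)
}

func main() {
	k, c := hash128("user:42")
	fmt.Printf("key=%016x conflict=%016x\n", k, c)
}
```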
from ristretto.
I think doing it twice should be enough (128 bits), using the second hash to detect conflicts.
from ristretto.
I'm no mathematician or cryptographer, but for me it should do the trick given the following two conditions:
- use the mentioned 128-bit hash (split into two 64-bit hashes);
- provided that one doesn't keep the values for "too long";
I would define "too long" as follows:
- given a probability of 1e-18 (which Wikipedia states is the uncorrectable bit error rate for HDDs),
- one needs 2.6e10 different keys to reach a collision (with that probability),
- which, if generated at a rate of ~10K per second,
- should take around 30 days;
I.e. my conclusion (to be on the safe side) is that one shouldn't keep cached data for more than a week, and it should definitely be flushed entirely once a month.
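The estimate above can be reproduced with a short sketch that inverts the birthday approximation p ≈ n² / 2^(bits+1), which holds for small p. The function name keysForProbability is my own. For a 128-bit hash and p = 1e-18 it comes out at about 2.6e10 keys, which at 10K new keys per second is about a month.

```go
package main

import (
	"fmt"
	"math"
)

// keysForProbability inverts the birthday approximation p ≈ n² / 2^(bits+1)
// (valid for small p) to get the number of distinct keys n at which the
// collision probability reaches p.
func keysForProbability(p float64, bits int) float64 {
	return math.Sqrt(p * math.Ldexp(1, bits+1))
}

func main() {
	keys := keysForProbability(1e-18, 128)
	seconds := keys / 10_000 // assumed rate: 10K new keys per second
	fmt.Printf("keys=%.2g days=%.1f\n", keys, seconds/86400)
}
```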
from ristretto.
AESHash is supported by the Go runtime, and it hashes a 64-byte key in 5ns on my laptop. If we can find a SHA implementation that gives us this kind of performance, we can quickly switch.
@manishrjain
Hi, maybe take a look at https://github.com/minio/sha256-simd
from ristretto.
Yeah, minio sha256 looks useful.
from ristretto.
Fixed in #88.
from ristretto.
Yes. We decided to use uint64, knowing that it leaves us open to collisions, to avoid paying a significant memory overhead with large keys.
The idea is that, if this becomes a problem, we'll add another hashing technique (or allow a way to do so in general), which we can use to ascertain whether the key is correct or not. If there's a collision, we immediately evict the key from the cache. Also, we're just going to assume that the chance of a second hash (from a different algorithm) also colliding is too low to be worth handling.
I think what you want is cuckoo hashing.
from ristretto.
@karlmcguire Hey. Is this fixed? Your comment says this was fixed by #88 but you closed and reopened this immediately. If it's not fixed, what else should be done to close this ticket?
from ristretto.