bits-and-blooms / bloom Goto Github PK
View Code? Open in Web Editor NEWGo package implementing Bloom filters, used by Milvus and Beego.
License: BSD 2-Clause "Simplified" License
Go package implementing Bloom filters, used by Milvus and Beego.
License: BSD 2-Clause "Simplified" License
With version 3.0.1 I'm seeing TestHashBasic
and TestHashRandom
fail on s390x:
--- FAIL: TestHashBasic (0.00s)
murmur_test.go:29: Backward compatibillity break.
murmur_test.go:29: Backward compatibillity break.
[...]
--- FAIL: TestHashRandom (0.02s)
murmur_test.go:60: Backward compatibillity break.
murmur_test.go:60: Backward compatibillity break.
[...]
FAIL
exit status 1
FAIL github.com/willf/bloom 0.217s
You can see the full output in https://koji.fedoraproject.org/koji/taskinfo?taskID=73367600 (look at build.log).
The link to reference in pkg.go.dev should be https://pkg.go.dev/github.com/bits-and-blooms/bloom/v3 other than https://pkg.go.dev/github.com/bits-and-blooms/bloom . The latter one is the doc for v2, the former one is for v3
if not,i should use lock/unlock to use Add method?thanks for your answer
When I type in
go get github.com/willf/bloom
I get the result
package bitset: unrecognized import path "bitset"
I see that you wrote the package bit set, however, the way you import it doesn't work with go get. I'm not sure if a bug, but something to consider.
Edit: I found your other package. Unfortunately, there is a change before it works. I believe the Sum function requires the []byte to be passed in.
Hello,
when trying to add dependency:
github.com/bits-and-blooms/bloom/v3 v3.0.0
I get
go: github.com/bits-and-blooms/bloom/[email protected]: go.mod has non-.../v3 module path "github.com/bits-and-blooms/bloom" (and .../v3/go.mod does not exist) at revision v3.0.0
I found that the /v3
suffix has been added to master, but is still missing on v3.0.0
tag.
this is spam
Hi,
I couldn't figure out any other way to contact you about this. What license are you releasing this code under?
Thanks!
as title, the release is 3.1.0
, but ..
https://github.com/bits-and-blooms/bloom/blob/master/VERSION
the bitset is
type BitSet struct { length uint set []uint64 }
now.
cause the bloomFilter used more memory
I noticed that for computing multiple hashes, you make use of the work of Less Hashing, Same Performance, which is calculated by: gi(x) = h1(x)+ih2(x) . For that you generate a 256 bits long hash partitioned into 4 uint64
s. So I wonder why you decided to generate a 256 long hash partitioned into 4 uint64
s. instead of 128 long hash partitioned into 2 uint64
s 🤔 Wouldn't it be the same regarding hashing, but with a performance improvement?
Greetings,
We've encountered a strange false positive case with a filter provisioned with single capacity and 1:billion error rate. If we increase the filter capacity to two or higher the false positive disappears, or if we keep the capacity at one and use a different error rate (e.g. 1:million) the false positive disappears.
A test to illustrate against the latest release v3.6.0:
func TestBitsAndBloom(t *testing.T) {
assert := assert.New(t)
// Capacity = 1 with 1:million error rate - passes
filter := bitsandbloom.NewWithEstimates(1, 0.00000001)
filter.AddString("d39bafed5f63e8e85c4bbcf171e3a7bb4f8542dc4237cd94d0e0ff7c08e5c685")
assert.True(filter.TestString("d39bafed5f63e8e85c4bbcf171e3a7bb4f8542dc4237cd94d0e0ff7c08e5c685"), "Should exist") // Passes
assert.False(filter.TestString("8854e4a0a9b78632e8de91ffb973336a7a4f1b9a89520149ae311416de42d162"), "Should not exist") // Passes
// Capacity = 2 with 1:billion error rate - passes
filter = bitsandbloom.NewWithEstimates(2, 0.000000001)
filter.AddString("d39bafed5f63e8e85c4bbcf171e3a7bb4f8542dc4237cd94d0e0ff7c08e5c685")
assert.True(filter.TestString("d39bafed5f63e8e85c4bbcf171e3a7bb4f8542dc4237cd94d0e0ff7c08e5c685"), "Should exist") // Passes
assert.False(filter.TestString("8854e4a0a9b78632e8de91ffb973336a7a4f1b9a89520149ae311416de42d162"), "Should not exist") // Passes
// Capacity = 1 with 1:billion error rate - fails
filter = bitsandbloom.NewWithEstimates(1, 0.000000001)
filter.AddString("d39bafed5f63e8e85c4bbcf171e3a7bb4f8542dc4237cd94d0e0ff7c08e5c685")
assert.True(filter.TestString("d39bafed5f63e8e85c4bbcf171e3a7bb4f8542dc4237cd94d0e0ff7c08e5c685"), "Should exist") // Passes
assert.False(filter.TestString("8854e4a0a9b78632e8de91ffb973336a7a4f1b9a89520149ae311416de42d162"), "Should not exist") // Fails
}
We appreciate a single item capacity filter isn't the intended use case of a bloom filter (an edge case on our side), but the behaviour is curious. If not considered a bug, could readme guidance be included on any minimum capacity to achieve the required error rate?
Thanks,
-Dotjim
I am not writing Go code anymore. Would someone like to take over as maintainer? @lemire ?
filter := bloom.NewWithEstimates(1000000, 0.01)
for a cdn deployment with 100 million items, what's a good number for the bloom item count and false positive rate?
also how much memory is needed for this bloom filter?
i was thinking if it's using a lot of memory, would it be advisable to use roaring bitmaps as "replacement" to bloom filter as suggested by chatgpt here:
RoaringBitmap/roaring#388
go get -u github.com/willf/bloom
NOTICE:
go: downloading github.com/willf/bloom v1.0.0
go: downloading github.com/willf/bloom v2.0.3+incompatible
go: downloading github.com/willf/bitset v1.2.2
go: downloading github.com/spaolacci/murmur3 v1.1.0
go get: github.com/willf/bitset@none updating to
github.com/willf/[email protected]: parsing go.mod:
module declares its path as: github.com/bits-and-blooms/bitset
but was required as: github.com/willf/bitset
Hey,
I have a question if it is possible to hash strings not depending on the order of strings? I mean
hash("black cat") = hash("cat black")
this generates false negatives because changing order of sentence results in totally different hash. Is there any option for it? Maybe splitting sentences and combining hashed strings?
Thank you for any response.
Is this matter of licensing issues or are there technical reasons to use murmur3 over xxHash?
I wrote a simple code to test the FP%, and found that FP% was good (~2%) on small sets, but suddenly began increasing at some scale, and reached 100% very quickly.
func data(i uint) []byte {
return []byte(fmt.Sprintf("%x", md5.Sum([]byte(string(i)))))
}
func falsePositive(n, b, k uint) float64 {
f := bloom.New(uint(n * b), k)
for i := uint(0); i < n; i++ {
f.Add(data(i))
}
fp, rounds := 0, 10000
for i := 0; i < rounds; i++ {
if f.Test(data(uint(i) + n)) {
fp++
}
}
return float64(fp)/float64(rounds)
}
func main() {
fmt.Println(falsePositive(1000, 8, 4))
fmt.Println(falsePositive(10000, 8, 4))
fmt.Println(falsePositive(100000, 8, 4))
fmt.Println(falsePositive(1000000, 8, 4))
fmt.Println(falsePositive(1104000, 8, 4))
fmt.Println(falsePositive(1108000, 8, 4))
fmt.Println(falsePositive(1112000, 8, 4))
fmt.Println(falsePositive(1116000, 8, 4))
}
On macbook pro 64-bit, the output was:
0.0222
0.0237
0.0213
0.0225
0.0241
0.4028
0.7934
1
My project needs a filter over tens of million items, but the filter fails at scale of ~1 million due to 100% false positives.
n := uint(1000)
filter := bloom.New(20*n, 5) // load of 20, 5 keys
filter.Add([]byte("Love")
What does it mean by load of 20,5 keys and why are we multiplying by 1000 ? would love an explanation. thanks
Hello!
I would like to know if it's possible to make this library compatible with Guava (the Java library by Google).
When I try to load a Bloom Filter using Guava, it seems it can't parse the exported file correctly:
java.io.IOException: Unable to deserialize BloomFilter from InputStream. strategyOrdinal: 0 numHashFunctions: 0 dataLength: 797
at com.google.common.hash.BloomFilter.readFrom(BloomFilter.java:636)
Guava is the most used library for Bloom Filter in Java, and certain use cases require its use in my Java project, alongside other microservices that were written in Go.
Thanks!
At line 77 EstimateFalsePositiveRate returns the number of false positives found divided by 100. This makes no sense as the number of tests performed depends on the input number. Choosing a higher n
will now increase the number of false positives by a too high amount.
The correct return value would be the number of false positives found depending on the number of tests performed:
fp_rate = float64(fp) / float64(rounds) * float64(100)
Hi Will, is one hash really enough for a bloom filter?
Using one is efficient, but if fnv(A) and fnv(B) collide then surely the multiplication you're doing will simply spread the collision across all the buckets in the same way? That doesn't provide any net benefit to avoiding false-positives by increasing the number of hashes.
Non-fnv-collisions (same bucket chosen from a different hash) do benefit still though, and there will be more of those.
I'm interested in the rationale. Thanks for sharing your implementation too, I'm grateful.
Cheers,
Bruce
I'll be working on this repo and the underlying bitset project on Dec 2 as part of GitHub's open source Friday efforts.
If you have suggestions, please put them in comments below.
Could you possibly set another release tag? v1 is missing some of the useful String functions and JSON marshaling which I'd like to use with the security of gopkg.
Thanks!
when I importing bloom from master branch go get -u github.com/bits-and-blooms/bloom@master
error appear that bitset package has invalid go.mod module github.com/willf/bitset
Hey there!
I belong to an open source security research community, and a member (@akincibor) has found an issue, but doesn’t know the best way to disclose it.
If not a hassle, might you kindly add a SECURITY.md
file with an email, or another contact method? GitHub recommends this best practice to ensure security issues are responsibly disclosed, and it would serve as a simple instruction for security researchers in the future.
Thank you for your consideration, and I look forward to hearing from you!
(cc @huntr-helper)
This is happing because we are type casting bloom filter size estimation to uint32 and its rounding numbers back to within uint32 range,
we should prefer to typecast it to uint64 so that it can derive correct result
WARNING: DATA RACE
Read at 0x0000002f40e8 by goroutine 8:
github.com/willf/bloom.fnv64Hash()
github.com/willf/bloom/_test/_obj_test/bloom.go:86 +0x134
github.com/willf/bloom.baseHashes()
github.com/willf/bloom/_test/_obj_test/bloom.go:95 +0x92
github.com/willf/bloom.(*BloomFilter).Test()
github.com/willf/bloom/_test/_obj_test/bloom.go:192 +0x92
github.com/willf/bloom.TestConcurrent.func2()
/Users/will/code/go/src/github.com/willf/bloom/bloom_test.go:43 +0x83
s@binary/encoding@encoding/binary@
is there a way to save results to disk when reboot?
hi,
Is this library support persistence feature ?
such as :
filter := bloom.New(20*n, 5)
filter.save(file_name);
or
filter.load(file_name);
thanks.
Hi Will,
Really appreciate your time implementing Bloom. We have opted to use it in a new project of ours at work, and while reviewing I have found a problem with your custom implementation of FNV. I threw together a test in a branch of my fork to help illustrate. You can find that here: https://github.com/turgon/bloom-1/blob/uniformity/uniformity_test.go
The arrangement of the test is pretty straightforward. The hashing method under test is used to determine the location of a bit to set in the filter, just like it is during your Add() method. I count up the number of times each bit is hit, and then compare the histogram to the uniform distribution using a Chi Square goodness of fit test.
I tested five different methods of finding bit locations:
The results indicate that your two custom implementations of FNV are not uniformly distributing mod m. In practice one could size up the filter in m and k and not suffer terribly, however, it certainly will increase the false positive rate in very unpredictable ways.
I think the problem stems from your addition of the index on line 77.
That addition makes the initial value differ from FNV for indices i greater than zero. My suspicion is that fnv64Hash(0, data) is ok, but the others are not. Also, the others (1, 2, 3) have no influence on the bit location when ii is zero on line 99. I believe this causes the first bit location to be uniformly distributed, but not the rest.
After further research I introduced Go's native FNV-1 and FNV-1a methods as a baseline, and although they performed surprisingly well, there are only 32 and 64 bit implementations. Therefore the g(x) = h0(x) + i*h1(x) trick can only be applied for filters with m < 2^32 when splitting 64-bit FNV.
Cassandra uses Murmur3 instead of FNV for its bloom filters and the linked article hints at why. It satisfies the requirement of Kirsch and Mitzenmacher,
Our k hash functions will be of the form gi(x) = h1(x)+ih2(x) mod p, where h1(x) and h2(x) are two independent, uniform random hash functions on the universe with range {0, 1, 2, . . . , p − 1}, and throughout we assume that i ranges from 0 to k − 1.
I.e.: Uniformity is critical or the math doesn't hold up.
I also prefer Murmur3 as it produces 128-bit values, which lends itself well to addressing 64-bit locations, while not suffering the kind of performance hit a cryptographic hash takes.
I hope this was clear enough, and you can find it useful. Thank you!
After some confusion I realized EstimateFalsePositiveRate only does 10000 rounds, which is not enough to accurately estimate probability below 0.5% or so. I did the following, but maybe the number of rounds (here, x) should be a parameter of the function so the user can configure it.
const x = 10000000
for i := uint32(0); i < uint32(x); i++ {
binary.BigEndian.PutUint32(n1, i+uint32(n)+1)
if f.Test(n1) {
fp++
}
}
fp_rate = float64(fp) / float64(x)
Thanks for the library, by the way, it's awesome!
Currently there is no elegant solution to export only the bloom filter bitset. That makes it difficult to engineer custom serialization formats.
I'm eager to submit a PR to change that if we could agree on the design.
Of the top of my head we could do one of those things:
BloomFilter.b
– the simplest solution. However will allow developers to shoot themselves in the foot.BloomFilter.b
. *BloomFilter.BisetCopy
?*BloomFilter.WriteBitSetTo()
function?*BloomFilter.BitSetWriter()
function that would return atype BitSetWriter interface {
WriterTo
}
I use bloom
with cache to check whether entry is exist or not before access to cache.
above situation, to reduce false positive over the long time, i need to refresh bloom filter periodically.
Is possible adding to Clear
method which set bit to zero? because i want to clear partial bits which are sampled not all.
bloom.NewWithEstimates(0, 1.0e-6).TestString("boom")
Some of the tests are faulty. TestDirect20_5 for example. If we use the formula from estimateParameters (which is the same as the formula on wikipedia). We get that for 10000 entries and a 0.0001% precision (p = 0.000001 because EstimateFalsePositiveRate
(after fix #20) returns a 100-0 range and estimateParameters
uses a 1-0 range) the hash should have 287,552 bits and 20 hashing functions. The values used in the test (200,000 bits and only 5 hashing functions) there for result in a much higher false positive chance of 0.03%.
Even with the #20 fix the tests still pass but this is only because the Fnv hashing functions used results in a really weird false positive rate for different inputs. Some inputs result in a much lower false positive rate (which I still don't understand) and many other result in a much higher false positive rate than expected.
For testing I quickly hacked a version which uses the Sha256 hashing function and the research from Less Hashing, Same Performance: Building a Better Bloom Filter (also used in my redis bloom filter). This version always results in the correct false positive rate but is around 26 times slower. If you are interested I can make a fork with this code so you can see if it's interesting.
The link in your README to the murmerhash algorithm goes to a dead link.
Example Address List.
0af80f98e511577e5c87d56069896c70fe4a6263
0af81021340c23bb20429e47c3bdd62137c322d3
0af811994346cdca9d6dbe8e7da4f3393130fc71
0af81250856d377cf7f6914ebe196a9570ada313
0af814b46c4ef5c57dbc53ae03c812eea37d4aac
0af816bb57c0adac8bca0401d295724cd80b0956
64dfb83dcca1f516d2bac93c8c9e4eb6513fd364
64dfb8c23802f3aa67c9078c0e7970fa17584892
64dfb8c5d21743ba3dd8cef5b576e1ddf6d4c239
64dfb931e9d68aa68842efa3c5d0bc3c1101b5ec
64dfba34a39823c273a33fadf49112c063ec3724
It gives quite a lot of positive and false results and for this reason I cannot use it.
I want to use bloom filter to filter about 25 million hash160 addresses.
How can I use the settings more efficiently?
I want to use it for this project.
https://github.com/oguzdelioglu/odelbrain
It seems that Go's behavior when casting Positive Infinity to uint32 is architecture specific. This causes theApproximatedSize
calculation to return an inconsistent value between architectures.
If you run this:
saturatedBloom := bloom.FromWithM(bitset.New(10000).Complement().Bytes(), 10000, 4)
fmt.Println(saturatedBloom.ApproximatedSize())
In amd64, it prints 0
, while if you run it on arm64, it prints the max value for uint32
.
For a smaller example:
fmt.Println(uint32(math.Inf(1)))
Is equivalent.
I'd expect a saturated bloom to return a very high number and not zero, so I'd propose checking first for a saturated bloom and returning max uint32 explicitly.
See: golang/go#56023
I'm creating bloom filter from a list of strings with NewWithEstimates
. I pass double size of the list as argument n
. If the list is empty, n
is 0.
n := 0
f := bloom.NewWithEstimates(n * 2, 0.01)
f.Get([]byte("test"))
panic: runtime error: integer divide by zero
I can add a check for n == 0
and put another value instead of 0, however it would be nice if the module itself did this check:
diff --git a/bloom.go b/bloom.go
index e19ba10..b7f74c7 100644
--- a/bloom.go
+++ b/bloom.go
@@ -72,6 +72,9 @@ type BloomFilter struct {
// New creates a new Bloom filter with _m_ bits and _k_ hashing functions
func New(m uint, k uint) *BloomFilter {
+ if m == 0 {
+ m = 1
+ }
return &BloomFilter{m, k, bitset.New(m)}
}
On big-endian architecture bytes of data unpacked (1, 2) to uint64 as big-endian resulting in wrong hash value. Data bytes must be unpacked as little-endian for correct hash calculation.
Possible way to fix that issue:
diff --git a/murmur.go b/murmur.go
index c8947e1..c93b1ba 100644
--- a/murmur.go
+++ b/murmur.go
@@ -39,6 +39,7 @@ SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
package bloom
import (
+ "encoding/binary"
"math/bits"
"unsafe"
)
@@ -59,8 +60,8 @@ type digest128 struct {
func (d *digest128) bmix(p []byte) {
nblocks := len(p) / block_size
for i := 0; i < nblocks; i++ {
- t := (*[2]uint64)(unsafe.Pointer(&p[i*block_size]))
- k1, k2 := t[0], t[1]
+ b := (*[16]byte)(unsafe.Pointer(&p[i*block_size]))
+ k1, k2 := binary.LittleEndian.Uint64(b[:8]), binary.LittleEndian.Uint64(b[8:])
d.bmix_words(k1, k2)
}
}
@@ -269,8 +270,8 @@ func (d *digest128) sum256(data []byte) (hash1, hash2, hash3, hash4 uint64) {
// we do not want to append to an actual array!!!
if tail_length+1 == block_size {
// We are left with no tail!!!
- word1 := *(*uint64)(unsafe.Pointer(&tail[0]))
- word2 := uint64(*(*uint32)(unsafe.Pointer(&tail[8])))
+ word1 := binary.LittleEndian.Uint64(tail[:8])
+ word2 := uint64(binary.LittleEndian.Uint32(tail[8:8+4]))
word2 = word2 | (uint64(tail[12]) << 32) | (uint64(tail[13]) << 40) | (uint64(tail[14]) << 48)
// We append 1.
word2 = word2 | (uint64(1) << 56)
When upgrading to latest version we observe this error:
panic: runtime error: index out of range [14977] with length 14977
goroutine 761 [running]:
github.com/bits-and-blooms/bitset.(*BitSet).ReadFrom(0xc00036ff00, {0x50e95a0?, 0xc0003b8dc0?})
github.com/bits-and-blooms/[email protected]/bitset.go:955 +0x392
github.com/bits-and-blooms/bloom/v3.(*BloomFilter).ReadFrom(0xc001096a50, {0x50e95a0, 0xc0003b8dc0})
github.com/bits-and-blooms/bloom/[email protected]/bloom.go:367 +0xd6
github.com/minio/minio/cmd.(*dataUpdateTracker).deserialize(0xc000eb4850, {0x50e95a0, 0xc0003b8dc0}, {0x1b?, 0xc001344f00?, 0x0?})
github.com/minio/minio/cmd/data-update-tracker.go:434 +0x487
github.com/minio/minio/cmd.(*dataUpdateTracker).load(0xc000eb4850, {0x50fa338, 0xc000ff7280}, {0xc001254700?, 0x64, 0xc000128c80?})
[...]
We are de-serializing into a new instance of &bloom.BloomFilter{}
.
Possible regression from bits-and-blooms/bitset#103
What is max number of elements in bloom filter? If it will be 200-300 millions - its ok?
// Equivalent to calling Test(data) then Add(data). Returns the result of Test.
func (f *BloomFilter) TestAndAdd(data []byte) bool {
f.locations(data)
for i := uint(0); i < f.k; i++ {
if !f.b.Test(f.locBuff[i]) {
f.present = false
}
f.b.Set(f.locBuff[i])
}
return f.present
}
needs f.present = true initialization
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.