seiflotfy / cuckoofilter Goto Github PK



Cuckoo Filter: Practically Better Than Bloom

License: MIT License

Go 100.00%

cuckoofilter's Introduction

🔭 Work: Axiom
💬 Passion: Data-Structures, DBs, Algorithms


cuckoofilter's Issues

PR #36 broke this repo

The patch modified the go.mod file so that it now declares the module as coming from @seven4x, which breaks the import.

go: github.com/seiflotfy/cuckoofilter: github.com/seiflotfy/[email protected]: parsing go.mod:
	module declares its path as: github.com/seven4x/cuckoofilter
	        but was required as: github.com/seiflotfy/cuckoofilter

How to dumps/loads cuckoofilter from file?

I have three questions before importing cuckoofilter into my project:

How do I dump a cuckoofilter to a file and load it back, for failover?
What is the recommended ratio of data size to cuckoofilter size?
How do I resize a cuckoofilter as the data set grows?

Not a big deal, but cuckoofilter.NewCuckooFilter() is a bit repetitive for my taste -- would you be open to renaming the types/functions to New() and NewDefault() (not sure if NewDefault is really necessary either, if you want to shrink your API surface a bit) and Filter?

In fact, the package could be renamed to just cuckoo and then you'd have cuckoo.NewFilter() and cuckoo.Filter which is a little more natural.

Just my 2 cents!

Adversarial resistance

Are the cuckoo filters implemented in this library resistant to adversarial attempts to induce false positives on chosen elements as described in this paper?

Thanks for reading and thank you for this implementation!

Documenting Thread-safety

It appears that this implementation is not safe for concurrent use, which I didn't see written in the documentation. Additionally, a thread-safe implementation and additional testing would make this solution more appealing.

../pkg/mod/github.com/seiflotfy/[email protected]/util.go:17:14: invalid operation: 1 << i (shift count type int, must be unsigned integer)

Hi everyone, I am working on a project called infinicache, which uses this filter. The system is composed of a Client, a Proxy, and some Nodes; the Client is the component that uses the filter. I modified some code on the Proxy side (a very small change: I just added a field to a structure), and for two days now the cuckoo filter has been causing problems. When I run the client file, I get this:

$ go run client/example/main.go
# github.com/seiflotfy/cuckoofilter
../pkg/mod/github.com/seiflotfy/[email protected]/util.go:17:14: invalid operation: 1 << i (shift count type int, must be unsigned integer)

Since I am not a Go expert, I do not understand what kind of error this is. It clearly mentions cuckoofilter, so it looks like a compile error to me, but I am not sure. Nothing I log reaches the screen when I run the file, which is why I think it is a compile error. However, your master branch has not changed in 24 days, so the library code should not have changed.

Please help me understand why I get this error.

Thanks in advance
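
For context: Go toolchains before 1.13 required shift counts to be unsigned, so an expression such as `1 << i` with `i` of type `int` is rejected there with exactly this message. The error therefore most likely comes from the compiler version (or the `go` directive in go.mod) rather than from a change in the library. A minimal sketch, assuming a hypothetical helper `powerOfTwo` that is not the library's code, of a form that compiles on both old and new toolchains:

```go
package main

import "fmt"

// powerOfTwo returns 1 << i. The explicit uint conversion is what
// pre-1.13 compilers require; newer ones accept a signed count as well.
// Hypothetical helper for illustration only. A negative i converts to a
// very large unsigned count, so the shift yields 0.
func powerOfTwo(i int) uint64 {
	return uint64(1) << uint(i)
}

func main() {
	fmt.Println(powerOfTwo(10)) // 1024
}
```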

Warning message when fetching the source locally

go version is "go1.5beta2 darwin/amd64"

//util.go
warning: code.google.com is shutting down; import path code.google.com/p/gofarmhash will stop working
warning: package github.com/seiflotfy/cuckoofilter
    imports code.google.com/p/gofarmhash

Please prefer the GitHub-hosted gofarmhash.
Thank you for making this project.

Filter is unreliable?

I imported this package into my project, thank you!
However, I found the filter unreliable. When I load about 500,000 records from the database and pass them through InsertUnique, it returned true for about 4,000 of them. Does that mean those records are duplicates?
The database table has a unique primary key, though, and I have confirmed that the data is not duplicated in the database.

Question: FPP rate

How do I calculate the load limit for a false positive probability (FPP) of 1%?
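
A rough way to reason about this: a lookup probes two buckets of b slots each, and each occupied slot matches a random fingerprint of f bits with probability 1/2^f, which gives the well-known bound of about 2b/2^f for a full filter. The sketch below assumes b = 4 and f = 8 (the bucket size and byte-sized fingerprint that appear elsewhere on this page) and additionally assumes that the rate scales with the fraction of occupied slots; that scaling is an approximation, not a statement about the library.

```go
package main

import (
	"fmt"
	"math"
)

// fppAtLoad approximates the false positive probability when a fraction
// `load` of the slots is occupied. Assumes fingerprints are uniform.
func fppAtLoad(fpBits, bucketSize int, load float64) float64 {
	perSlot := 1.0 / math.Pow(2, float64(fpBits))
	probes := 2 * float64(bucketSize) * load
	return 1 - math.Pow(1-perSlot, probes)
}

// maxLoadFor inverts fppAtLoad: the largest load that keeps the
// approximate false positive probability at or below target.
func maxLoadFor(target float64, fpBits, bucketSize int) float64 {
	perSlot := 1.0 / math.Pow(2, float64(fpBits))
	load := math.Log(1-target) / (2 * float64(bucketSize) * math.Log(1-perSlot))
	return math.Min(load, 1)
}

func main() {
	fmt.Printf("%.3f\n", fppAtLoad(8, 4, 1.0))   // about 0.031
	fmt.Printf("%.3f\n", maxLoadFor(0.01, 8, 4)) // about 0.32
}
```

Under these assumptions, an 8-bit fingerprint cannot reach 1% at full load; one would either keep the filter roughly one third full or use a wider fingerprint.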

Calculate memory footprint of cuckoo filter

Hello,
How can I calculate the memory footprint of the cuckoo filter I am creating?
I am benchmarking this implementation against a Bloom filter here:

What I am interested in:

Time taken to insert 500,000 strings
Time taken to look up 1,000,000 strings, of which 500,000 are the ones inserted above, so false positives are checked as well
Memory footprint

So I need to know how much memory the
type CuckooFilter struct {
	buckets []bucket
	count   uint
}
consumes.

What I am trying:
func (cf *CuckooFilter) GetSize() uintptr {
	return unsafe.Sizeof(cf.count) + unsafe.Sizeof(cf.buckets) + uintptr(len(cf.buckets)*24) + uintptr(len(cf.buckets)*24*4)
}

It reports about 15 MB, perhaps because each slice in Go carries an additional 24-byte header.

Effective Method Naming

just an fyi, according to https://golang.org/doc/effective_go.html#Getters you can drop the Get in getters i.e. GetCount() becomes Count()

Thanks for the package!

How to reset or clear all bits in a cuckoofilter

I want to know how to reset a cuckoofilter, or clear all of its bits, when I run more than one pass of my loop. Thanks.

Efficiency

A cuckoo filter should be memory and CPU efficient.
Using slices for buckets and fingerprints defeats both, because sizeof(slice) == 24 bytes and every slice adds a pointer indirection.

go build of cuckoofilter fails

Go 1.8
Windows

Hello, I've just downloaded cuckoofilter and tried to build it.
I get the error message:

I don't have the package "github.com/dgryski/go-metro".

Elements in the cuckoo filter may be lost when inserting after the filter is full

The Insert method behaves unexpectedly when the filter is full. In general, Insert should return true when it can store the element and false when it cannot. But when the cuckoo filter is full, Insert stores the new element, evicts a random element already in the filter, and still returns false.

I wrote a simple test for this:

func TestExhausted(t *testing.T){
	ckflter := seiflotfy_filter.NewFilter(10000)
	var cached [maxNumForBenchmark]bool
	elementList := make([]int,0)
	var lastElement int

	isFinish := false
	for !isFinish{
		randNum := rand.Intn(maxNumForBenchmark)
		for cached[randNum]{
			randNum = rand.Intn(maxNumForBenchmark)
		}

		finish := ckflter.Insert([]byte(strconv.Itoa(randNum)))
		if !finish{
			t.Logf("Last element is %v",randNum)
			lastElement = randNum
			isFinish = true
			break
		}else{
			cached[randNum] = true
			elementList = append(elementList,randNum)
		}
	}

	for i:=0;i<len(elementList);i++{
		isInside := ckflter.Lookup([]byte(strconv.Itoa(elementList[i])))
		if !isInside{
			t.Errorf("%v should inside but not",elementList[i])
		}
	}
	isInside := ckflter.Lookup([]byte(strconv.Itoa(lastElement)))
	t.Logf("%v should not inside but got %v",lastElement,isInside)
}

The output of code above is

=== RUN   TestExhausted
--- FAIL: TestExhausted (0.00s)
    standardCuckooFilter_test.go:308: Last element is 1776
    standardCuckooFilter_test.go:321: 16055 should inside but not
    standardCuckooFilter_test.go:325: 1776 should not inside but got true
FAIL

Process finished with exit code 1

My solution is to keep a backup array of all buckets in the cuckoo filter and restore it when an insertion fails. But this is costly in the common case where the insertion does not fail.
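
One possibly cheaper alternative to backing up every bucket is to record only the slots touched during the eviction chain and undo them if the insert ultimately fails. The following is a minimal sketch on a toy filter; `toyFilter`, `altIndex`, `maxKicks`, the stand-in hash, and the deterministic victim choice are all illustrative assumptions rather than the library's code.

```go
package main

import "fmt"

const (
	bucketSize = 4
	maxKicks   = 8
)

type bucket [bucketSize]byte // 0 marks an empty slot (assumption)

type toyFilter struct {
	buckets []bucket // length must be a power of two
}

// move remembers what a slot held before an eviction overwrote it.
type move struct {
	idx  uint
	slot int
	old  byte
}

// altIndex is a stand-in for the library's alternate-index function.
func (f *toyFilter) altIndex(fp byte, i uint) uint {
	return (i ^ (uint(fp) * 0x5bd1e995)) & uint(len(f.buckets)-1)
}

// tryPut stores fp in the first empty slot of bucket i, if there is one.
func (f *toyFilter) tryPut(i uint, fp byte) bool {
	for s := range f.buckets[i] {
		if f.buckets[i][s] == 0 {
			f.buckets[i][s] = fp
			return true
		}
	}
	return false
}

// insert places fp starting at bucket i. If the eviction chain runs out
// of kicks, every overwritten slot is restored before returning false.
func (f *toyFilter) insert(fp byte, i uint) bool {
	if f.tryPut(i, fp) || f.tryPut(f.altIndex(fp, i), fp) {
		return true
	}
	var undo []move
	for k := 0; k < maxKicks; k++ {
		slot := k % bucketSize // deterministic victim, for the sketch only
		victim := f.buckets[i][slot]
		undo = append(undo, move{i, slot, victim})
		f.buckets[i][slot] = fp
		fp = victim
		i = f.altIndex(fp, i)
		if f.tryPut(i, fp) {
			return true
		}
	}
	// Restore in reverse order so slots overwritten twice end up correct.
	for k := len(undo) - 1; k >= 0; k-- {
		m := undo[k]
		f.buckets[m.idx][m.slot] = m.old
	}
	return false
}

func main() {
	f := &toyFilter{buckets: make([]bucket, 2)} // 8 slots in total
	for fp := 1; fp <= 9; fp++ {
		before := append([]bucket(nil), f.buckets...)
		if !f.insert(byte(fp), uint(fp)&1) {
			same := true
			for i := range before {
				same = same && before[i] == f.buckets[i]
			}
			fmt.Println("insert failed; contents unchanged:", same)
			return
		}
	}
}
```

The undo log costs at most `maxKicks` small entries per insert and nothing at all when an empty slot is found immediately.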

Memory usage increases by 5 times during encoding

My filter holds 100 million items and uses about 1 GB of memory. During encoding, memory usage climbs to almost 7 GB. What optimizations could be made to encoding and decoding to reduce the intermediate memory consumption?

Misreporting without repetition?

Hello,
I ran some tests on this filter in Go.
Example:

	cf := cuckoofilter.NewCuckooFilter(100000)
	for v := 0; v < 10000 ; v += 1 {
		res := cf.InsertUnique([]byte(string(v)))
		fmt.Println(cf.Count(), res)
	}

I generate elements from 0 up to 10,000 and add each one with InsertUnique; there is no repetition here. But the result shows many reported duplicates and failed insertions.
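
One detail in the snippet above is worth checking: in Go, converting an integer with `string(v)` yields the UTF-8 encoding of the code point `v`, not its decimal text, and recent `go vet` versions flag it. Whether this explains the reported failures is not established here, since distinct code points in this range still produce distinct keys. The sketch below only shows the difference; `keyFromRune` and `keyFromDecimal` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"strconv"
)

// keyFromRune mirrors what string(v) does for an integer v.
func keyFromRune(v int) []byte {
	return []byte(string(rune(v)))
}

// keyFromDecimal produces the decimal text of v, which is usually what
// is meant when numbers are used as filter keys.
func keyFromDecimal(v int) []byte {
	return []byte(strconv.Itoa(v))
}

func main() {
	fmt.Printf("%q %q\n", keyFromRune(65), keyFromDecimal(65)) // "A" "65"
}
```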

Would it be better to support concurrent insert?

Hello,

If we have a huge amount of data to process, we may need a lot of inserts, but I didn't see any mutex in the code. Would it be better to support concurrent inserts, or is there a reason not to?

How to make cuckoofilter automatically scalable?

There is a scalable cuckoo filter in another language:

https://github.com/sile/scalable_cuckoo_filter

How can this be done in Go?

Why force power of 2 for capacity?

Perhaps I am not seeing it in the paper, but why force the capacity up to the next power of 2? When rebuilding as in the scalable filter, this decreases the potential load factor.

Is it safe for concurrent use by default?

Hi, thanks for the contribution.
May I ask whether InsertUnique is safe for concurrent use? Thanks for looking.

I ran into this data race:

==================
WARNING: DATA RACE
Read at 0x00c0b79cac1c by goroutine 276:
  github.com/seiflotfy/cuckoofilter.(*bucket).getFingerprintIndex()
      /Users/wenwei/work/go/src/github.com/balabalabala/vendor/github.com/seiflotfy/cuckoofilter/bucket.go:33 +0x54
  github.com/seiflotfy/cuckoofilter.(*Filter).Lookup()
      /Users/wenwei/work/go/src/github.com/balabala/vendor/github.com/seiflotfy/cuckoofilter/cuckoofilter.go:37 +0xde
  github.com/seiflotfy/cuckoofilter.(*Filter).InsertUnique()
      /Users/wenwei/work/go/src/github.com/balabalavendor/github.com/seiflotfy/cuckoofilter/cuckoofilter.go:74 +0x5a

getAltIndex implementation problem

Given what I know about cuckoo filters, given a fingerprint f there should be two indices i1 and i2 such that i1 ^ f = i2.

The problem is in util.go, here:

func getAltIndex(fp byte, i uint, numBuckets uint) uint {
	hash := uint(metro.Hash64([]byte{fp}, 1337))
	return (i ^ hash) % numBuckets
}

For example, take f = 216, i = 8, and numBuckets = 100 (example taken when data = []byte("hello world"))

Then getAltIndex(216, 8, 100) = 70

One would expect that getAltIndex(216, 70, 100) would produce 8. However, this is not the case:

getAltIndex(216, 70, 100) = 36

I believe the problem is the implementation of getAltIndex. Basically, xor (^) and mod (%) do not commute, and the modulo operation needs to be done first. That is:

func getAltIndex(fp byte, i uint, numBuckets uint) uint {
	hash := (uint(metro.Hash64([]byte{fp}, 1337))) % numBuckets
	return (i % numBuckets) ^ hash
}

This produces the desired result.
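
A caveat worth checking before adopting this: when numBuckets is not a power of two, XOR-ing two values that are each below numBuckets can still produce a result at or above numBuckets, which would index out of range. The sketch below compares the reduce-then-XOR form with a mask-based form that assumes a power-of-two bucket count. `mix` is a stand-in hash used so the example runs without metro, and `altMod`, `altMask`, and `escapes` are hypothetical names.

```go
package main

import "fmt"

// mix is a stand-in for metro.Hash64 (assumption, not the real hash).
func mix(fp byte) uint {
	return uint(fp)*2654435761 + 12345
}

// altMod follows the fix proposed above: reduce first, then XOR.
func altMod(fp byte, i, n uint) uint {
	return (i % n) ^ (mix(fp) % n)
}

// altMask assumes n is a power of two, so masking keeps the result in range.
func altMask(fp byte, i, n uint) uint {
	return (i ^ mix(fp)) & (n - 1)
}

// escapes counts (fp, i) pairs whose alternate index is >= n.
func escapes(alt func(byte, uint, uint) uint, n uint) int {
	count := 0
	for fp := 0; fp < 256; fp++ {
		for i := uint(0); i < n; i++ {
			if alt(byte(fp), i, n) >= n {
				count++
			}
		}
	}
	return count
}

func main() {
	fmt.Println(escapes(altMod, 100) > 0) // true
	fmt.Println(escapes(altMask, 128))    // 0
}
```

With a power-of-two bucket count, the mask form is both in range and its own inverse, which is one plausible reason for rounding the capacity up.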

Great work! Can you use xxh3?

https://github.com/seiflotfy/cuckoofilter/blob/master/util.go

This one instead of metrohash.
https://github.com/zeebo/xxh3

I will wait for your update. It's a great piece of software.

By the way, what size do you recommend passing to NewFilter, e.g. NewFilter(1000000)? And roughly how much more memory is used as this value increases?

Sorry, I'm a bit dense on this cuckoo filter thing. Can you suggest a value for NewFilter?

I would like to perform matching against 16 million IP addresses.

What is a better use case for the 16-bit cuckoo filter by panmari that you mentioned? Wouldn't everyone want a lower false positive rate?
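
A back-of-the-envelope sizing, as a sketch only: it assumes the requested capacity is rounded up to the next power of two (as another issue on this page describes) and that each slot stores a one-byte fingerprint in a flat layout. `slotsFor` is a hypothetical helper, not a library function.

```go
package main

import (
	"fmt"
	"math/bits"
)

// slotsFor rounds capacity up to the next power of two (assumed behavior).
func slotsFor(capacity uint64) uint64 {
	if capacity <= 1 {
		return 1
	}
	return 1 << uint(bits.Len64(capacity-1))
}

func main() {
	const items = 16000000
	slots := slotsFor(items)
	fmt.Println(slots)                              // 16777216
	fmt.Printf("%.2f\n", items/float64(slots))      // 0.95
	fmt.Println(slots/(1<<20), "MiB at 1 byte/slot") // 16 MiB at 1 byte/slot
}
```

Under these assumptions, 16 million entries would sit at roughly 95% occupancy, which is close to the practical limit for 4-slot buckets, so doubling the requested size may be the safer choice.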

Decode([]byte("")) causes panic: runtime error: index out of range [3532051776] with length 0

The following case was found in production use:

func TestService_getInstalledApps(t *testing.T) {
	c, err := cuckoo.Decode([]byte(""))
	assert.Nil(t, err)
	assert.False(t, c.Lookup([]byte("test")))
}

output:

--- FAIL: TestService_getInstalledApps (0.00s)
panic: runtime error: index out of range [3532051776] with length 0 [recovered]
	panic: runtime error: index out of range [3532051776] with length 0

goroutine 19 [running]:
testing.tRunner.func1.2({0x102e36060, 0x140000e4240})
	/usr/local/go/src/testing/testing.go:1209 +0x258
testing.tRunner.func1(0x140000fe680)
	/usr/local/go/src/testing/testing.go:1212 +0x284
panic({0x102e36060, 0x140000e4240})
	/usr/local/go/src/runtime/panic.go:1038 +0x21c
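
A defensive decoder can reject input that would yield a filter with zero buckets, so that a later lookup never indexes into an empty slice. The sketch below assumes a flat encoding of four one-byte fingerprints per bucket; `decodeBuckets` is a hypothetical function and does not reproduce the library's Decode.

```go
package main

import (
	"errors"
	"fmt"
)

const bucketSize = 4

type bucket [bucketSize]byte

// decodeBuckets validates the input length before building any buckets.
func decodeBuckets(b []byte) ([]bucket, error) {
	if len(b) == 0 {
		return nil, errors.New("cuckoo: empty input")
	}
	if len(b)%bucketSize != 0 {
		return nil, fmt.Errorf("cuckoo: length %d is not a multiple of %d", len(b), bucketSize)
	}
	out := make([]bucket, len(b)/bucketSize)
	for i := range out {
		// copy stops after bucketSize bytes because the destination is full.
		copy(out[i][:], b[i*bucketSize:])
	}
	return out, nil
}

func main() {
	_, err := decodeBuckets([]byte(""))
	fmt.Println(err) // cuckoo: empty input
}
```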

Faster insert when the filter is filled

Profiling this package a bit, I found that about 75% of the time spent in Insert goes to calling rand.Intn(bucketSize). This is with an almost empty filter, so I expect it gets worse as the filter fills.

I expect the requirement for randomness here is fairly low (it just draws a number between 0 and 3 so as not to do the same thing each time), so there should be ways to do that much faster, and in particular without the mutex locking that rand.Intn does.
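
One way to get cheap, lock-free slot selection is a small generator owned by each filter instance. The sketch below uses a 32-bit xorshift generator; `xorshift32` and `slot` are hypothetical names, the generator is not cryptographically secure, and it is not safe to share between goroutines without external synchronization.

```go
package main

import "fmt"

const bucketSize = 4

// xorshift32 is a tiny non-cryptographic generator; the state must be non-zero.
type xorshift32 uint32

// next advances the state and returns the new 32-bit value.
func (x *xorshift32) next() uint32 {
	v := uint32(*x)
	v ^= v << 13
	v ^= v >> 17
	v ^= v << 5
	*x = xorshift32(v)
	return v
}

// slot returns an index in [0, bucketSize) taken from the top two bits.
// This shortcut relies on bucketSize being exactly 4.
func (x *xorshift32) slot() int {
	return int(x.next() >> 30)
}

func main() {
	rng := xorshift32(1)
	for i := 0; i < 5; i++ {
		fmt.Print(rng.slot(), " ")
	}
	fmt.Println()
}
```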

Need to add mutex

Multiple parallel Encode, Decode, Count and Insert calls hang due to race conditions.
We should add proper mutex locks.
