cuckoofilter's Introduction

  • 🔭 Work: Axiom
  • 💬 Passion: Data-Structures, DBs, Algorithms

cuckoofilter's People

Contributors

a-h, chessman, codehuntio, glaslos, isites, jerry-vite, jnishikawa-carbonblack, marreck, martinpinto, mholt, panmari, sckelemen, seiflotfy, shabbyrobe, trichner, virrages

cuckoofilter's Issues

PR #36 broke this repo

The patch modified the go.mod file so that it now declares the module as coming from @seven4x, which breaks the import.

go: github.com/seiflotfy/cuckoofilter: github.com/seiflotfy/[email protected]: parsing go.mod:
	module declares its path as: github.com/seven4x/cuckoofilter
	        but was required as: github.com/seiflotfy/cuckoofilter

How to dump/load a cuckoofilter to/from a file?

I have three questions before importing cuckoofilter into my project:

  1. How can I dump a cuckoofilter to a file and load it back, in order to support failover?
  2. What is the recommended ratio of data size to cuckoofilter size?
  3. How can I resize a cuckoofilter for scaling purposes?
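
For the first question, a minimal sketch of the file side only. It assumes the filter can be serialized to a byte slice (other issues on this page mention Encode and Decode functions); saveSnapshot and loadSnapshot are hypothetical helper names, not part of the library.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// saveSnapshot writes data to a temporary file in the target directory and
// then renames it over path, so readers do not observe a half-written file.
func saveSnapshot(path string, data []byte) error {
	tmp, err := os.CreateTemp(filepath.Dir(path), ".snapshot-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // fails harmlessly once the rename has succeeded
	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}

// loadSnapshot reads the bytes back; the caller hands them to the decoder.
func loadSnapshot(path string) ([]byte, error) {
	return os.ReadFile(path)
}

func main() {
	path := filepath.Join(os.TempDir(), "cuckoo.snapshot")
	encoded := []byte{0x01, 0x02, 0x03} // placeholder for the filter's serialized bytes
	if err := saveSnapshot(path, encoded); err != nil {
		fmt.Println("save failed:", err)
		return
	}
	restored, err := loadSnapshot(path)
	fmt.Println(len(restored), err)
}
```

On failover, the restored bytes would be passed to the decoder to rebuild the filter. Items inserted after the last snapshot are not covered by it.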

Reduce repetition in naming

Not a big deal, but cuckoofilter.NewCuckooFilter() is a bit repetitive for my taste -- would you be open to renaming the types/functions to New() and NewDefault() (not sure if NewDefault is really necessary either, if you want to shrink your API surface a bit) and Filter?

In fact, the package could be renamed to just cuckoo and then you'd have cuckoo.NewFilter() and cuckoo.Filter which is a little more natural.

Just my 2 cents!

Adversarial resistance

Are the cuckoo filters implemented in this library resistant to adversarial attempts to induce false positives on chosen elements as described in this paper?

Thanks for reading and thank you for this implementation!

Documenting Thread-safety

It appears that this implementation is not safe for concurrent use, which I didn't see written in the documentation. Additionally, a thread-safe implementation and additional testing would make this solution more appealing.
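
Until the library documents or provides this, callers can serialize access themselves. The sketch below wraps any filter behind a mutex. The filter interface, lockedFilter, and the map-backed setFilter stand-in are all hypothetical names introduced so the example runs without the library.

```go
package main

import (
	"fmt"
	"sync"
)

// filter is the subset of operations the wrapper needs. It is a hypothetical
// interface for this sketch, not a type exported by the library.
type filter interface {
	Insert(data []byte) bool
	Lookup(data []byte) bool
}

// lockedFilter serializes every call to the wrapped filter.
type lockedFilter struct {
	mu sync.Mutex
	f  filter
}

func (l *lockedFilter) Insert(data []byte) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.f.Insert(data)
}

func (l *lockedFilter) Lookup(data []byte) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.f.Lookup(data)
}

// setFilter is an exact, map-backed stand-in used only to make this runnable.
type setFilter map[string]bool

func (s setFilter) Insert(data []byte) bool { s[string(data)] = true; return true }
func (s setFilter) Lookup(data []byte) bool { return s[string(data)] }

func main() {
	lf := &lockedFilter{f: setFilter{}}
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			lf.Insert([]byte{byte(i)})
		}(i)
	}
	wg.Wait()
	fmt.Println(lf.Lookup([]byte{3}))
}
```

A plain sync.Mutex is used rather than sync.RWMutex because the sketch does not assume that lookups leave the underlying filter unmodified.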

../pkg/mod/github.com/seiflotfy/[email protected]/util.go:17:14: invalid operation: 1 << i (shift count type int, must be unsigned integer)

Hi everyone, I am working on a project called infinicache, which uses this filter. The system is composed of a Client, a Proxy, and some Nodes. The client is the component that uses the filter. I modified some code on the proxy side (very small modifications, I just added a field to a structure), and for 2 days now the cuckoo filter has been causing problems. Basically, when I run the client file, I get this.

$ go run client/example/main.go
# github.com/seiflotfy/cuckoofilter
../pkg/mod/github.com/seiflotfy/[email protected]/util.go:17:14: invalid operation: 1 << i (shift count type int, must be unsigned integer)

Given that I am not a Go expert, I do not fully understand what kind of error this is. It clearly refers to cuckoofilter, so it looks like a compile error to me, but I am not sure. I am not able to log anything to the screen when I run the file, which is why I think it is a compile error. However, I haven't seen any change on your master branch in the last 24 days, so the library code should not have changed.

Please help me in understanding why I have this error.

Thanks in advance
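
For context: this is indeed a compile error. Go releases before 1.13 required a shift count to be an unsigned integer, and Go 1.13 removed that restriction, so the message suggests the build is running on an older toolchain than the library code expects. The sketch below shows the form that compiles under both rules; powersOfTwo is a hypothetical function, not the code at util.go:17.

```go
package main

import "fmt"

// powersOfTwo returns 1<<0 through 1<<(n-1). The explicit uint conversion
// keeps the shift valid on toolchains that reject signed shift counts.
func powersOfTwo(n int) []uint64 {
	out := make([]uint64, 0, n)
	for i := 0; i < n; i++ {
		out = append(out, uint64(1)<<uint(i))
	}
	return out
}

func main() {
	fmt.Println(powersOfTwo(4)) // [1 2 4 8]
}
```

Upgrading the Go toolchain to 1.13 or later is the other way out, and it requires no change to the library.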

Warning message when fetching the source locally

Go version: go1.5beta2 darwin/amd64

//util.go
warning: code.google.com is shutting down; import path code.google.com/p/gofarmhash will stop working
warning: package github.com/seiflotfy/cuckoofilter
    imports code.google.com/p/gofarmhash

Please consider using the GitHub-hosted gofarmhash instead.
Thank you for making this project.

Filter is unreliable?

I imported this package into my project, thank you!
But I found the filter unreliable. When I load about 500,000 records from the database and filter them with the InsertUnique method, about 4,000 records return true. Does that mean those records are duplicates?
However, the database table has a unique primary key, and I have confirmed that the data is not duplicated in the database.
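
A cuckoo filter is probabilistic: a lookup can report an item as present when only its fingerprint collides with a stored one. The cuckoo filter paper bounds the false-positive probability by 2b/2^f for buckets of b slots and f-bit fingerprints. The sketch below evaluates that bound assuming b = 4 and f = 8. These values are assumptions taken from hints elsewhere on this page (a byte-sized fingerprint, slots numbered 0 to 3), not confirmed parameters.

```go
package main

import (
	"fmt"
	"math"
)

// falsePositiveBound returns the upper bound 2b/2^f on the probability that
// a lookup for an absent item matches a stored fingerprint.
func falsePositiveBound(bucketSize, fingerprintBits int) float64 {
	return 2 * float64(bucketSize) / math.Pow(2, float64(fingerprintBits))
}

func main() {
	bound := falsePositiveBound(4, 8) // assumed parameters, see above
	observed := 4000.0 / 500000.0     // figures quoted in the report
	fmt.Printf("bound: %.5f\n", bound)
	fmt.Printf("observed: %.5f\n", observed)
}
```

Under these assumptions the reported rate (4,000 out of 500,000, or 0.8%) is below the bound of about 3.1%, which would be consistent with ordinary false positives rather than duplicated data.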

Calculate memory footprint of cuckoo filter

Hello,
How can I calculate the memory footprint of the cuckoo filter I am creating?
I am benchmarking this implementation against a bloom filter here:

What I am interested in is:

  1. Time taken to insert 500,000 strings
  2. Time taken to look up 1,000,000 strings, of which 500,000 are the ones inserted above, so false positives will be checked as well
  3. Memory footprint

So I need to know how much memory the following struct consumes:

type CuckooFilter struct {
	buckets []bucket
	count   uint
}

What I am trying:

func (cf *CuckooFilter) GetSize() uintptr {
	return unsafe.Sizeof(cf.count) + unsafe.Sizeof(cf.buckets) + uintptr(len(cf.buckets)*24) + uintptr(len(cf.buckets)*24*4)
}

It reports about 15 MB, maybe because each slice in Go carries an additional 24 bytes of header.
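
A rough way to reason about the number, as a sketch under stated assumptions: buckets hold 4 slots, each bucket and each fingerprint is its own slice, and allocator overhead is ignored. The function names and the bucket count of 131,072 (500,000 slots rounded up to a power of two, divided by 4) are illustrative guesses, not values read from the library.

```go
package main

import (
	"fmt"
	"unsafe"
)

const slotsPerBucket = 4 // assumed bucket size

// sliceHeader is the size of a slice value itself (pointer, length, capacity).
const sliceHeader = unsafe.Sizeof([]byte(nil))

// sliceLayoutBytes estimates buckets stored as slices of fingerprint slices:
// one header per bucket, one header per slot, plus the fingerprint bytes.
func sliceLayoutBytes(numBuckets, fingerprintBytes uintptr) uintptr {
	perBucket := sliceHeader + slotsPerBucket*(sliceHeader+fingerprintBytes)
	return numBuckets * perBucket
}

// arrayLayoutBytes estimates the same table with fixed-size array buckets,
// which carry no per-bucket or per-slot headers.
func arrayLayoutBytes(numBuckets, fingerprintBytes uintptr) uintptr {
	return numBuckets * slotsPerBucket * fingerprintBytes
}

func main() {
	const numBuckets = 131072 // assumed, see the paragraph above
	fmt.Println("slice layout bytes:", sliceLayoutBytes(numBuckets, 1))
	fmt.Println("array layout bytes:", arrayLayoutBytes(numBuckets, 1))
}
```

On a 64-bit platform the slice header is 24 bytes, so the slice layout comes to roughly 15.5 MiB under these assumptions, against 0.5 MiB for the array layout. That is in line with the 15 MB figure above being dominated by slice headers.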

Efficiency

A cuckoo filter should be memory- and CPU-efficient.
Using slices for buckets and fingerprints undermines both, because sizeof(slice) == 24 bytes and every slice adds a pointer indirection.

go build of cuckoofilter fails

Go 1.8
Windows

Hello, I've just downloaded cuckoofilter and tried to build it.
I get this error message:
[image]
I don't have the package "github.com/dgryski/go-metro".

Elements in the cuckoo filter may be lost when inserting after the filter is full

The Insert method in cuckoofilter behaves unexpectedly when the filter is full. In general, Insert should return true when it can store the element and false when it cannot. But when the cuckoo filter is full, Insert stores the new element and evicts a random one already in the filter, while still returning false.

I wrote a simple test for this:

func TestExhausted(t *testing.T) {
	ckflter := seiflotfy_filter.NewFilter(10000)
	var cached [maxNumForBenchmark]bool
	elementList := make([]int, 0)
	var lastElement int

	// Insert distinct random numbers until Insert reports failure.
	isFinish := false
	for !isFinish {
		randNum := rand.Intn(maxNumForBenchmark)
		for cached[randNum] {
			randNum = rand.Intn(maxNumForBenchmark)
		}

		finish := ckflter.Insert([]byte(strconv.Itoa(randNum)))
		if !finish {
			t.Logf("Last element is %v", randNum)
			lastElement = randNum
			isFinish = true
			break
		} else {
			cached[randNum] = true
			elementList = append(elementList, randNum)
		}
	}

	// Every element whose Insert returned true should still be found.
	for i := 0; i < len(elementList); i++ {
		isInside := ckflter.Lookup([]byte(strconv.Itoa(elementList[i])))
		if !isInside {
			t.Errorf("%v should inside but not", elementList[i])
		}
	}
	// The element whose Insert returned false should not be found.
	isInside := ckflter.Lookup([]byte(strconv.Itoa(lastElement)))
	t.Logf("%v should not inside but got %v", lastElement, isInside)
}

The output of code above is

=== RUN   TestExhausted
--- FAIL: TestExhausted (0.00s)
    standardCuckooFilter_test.go:308: Last element is 1776
    standardCuckooFilter_test.go:321: 16055 should inside but not
    standardCuckooFilter_test.go:325: 1776 should not inside but got true
FAIL

Process finished with exit code 1

My solution is to add a backup array for all buckets in the cuckoo filter and restore from it when an insertion fails. But this is costly in the common case where the insertion does not fail.
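
One cheaper alternative to a full backup array, offered only as a sketch: record each displacement in a small undo log and replay it in reverse if the kick limit is reached. The log is touched only when both candidate buckets are already full. Everything here (toyFilter, the FNV hash standing in for the library's hash, the 500-kick limit) is hypothetical and is not the library's implementation.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
)

const (
	slots    = 4
	maxKicks = 500
)

type bucket [slots]byte // a zero byte marks an empty slot

type toyFilter struct {
	buckets []bucket
	mask    uint // len(buckets)-1; the bucket count must be a power of two
}

// move remembers what a slot held before a kick overwrote it.
type move struct {
	bucket uint
	slot   int
	old    byte
}

func newToyFilter(numBuckets uint) *toyFilter {
	return &toyFilter{buckets: make([]bucket, numBuckets), mask: numBuckets - 1}
}

func hash64(b []byte) uint64 { h := fnv.New64a(); h.Write(b); return h.Sum64() }

func (f *toyFilter) indexAndFP(data []byte) (uint, byte) {
	h := hash64(data)
	fp := byte(h >> 56)
	if fp == 0 {
		fp = 1 // zero is reserved for "empty"
	}
	return uint(h) & f.mask, fp
}

func (f *toyFilter) alt(i uint, fp byte) uint {
	return (i ^ uint(hash64([]byte{fp}))) & f.mask
}

// put stores fp in the first free slot of bucket i, if there is one.
func (f *toyFilter) put(i uint, fp byte) bool {
	for s := range f.buckets[i] {
		if f.buckets[i][s] == 0 {
			f.buckets[i][s] = fp
			return true
		}
	}
	return false
}

// Insert leaves the table exactly as it found it when it returns false.
func (f *toyFilter) Insert(data []byte) bool {
	i, fp := f.indexAndFP(data)
	if f.put(i, fp) || f.put(f.alt(i, fp), fp) {
		return true
	}
	var undo []move
	for k := 0; k < maxKicks; k++ {
		s := rand.Intn(slots)
		undo = append(undo, move{i, s, f.buckets[i][s]})
		fp, f.buckets[i][s] = f.buckets[i][s], fp // evict the old fingerprint
		i = f.alt(i, fp)
		if f.put(i, fp) {
			return true
		}
	}
	for k := len(undo) - 1; k >= 0; k-- { // roll back in reverse order
		f.buckets[undo[k].bucket][undo[k].slot] = undo[k].old
	}
	return false
}

func (f *toyFilter) Lookup(data []byte) bool {
	i1, fp := f.indexAndFP(data)
	for _, i := range []uint{i1, f.alt(i1, fp)} {
		for _, v := range f.buckets[i] {
			if v == fp {
				return true
			}
		}
	}
	return false
}

func main() {
	f := newToyFilter(16)
	inserted := 0
	for n := 0; f.Insert([]byte{byte(n), byte(n >> 8)}); n++ {
		inserted++
	}
	fmt.Println("accepted before the first failure:", inserted)
}
```

With the rollback in place, every element whose Insert returned true remains findable after a failed insert. The rejected element can still be reported as present if its fingerprint collides with a stored one.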

Misreporting without repetition?

Hello,
I ran some tests on this filter in Go, for example:

	cf := cuckoofilter.NewCuckooFilter(100000)
	for v := 0; v < 10000 ; v += 1 {
		res := cf.InsertUnique([]byte(string(v)))
		fmt.Println(cf.Count(), res)
	}

I generate 10,000 elements (0 through 9,999) and pass them to InsertUnique; there are no repeated values. But the result shows many reported duplicates and insertion failures.
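
A side note on the snippet: string(v) applied to an integer yields the UTF-8 encoding of code point v, not its decimal text (go vet flags this conversion). The keys are still distinct for this range, so this alone would not explain the reported duplicates, but the encodings below are more conventional. decimalKey and binaryKey are hypothetical helper names.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"strconv"
)

// decimalKey encodes v as its decimal text, for example 65 becomes "65".
func decimalKey(v int) []byte { return []byte(strconv.Itoa(v)) }

// binaryKey encodes v as a fixed-width 8-byte little-endian value.
func binaryKey(v int) []byte {
	buf := make([]byte, 8)
	binary.LittleEndian.PutUint64(buf, uint64(v))
	return buf
}

func main() {
	fmt.Printf("%q\n", string(rune(65))) // "A": code point 65, not the text "65"
	fmt.Printf("%q\n", decimalKey(65))   // "65"
	fmt.Println(len(binaryKey(65)))      // 8
}
```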

Why force power of 2 for capacity?

Perhaps I am not seeing it in the paper, but why force the capacity up to the next power of 2? When rebuilding as in the scalable filter, this decreases the potential load factor.
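
To put a number on the cost, here is a small sketch that rounds a requested size up to the next power of two and reports how much of the rounded table the request would occupy. nextPow2 and fillAfterRounding are hypothetical helpers, not the library's functions. A common motivation for power-of-two tables (an assumption here, not a statement from the maintainers) is that indices can then be reduced with a bit mask.

```go
package main

import (
	"fmt"
	"math/bits"
)

// nextPow2 rounds n up to the nearest power of two. It is meant for values
// well below 2^63; larger inputs would overflow.
func nextPow2(n uint64) uint64 {
	if n <= 1 {
		return 1
	}
	return uint64(1) << uint(bits.Len64(n-1))
}

// fillAfterRounding returns the fraction of a rounded-up table that the
// requested number of items would occupy.
func fillAfterRounding(items uint64) float64 {
	return float64(items) / float64(nextPow2(items))
}

func main() {
	for _, n := range []uint64{1000000, 1100000} {
		fmt.Printf("%d -> %d (fill %.2f)\n", n, nextPow2(n), fillAfterRounding(n))
	}
}
```

A request just above a power of two, such as 1,100,000, ends up in a table of 2,097,152 and fills only about half of it, which is the load-factor loss the question describes.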

Is it safe for concurrent use by default?

Hi, thanks for the contribution.
May I ask whether InsertUnique is concurrency-safe? Thanks for looking.

I ran into this data race:

==================
WARNING: DATA RACE
Read at 0x00c0b79cac1c by goroutine 276:
  github.com/seiflotfy/cuckoofilter.(*bucket).getFingerprintIndex()
      /Users/wenwei/work/go/src/github.com/balabalabala/vendor/github.com/seiflotfy/cuckoofilter/bucket.go:33 +0x54
  github.com/seiflotfy/cuckoofilter.(*Filter).Lookup()
      /Users/wenwei/work/go/src/github.com/balabala/vendor/github.com/seiflotfy/cuckoofilter/cuckoofilter.go:37 +0xde
  github.com/seiflotfy/cuckoofilter.(*Filter).InsertUnique()
      /Users/wenwei/work/go/src/github.com/balabalavendor/github.com/seiflotfy/cuckoofilter/cuckoofilter.go:74 +0x5a

getAltIndex implementation problem

Given what I know about cuckoo filters, for a fingerprint f there should be two indices i1 and i2 such that i1 ^ hash(f) = i2 and, symmetrically, i2 ^ hash(f) = i1.

The problem is in util.go, here:

func getAltIndex(fp byte, i uint, numBuckets uint) uint {
	hash := uint(metro.Hash64([]byte{fp}, 1337))
	return (i ^ hash) % numBuckets
}

For example, take f = 216, i = 8, and numBuckets = 100 (example taken when data = []byte("hello world"))

Then getAltIndex(216, 8, 100) = 70

One would expect that getAltIndex(216, 70, 100) would produce 8. However, this is not the case:

getAltIndex(216, 70, 100) = 36

I believe the problem is the implementation of getAltIndex. Basically, xor (^) and mod (%) do not commute, and the modulo operation needs to be done first. That is:

func getAltIndex(fp byte, i uint, numBuckets uint) uint {
	hash := (uint(metro.Hash64([]byte{fp}, 1337))) % numBuckets
	return (i % numBuckets) ^ hash
}

This produces the desired result.
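
One caveat with taking the modulo first: when numBuckets is not a power of two, the XOR of two in-range values can land outside the table (with 100 buckets, 99 ^ 28 = 127). A sketch that avoids this assumes a power-of-two bucket count and masks the result. fpHash uses FNV from the standard library as a stand-in for metro.Hash64, and altIndex is a hypothetical name.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// fpHash is a stand-in for the library's metro hash of the fingerprint.
func fpHash(fp byte) uint {
	h := fnv.New64a()
	h.Write([]byte{fp})
	return uint(h.Sum64())
}

// altIndex assumes numBuckets is a power of two, so masking is equivalent to
// the modulo and the result always stays within [0, numBuckets).
func altIndex(fp byte, i, numBuckets uint) uint {
	mask := numBuckets - 1
	return (i ^ fpHash(fp)) & mask
}

func main() {
	const numBuckets = 128
	i1 := uint(8)
	i2 := altIndex(216, i1, numBuckets)
	fmt.Println(altIndex(216, i2, numBuckets) == i1) // true
	fmt.Println(99 ^ 28)                             // 127, out of range for 100 buckets
}
```

Because the mask distributes over XOR, applying altIndex twice returns the starting index for any in-range i.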

Great work! Can you use xxh3?

https://github.com/seiflotfy/cuckoofilter/blob/master/util.go

Could this one be used instead of metrohash?
https://github.com/zeebo/xxh3

I will wait for your update. It's a great piece of software.

  1. By the way, what is the recommended size for NewFilter(1000000)? What value do you suggest, and roughly how much more memory is used as this value increases?

Sorry, I'm a bit dense on this cuckoo filter thing. Can you suggest a value for NewFilter?

I would like to perform matching against 16 million IP addresses.

  2. What is a better use case for the panmari 16-bit cuckoo filter you mentioned? Wouldn't everyone want a lower false-positive rate?

Decode([]byte("")) cause panic: runtime error: index out of range [3532051776] with length 0

The following case was found in production use:

func TestService_getInstalledApps(t *testing.T) {
	c, err := cuckoo.Decode([]byte(""))
	assert.Nil(t, err)
	assert.False(t, c.Lookup([]byte("test")))
}

output:

--- FAIL: TestService_getInstalledApps (0.00s)
panic: runtime error: index out of range [3532051776] with length 0 [recovered]
	panic: runtime error: index out of range [3532051776] with length 0

goroutine 19 [running]:
testing.tRunner.func1.2({0x102e36060, 0x140000e4240})
	/usr/local/go/src/testing/testing.go:1209 +0x258
testing.tRunner.func1(0x140000fe680)
	/usr/local/go/src/testing/testing.go:1212 +0x284
panic({0x102e36060, 0x140000e4240})
	/usr/local/go/src/runtime/panic.go:1038 +0x21c
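
Until the decoder rejects such input itself, a caller-side guard is one workaround. The "length 0" in the panic suggests, though the trace does not confirm it, that the decoded filter ended up with no buckets. The sketch below refuses empty payloads before they reach the decoder; decodeChecked and errEmptyInput are hypothetical names, and the decode argument stands in for the library's Decode.

```go
package main

import (
	"errors"
	"fmt"
)

// errEmptyInput is returned instead of passing an empty payload to the decoder.
var errEmptyInput = errors.New("cuckoo: empty encoded filter")

// decodeChecked calls decode only when data is non-empty.
func decodeChecked[T any](data []byte, decode func([]byte) (T, error)) (T, error) {
	var zero T
	if len(data) == 0 {
		return zero, errEmptyInput
	}
	return decode(data)
}

func main() {
	// fakeDecode stands in for the real decoder so the example is runnable.
	fakeDecode := func(b []byte) (int, error) { return len(b), nil }

	_, err := decodeChecked([]byte(""), fakeDecode)
	fmt.Println(err)

	n, err := decodeChecked([]byte{1, 2, 3, 4}, fakeDecode)
	fmt.Println(n, err)
}
```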

Faster insert when the filter is filled

While profiling this package, I found that about 75% of the time spent in Insert goes to calling rand.Intn(bucketSize). This was with an almost empty filter, so I expect it gets worse as the filter fills up.

I expect the requirement for randomness here is fairly low (it's just drawing a number between 0 and 3 to not do the same thing each time), there should be ways to do that much faster, and especially without mutex locking as rand.Intn has.
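
As an illustration of the kind of replacement this suggests, here is a sketch of a tiny xorshift generator kept alongside the filter state, with no locking. The xorshift32 type and its slot method are hypothetical. It is not safe for concurrent use and not suitable for anything security-sensitive.

```go
package main

import "fmt"

// xorshift32 is a minimal pseudo-random generator; its state must be non-zero.
type xorshift32 uint32

// next advances the generator using the classic 13/17/5 shift triple.
func (x *xorshift32) next() uint32 {
	v := uint32(*x)
	v ^= v << 13
	v ^= v >> 17
	v ^= v << 5
	*x = xorshift32(v)
	return v
}

// slot returns a value in [0, 4) taken from the top two bits of the state.
func (x *xorshift32) slot() int { return int(x.next() >> 30) }

func main() {
	rng := xorshift32(1)
	counts := [4]int{}
	for i := 0; i < 1000; i++ {
		counts[rng.slot()]++
	}
	fmt.Println(counts)
}
```

Another option that stays within the standard library is a per-filter generator created with rand.New(rand.NewSource(seed)), which does not go through the shared, locked source that the report describes for rand.Intn.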

Need to add mutex

Multiple parallel Encode, Decode, Count, and Insert calls hang due to race conditions.
We should add proper mutex locks.
