
cuckoofilter's Introduction

Cuckoo Filter

Overview

A cuckoo filter is a Bloom filter replacement for approximate set-membership queries. While Bloom filters are well-known space-efficient data structures for answering queries like "is item x in the set?", they do not support deletion. Variants that enable deletion (such as counting Bloom filters) usually require much more space.

Cuckoo filters provide the flexibility to add and remove items dynamically. A cuckoo filter is based on cuckoo hashing (hence the name). It is essentially a cuckoo hash table storing each key's fingerprint. Cuckoo hash tables can be highly compact, so a cuckoo filter can use less space than a conventional Bloom filter for applications that require low false positive rates (< 3%).
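
To make the "hash table of fingerprints" idea concrete, here is a minimal sketch of partial-key cuckoo hashing. It is not the library's implementation: the class name ToyCuckooFilter, the Mix function, the 4-slot buckets, and the kick limit are all assumptions chosen for illustration. Each key is reduced to a short fingerprint, and the alternate bucket is derived from the current bucket and the fingerprint alone, so an entry can be relocated without knowing the original key.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Toy cuckoo filter (illustrative only): 4 slots per bucket,
// 12-bit fingerprints, and the value 0 marks an empty slot.
class ToyCuckooFilter {
 public:
  // num_buckets_pow2 must be a power of two so that masking works.
  explicit ToyCuckooFilter(size_t num_buckets_pow2)
      : mask_(num_buckets_pow2 - 1), buckets_(num_buckets_pow2) {}

  bool Add(uint64_t key) {
    uint16_t fp = Fingerprint(key);
    size_t i = Mix(key) & mask_;
    if (Put(i, fp) || Put(AltIndex(i, fp), fp)) return true;
    // Both candidate buckets are full: evict a resident fingerprint
    // and move it to its own alternate bucket, repeatedly.
    for (int kick = 0; kick < kMaxKicks; ++kick) {
      std::swap(fp, buckets_[i][kick % kSlots]);
      i = AltIndex(i, fp);
      if (Put(i, fp)) return true;
    }
    return false;  // Treated as "full"; this toy drops the last evicted entry.
  }

  bool Contain(uint64_t key) const {
    uint16_t fp = Fingerprint(key);
    size_t i = Mix(key) & mask_;
    return Has(i, fp) || Has(AltIndex(i, fp), fp);
  }

  bool Delete(uint64_t key) {
    uint16_t fp = Fingerprint(key);
    size_t i = Mix(key) & mask_;
    return Erase(i, fp) || Erase(AltIndex(i, fp), fp);
  }

 private:
  static constexpr int kSlots = 4;
  static constexpr int kMaxKicks = 500;
  using Bucket = std::array<uint16_t, kSlots>;

  // splitmix64 finalizer, used here as a stand-in hash function.
  static uint64_t Mix(uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
  }
  // Fingerprint in 1..4095, so it never collides with the empty marker.
  static uint16_t Fingerprint(uint64_t key) {
    return static_cast<uint16_t>((Mix(key) >> 32) % 4095 + 1);
  }
  // XOR with a hash of the fingerprint; applying it twice returns i.
  size_t AltIndex(size_t i, uint16_t fp) const { return (i ^ Mix(fp)) & mask_; }

  bool Put(size_t i, uint16_t fp) {
    for (auto& slot : buckets_[i]) {
      if (slot == 0) { slot = fp; return true; }
    }
    return false;
  }
  bool Has(size_t i, uint16_t fp) const {
    for (auto slot : buckets_[i]) {
      if (slot == fp) return true;
    }
    return false;
  }
  bool Erase(size_t i, uint16_t fp) {
    for (auto& slot : buckets_[i]) {
      if (slot == fp) { slot = 0; return true; }
    }
    return false;
  }

  size_t mask_;
  std::vector<Bucket> buckets_;
};
```

Because only fingerprints are stored, two different keys can look identical to the filter, which is where both the false positives and the deletion caveat come from.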

For details about the algorithm, and when citing this work, please use:

"Cuckoo Filter: Practically Better Than Bloom" in Proceedings of ACM CoNEXT 2014 by Bin Fan, Dave Andersen and Michael Kaminsky

API

A cuckoo filter supports the following operations:

  • Add(item): insert an item into the filter
  • Contain(item): return whether the item is already in the filter. Note that, as with Bloom filters, this method may return false positives
  • Delete(item): delete the given item from the filter. To use this method, you must ensure that the item is actually in the filter (e.g., based on records in external storage); otherwise, a different item's entry may be deleted.
  • Size(): return the total number of items currently in the filter
  • SizeInBytes(): return the filter size in bytes

Here is a simple C++ example of basic cuckoo filter usage. More examples can be found in the example/ directory.

// Create a cuckoo filter where each item is of type size_t and
// use 12 bits for each item, with capacity of total_items
CuckooFilter<size_t, 12> filter(total_items);
// Insert item 12 to this cuckoo filter
filter.Add(12);
// Check if previously inserted items are in the filter
assert(filter.Contain(12) == cuckoofilter::Ok);

Repository structure

  • src/: the C++ header and implementation of cuckoo filter
  • example/test.cc: an example of using cuckoo filter
  • benchmarks/: Some benchmarks of speed, space used, and false positive rate

Build

This library depends on the openssl library. Note that on MacOS 10.12, the openssl header files are not available by default. You may need to install openssl and pass the paths of its lib and include directories to gcc, for example:

$ brew install openssl
# Replace 1.0.2j with the actual version of the openssl installed
$ export LDFLAGS="-L/usr/local/Cellar/openssl/1.0.2j/lib"
$ export CFLAGS="-I/usr/local/Cellar/openssl/1.0.2j/include"

To build the example (example/test.cc):

$ make test

To build the benchmarks:

$ cd benchmarks
$ make

Install

To install the cuckoofilter library:

$ make install

By default, the header files will be placed in /usr/local/include/cuckoofilter and the static library at /usr/local/lib/cuckoofilter.a.

Contributing

Contributions via GitHub pull requests are welcome. Please follow the Google C++ style guide. You can use clang-format with the .clang-format file provided in this repository to enforce the style.

Authors

cuckoofilter's People

Contributors

amallia, apc999, arnib, binfan999, dave-andersen, dnbaker, florianjacob, giang-nghg, jbapple-cloudera, mkaminsky, mkuraj, onaips


cuckoofilter's Issues

adding string to cuckoo filter

I'm trying to add strings to a filter from the cuckoofilter library:
https://github.com/efficient/cuckoofilter/blob/master/src/cuckoofilter.h

My code:

cuckoofilter::CuckooFilter<string, 12> cuckoo_mempool(total_items);

but every time I build the code, an error is reported at this line:
https://github.com/efficient/cuckoofilter/blob/master/src/cuckoofilter.h#L68

error: no match for call to ‘(const cuckoofilter::TwoIndependentMultiplyShift) (const std::__cxx11::basic_string<char>&)’
   68 |     const uint64_t hash = hasher_(item);
      |                           ^~~~
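
The error says the default hasher has no call operator that accepts a std::string. One possible workaround, sketched below, is to reduce each string to a 64-bit integer first and build the filter over integers instead. StringHasher is a hypothetical helper, not part of the library, and the choice of 64-bit FNV-1a is only an assumption made for the example; any good string hash would do.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (not part of cuckoofilter): 64-bit FNV-1a over
// the bytes of a std::string, producing a key an integer filter can take.
struct StringHasher {
  uint64_t operator()(const std::string& s) const {
    uint64_t h = 14695981039346656037ULL;  // FNV-1a 64-bit offset basis
    for (unsigned char c : s) {
      h ^= c;
      h *= 1099511628211ULL;  // FNV-1a 64-bit prime
    }
    return h;
  }
};
```

Under this approach the filter would be declared over uint64_t (for example CuckooFilter<uint64_t, 12>) and each call would pass StringHasher{}(line) rather than the string itself. The cost is one extra hashing step per operation.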

Please get rid of SuperFastHash

First, the license on the external superfasthash code I can find isn't GPL, it's a BSD-ish one. But if we're not using it, I'd rather not have it in there, and even hinting that there's GPLed code in the file is twitchy. I suggest that we rip out the compiled hashutil file and separate the original hash functions from other people into their constituent parts, each preserving their own original license unless we reimplement them on our own. I'm happy to do this if we want.

Alternately, we can just rip out everything except our own and TwoIndependentMultiplyShift. I think I'd rather do this.

Problem in Adding char*

I am reading an input file that contains one string per line. I create the filter like this:
CuckooFilter<char*, 16> filter(total_items);
I can add items, but lookups do not find the items I added.
It works fine if I use a static array, e.g. char[31], but it does not work for char* or string.

Test code doesn't seem to work for total_items < 4.

diff --git a/example/test.cc b/example/test.cc
index df869ec..473d7c1 100644
--- a/example/test.cc
+++ b/example/test.cc
@@ -9,7 +9,7 @@
 using cuckoofilter::CuckooFilter;
 
 int main(int argc, char **argv) {
-  size_t total_items = 1000000;
+  size_t total_items = atoi(argv[1]);
 
   // Create a cuckoo filter where each item is of type size_t and
   // use 12 bits for each item:

For total_items = 1, 2, or 3, 'test' segfaults; it works for other values.

$ ./test 1
Segmentation fault (core dumped)
$ ./test 2
Segmentation fault (core dumped)
$ ./test 3
Segmentation fault (core dumped)
$ ./test 4
false positive rate is 0%
$ ./test 100000
false positive rate is 0.135%

PackedTable with bits_per_item 5,7,9 does not work

Hi,

Thanks for writing this cuckoo filter library. I tried to do some benchmarking and found that the filter does not work when bits per item is set to 5, 7, or 9 with the PackedTable configuration. It reports that the filter does not contain an item that was added (a false negative). Is this a known issue? It can be reproduced by taking the test.cc file in the repo and changing line 52 to configure the filter with PackedTable and 5, 7, or 9 bits per item.

Broken build: missing openssl include?

At commit 3785fab

$ make -j
g++ --std=c++11 -fno-strict-aliasing -Wall -c -I. -I./include -I/usr/include/ -I./src/ -g -ggdb src/hashutil.cc -o src/hashutil.o
src/hashutil.cc: In function ‘EVP_MD_CTX* cuckoofilter::EVP_MD_CTX_new()’:
src/hashutil.cc:719:44: error: ‘OPENSSL_zalloc’ was not declared in this scope
    return OPENSSL_zalloc(sizeof(EVP_MD_CTX));
                                            ^
Makefile:27: recipe for target 'src/hashutil.o' failed
make: *** [src/hashutil.o] Error 1

@florianjacob

The possible mistake about the unit of the speed in benchmark

In line 59 of the file "benchmarks\conext-table3.cc", the speed is calculated as "result.speed = (inserted / time) / (1000 * 1000)". Since the unit of time is nanoseconds, the unit of the speed should be "thousand keys/sec", while what the code prints is "million keys/sec". Is it a typo?
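
For reference, here is a small sketch of the unit conversion in question, assuming (as the report states) that the elapsed time is measured in nanoseconds. MillionKeysPerSecond is a hypothetical helper written for this note, not code from the benchmark.

```cpp
#include <cstddef>

// Hypothetical helper: converts a key count and an elapsed time in
// nanoseconds into a throughput in million keys per second.
double MillionKeysPerSecond(size_t keys, double elapsed_ns) {
  // 1 second = 1e9 nanoseconds, so keys/ns * 1e9 = keys/sec.
  double keys_per_second = static_cast<double>(keys) * 1e9 / elapsed_ns;
  return keys_per_second / 1e6;
}
```

With these units, one million keys processed in one second (1e9 ns) comes out as 1.0, which gives a quick sanity check for whatever scale factor the benchmark uses.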

Potential bug in SingleTable

I get a crash in this line.

bytes_per_bucket is 6, so reading a uint64_t from the last bucket reads past the end of the allocation and raises a segfault. Note that I am using a modified version of your code in which all bucket data lives in one long 1D array of char, instead of your 2D layout.

So the workaround I did is adding 8 bytes to what I allocate for that 1D array. Just posting here for the record.
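
As an alternative to padding the allocation, a bounded copy avoids reading past the last bucket at all. This is only a sketch under the assumptions stated in the report (a flat char array and buckets narrower than 8 bytes); LoadBucketBytes is a hypothetical name, and the byte order of the result depends on the platform.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: loads up to 8 bytes of one bucket into a
// zero-initialized uint64_t without touching memory beyond the bucket.
inline uint64_t LoadBucketBytes(const char* bucket, size_t bytes_per_bucket) {
  uint64_t value = 0;
  std::memcpy(&value, bucket, std::min(bytes_per_bucket, sizeof(value)));
  return value;
}
```

Padding the array, as described above, keeps the single 8-byte load and is likely faster; the bounded copy trades a little speed for not depending on the extra bytes being there.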

Using SipHash ?

First of all, please excuse my possibly ignorant question as I'm not a CS guy - more of an engineer, and so I don't fully understand your academic paper.

I see that the class can be instantiated with any of the hashing functions and that by default you use a TwoIndependentMultiplyShift which appears to be an extremely efficient Universal Hashing algorithm.

What do you think of using SipHash? Do you think it would have a positive or negative impact on the effectiveness of the Cuckoo filter, its performance and overall properties?

Tests fail for all tested values other than 12 (Single) 13 (Packed)

I decided to test bits per item besides 12 (single) and 13 (packed) (number used in example/test.cc, number suggested for packed table in example/test.cc).

I replaced the assert for .Contains with the following after the assertion failed:

    // Check if previously inserted items are in the filter, expected
    // true for all items
    size_t n_found(0);
    for (size_t i = 0; i < num_inserted; i++) {
        n_found += (filter.Contain(i) == cuckoofilter::Ok);
    }

I then added the following output below:

    // Output the measured true positive rate
    std::cout << "true positive rate is "
              << 100.0 * n_found / total_items
              << "%\n";
    // Outputs 0% true positive for Single Table [6,10,11,13,14,15,24]
    // Outputs 0% true positive for Packed Table [10,11,12,14,15]

I added a check to filter_.Add, and it's returning cuckoofilter::Ok every time. The problem seems to be in testing.

Sorry for the trouble!

[Question]false positive rate

Hello everyone, I am doing some research on the performance of the Bloom filter, cuckoo filter, simd-block, etc. When I tried to compare them, I found that the false positive rate of the cuckoo filter is not mentioned. What is the false positive rate of the cuckoo filter, and how is it defined?
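
As a rough guide, the paper cited in the README bounds the false positive rate by 1 - (1 - 1/2^f)^(2b), which is approximately 2b / 2^f, where f is the fingerprint size in bits and b is the number of slots per bucket. The sketch below evaluates that bound. The function name is hypothetical, and b = 4 in the usage note is an assumption about the table configuration, not something stated in this thread.

```cpp
#include <cmath>

// Hypothetical helper: upper bound on the false positive rate of a
// cuckoo filter with f-bit fingerprints and b slots per bucket.
// A lookup compares against at most 2*b stored fingerprints, each of
// which matches a random fingerprint with probability 1/2^f.
double FalsePositiveUpperBound(unsigned bits_per_fingerprint,
                               unsigned slots_per_bucket) {
  double no_match_one_slot =
      1.0 - std::ldexp(1.0, -static_cast<int>(bits_per_fingerprint));
  return 1.0 - std::pow(no_match_one_slot, 2.0 * slots_per_bucket);
}
```

For f = 12 and b = 4 this gives a bound just under 0.2%. A filter that is not completely full has fewer occupied slots to collide with, so measured rates are expected to be lower than the bound.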

BFS for Insert path

We should think about bringing in the BFS-based insert from libcuckoo. It permits more parallel memory accesses & work on the insert path and might give us a nice speed boost.

Barring that, I wonder if it's helpful to try to check the first two buckets in a more explicitly parallel fashion. Might not be, but we can pretty quickly determine if they're full.

Does the misjudgment rate equation work?

With an expected number of inserts of 2000000, the Murmur3_128 hash (as in Guava's BloomFilter), and the default fpp = 0.01, your implementation does not seem to behave as expected. Can you explain how the equation in the Utils.getBitsPerItemForFpRate method is derived?
Code:

@Test
public void testInsert() {

    CuckooFilter filter = new CuckooFilter.Builder(Funnels.integerFunnel(), 2000000).withHashAlgorithm(Algorithm.Murmur3_128).build();
    //BloomFilter filter = BloomFilter.create(Funnels.integerFunnel(),2000000,0.01);
    for (int i = 6000000; i < 8000000; i++) {
        filter.put(i);
    }
    Stopwatch stopwatch = Stopwatch.createStarted();
    int count = 0;
    for (int i = 0; i < 1000000; i++) {
        count += filter.mightContain(i) ? 1 : 0;
    }
    System.out.println("misjudgment num: " + count + " use:" + stopwatch.elapsed(TimeUnit.MILLISECONDS));
}

Result:
misjudgment num: 29242 use:480

About “delete item”

I see your note that, when deleting an item, one must be sure the item is actually in the filter.
May I ask whether this is a requirement of the cuckoo filter design itself, i.e. must an element be present in the filter before it can be deleted?
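
A toy model may help show why the README attaches that condition to Delete. A filter stores only fingerprints, so two keys that share a bucket and a fingerprint are indistinguishable; deleting one that was never inserted removes the other's entry. The ToyFingerprintStore below is purely illustrative: its deliberately weak Locate function and its single candidate bucket are assumptions made so that a collision is easy to construct.

```cpp
#include <cstdint>
#include <set>
#include <utility>

// Illustrative model only: stores (bucket, fingerprint) pairs, which is
// all a fingerprint-based filter remembers about a key.
struct ToyFingerprintStore {
  std::multiset<std::pair<uint32_t, uint32_t>> slots;

  // Deliberately weak mapping: 8 buckets, 4-bit fingerprints.
  static std::pair<uint32_t, uint32_t> Locate(uint32_t key) {
    return {(key / 16) % 8, key % 16};
  }
  void Add(uint32_t key) { slots.insert(Locate(key)); }
  bool Contain(uint32_t key) const { return slots.count(Locate(key)) > 0; }
  // Removes ONE matching entry, whichever key originally put it there.
  bool Delete(uint32_t key) {
    auto it = slots.find(Locate(key));
    if (it == slots.end()) return false;
    slots.erase(it);
    return true;
  }
};
```

In this model keys 5 and 133 map to the same bucket and fingerprint. After Add(5), Contain(133) is a false positive, and Delete(133) succeeds and removes the only entry, so Contain(5) then returns false even though 5 was never deleted.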

__m256i/ _mm256_or_si256/ _mm256_testc_si256

I get an error when compiling your code. It seems to be caused by an immintrin.h file that lacks definitions of __m256i, _mm256_or_si256, and _mm256_testc_si256. I checked the file on my system (Intel(R) Core(TM) i3-3240 CPU @3.40GHz, centos6.9, gcc4.8.2) and it really does not define them. I am confused about what the problem is. Here are the error messages:
messages.zip

Mind if I rewrite to use SIPhash and speed-optimize?

Hi, Bin - do you mind if I rewrite the internals to use SIPhash as the default hash function, and add a few speed optimizations and a benchmark? I'd like to try to make this even more attractive for people to use out of the box.

Problem in Adding __int128

I am creating the filter like this :
CuckooFilter<__int128, 15> filter(total_items);
Every "fingerprint" is 15 bits.
But when I insert items into this filter, it can only hold 2 items.

There must be something wrong. I need help!
Thanks.

[Question]how to test speed of "Contain" by -O3 -march=native?

I use the code below to measure the speed of looking up alien items, but the measured speed is implausibly high. Is the loop being optimized away by the compiler? What should I do?

start = chrono::steady_clock::now();
for (size_t i = 0; i < num_inserted; ++i) {
    filter.Contain(aliens[i]);
}
end = chrono::steady_clock::now();
find_time = end - start;
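
When the result of each lookup is discarded, an optimizing compiler is free to remove work whose result is never observed. A common remedy is to accumulate the results and make the total visible outside the loop. The sketch below assumes nothing about the filter beyond a callable that returns true or false; TimeLookupsNs and its parameters are hypothetical names.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: times `contains(key)` over all keys and reports
// the number of hits through hits_out, so the lookups have an
// observable result and cannot be dropped as dead code.
template <typename Pred>
double TimeLookupsNs(const std::vector<uint64_t>& keys, Pred contains,
                     size_t* hits_out) {
  auto start = std::chrono::steady_clock::now();
  size_t hits = 0;
  for (uint64_t key : keys) {
    hits += contains(key) ? 1 : 0;
  }
  auto end = std::chrono::steady_clock::now();
  volatile size_t sink = hits;  // Force the count to be materialized.
  *hits_out = sink;
  return std::chrono::duration<double, std::nano>(end - start).count();
}
```

With the filter from the question, the predicate could be a lambda that returns filter.Contain(k) == cuckoofilter::Ok. For alien items the reported hit count also doubles as a false positive count.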


Cuckoo filter constructors overestimate insert success probability

Making trees to hold one item per 12.5 bits often fails. For instance, the constructor makes a filter of size 3 mebibytes when asked to construct a filter to hold 2,013,265 items, but this is too ambitious and often fails. In my experience 16,106,127 is even more error-prone as a constructor argument if I expect to actually call Add() that many times.

I suspect the limit on Add()'s kickout rate should not be constant. "An Analysis of Random-Walk Cuckoo Hashing" shows a polylogarithmic upper bound on the length of kickouts in a different but related cuckoo hash table.

[Question] Python Bindings

Hello,

I would like to ask whether anybody has implemented Python bindings for cuckoofilter.

There are libraries available, but they are pure-Python implementations rather than bindings, and they are not as fast or as efficient as an implementation in C or C++.

Cheers!
