Export the filter (libbf, OPEN)

mavam commented on July 23, 2024
Export the filter

from libbf.

Comments (9)

mavam commented on July 23, 2024

Hi Roy,

thanks for dropping a note. The ability to serialize Bloom filters is increasingly requested. Unfortunately, it doesn't suffice to simply expose the bit vector internals. You also need to remember which hash function and which seed you used in order to fully reconstruct the Bloom filter. In a topic branch I will add the proposed N3980 hash_append functionality. This will not only give users a choice among other hash function implementations, but also improve hashing of custom types.

While doing so, I plan to improve support for serialization, which is currently lacking. It will very likely be via free functions, however, and not member functions. I'd like to stay API-compatible with Boost Serialization:

template <class Processor>
void serialize(Processor& proc, basic_bloom_filter& bf) {
  // serialize members here
}
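
A minimal sketch of how such a free function could look, under stated assumptions: toy_filter, writer, and reader are hypothetical stand-ins (not libbf or Boost types), and the processors only mimic the `ar & member` style of Boost Serialization. The point it illustrates is the one made above: the seed and the number of hash functions travel together with the bits.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for a Bloom filter; not the libbf class.
struct toy_filter {
  std::uint64_t seed = 0;          // hasher seed
  std::uint32_t k = 0;             // number of hash functions
  std::vector<std::uint8_t> bits;  // one byte per cell, for simplicity
};

// Toy processor that writes raw, host-endian bytes (not portable across machines).
struct writer {
  std::ostream& out;
  template <class T>
  writer& operator&(T& x) {
    static_assert(std::is_trivially_copyable<T>::value, "raw copy only");
    out.write(reinterpret_cast<const char*>(&x), sizeof x);
    return *this;
  }
  writer& operator&(std::vector<std::uint8_t>& v) {
    std::uint64_t n = v.size();  // length prefix
    *this & n;
    out.write(reinterpret_cast<const char*>(v.data()),
              static_cast<std::streamsize>(n));
    return *this;
  }
};

// Toy processor that reads the fields back in the same order.
struct reader {
  std::istream& in;
  template <class T>
  reader& operator&(T& x) {
    static_assert(std::is_trivially_copyable<T>::value, "raw copy only");
    in.read(reinterpret_cast<char*>(&x), sizeof x);
    return *this;
  }
  reader& operator&(std::vector<std::uint8_t>& v) {
    std::uint64_t n = 0;
    *this & n;
    v.resize(static_cast<std::size_t>(n));  // a real reader should bound n
    in.read(reinterpret_cast<char*>(v.data()),
            static_cast<std::streamsize>(n));
    return *this;
  }
};

// One free function describes the state for both directions.
template <class Processor>
void serialize(Processor& proc, toy_filter& bf) {
  proc & bf.seed & bf.k & bf.bits;
}

// Writes a filter to a string stream and reads it back.
void round_trip_demo() {
  toy_filter a;
  a.seed = 42;
  a.k = 3;
  a.bits = {1, 0, 1, 1};
  std::stringstream ss;
  writer w{ss};
  serialize(w, a);
  toy_filter b;
  reader r{ss};
  serialize(r, b);
  assert(b.seed == a.seed && b.k == a.k && b.bits == a.bits);
}
```

Because the same serialize function drives both processors, the read and write order cannot drift apart.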

That said, I'm currently pretty backed up with other projects and cannot promise you when this will be available.

RoyBellingan commented on July 23, 2024

Thank you for the response.
I have tried to read N3980, but it is way beyond my knowledge.

From what I have seen, if you always use the same random number in the initialization, such portability is doable... as long as you stay on the same machine.
If you think it is a possibility, I'll test between different machines.

BTW, I'll try to read the doc again after some sleep!

caetanosauer commented on July 23, 2024

This would be a very useful feature of this excellent library. Any news on that?

Thanks!

mavam commented on July 23, 2024

Unfortunately I'm lacking the cycles to pursue this myself at the moment, but I'm happy to supervise contributions.

amallia commented on July 23, 2024

@mavam do we want to use Boost::serialization here?
If not, let's clarify whether we want binary serialization or human-readable serialization with operator>> and operator<< overloads.

mavam commented on July 23, 2024

@mavam do we want to use Boost::serialization here?

Adding a Boost dependency just for serialization would be overkill. I would like to keep the dependencies as minimal as possible: CMake plus a C++11 compiler. Actually, I think we can bump the requirement to C++14, since most compilers have a solid implementation by now. C++17 would be fun, but it's too cutting edge and we don't really reap the benefits in this library.

If not, let's clarify whether we want binary serialization or human-readable serialization with operator>> and operator<< overloads.

Using shift operators is indeed the most common model:

std::ifstream is{...}; // std::istream itself cannot be default-constructed
bloom_filter bf;
is >> bf; // throws exception on failure?

As hinted in the comment, the error handling is a bit awkward. So, let's take one step back and think about an introspection framework that we can then use to generate those overloads where needed. We have something really neat in CAF: http://actor-framework.readthedocs.io/en/stable/TypeInspection.html. A simple version of this (without annotations) would be a good fit, in my opinion. This would mean that all we need is to write one function per serializable Bloom filter BF (and all dependent types, like hashers, transitively):

template <class Inspector>
auto inspect(Inspector& f, BF& bf) {
  return f(bf.x, bf.y, bf.z); // x, y, z represent the serializable state
}

Then, we can use this introspection API to support I/O stream serialization, simple string serialization, or whatever we want. The main advantage is that we can reuse the same mechanism for the hashable concept: a type only needs to provide an inspect function and becomes both serializable and hashable. (This is how I designed the concepts in VAST, FWIW)
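
A rough sketch of that idea, assuming C++14 for the deduced return type. Everything here is hypothetical and far simpler than CAF's type inspection: toy_filter, text_writer, text_reader, and hash_of are illustration names, the text format breaks on empty or whitespace-containing fields, and hashing the textual form is only a shortcut to show that one inspect function can serve both purposes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <sstream>
#include <string>

// Hypothetical filter state; not the libbf class.
struct toy_filter {
  std::uint64_t seed;
  std::uint32_t k;
  std::string cells;  // e.g. "1011", one character per cell
};

// The single function a type provides: hand all state to the inspector.
template <class Inspector>
auto inspect(Inspector& f, toy_filter& bf) {
  return f(bf.seed, bf.k, bf.cells);
}

// Inspector that prints each field followed by a space.
struct text_writer {
  std::ostream& out;
  bool operator()() { return static_cast<bool>(out); }
  template <class T, class... Ts>
  bool operator()(T& x, Ts&... xs) {
    out << x << ' ';
    return (*this)(xs...);
  }
};

// Inspector that reads the fields back in the same order.
struct text_reader {
  std::istream& in;
  bool operator()() { return static_cast<bool>(in); }
  template <class T, class... Ts>
  bool operator()(T& x, Ts&... xs) {
    in >> x;
    return (*this)(xs...);
  }
};

// Hashing reuses inspect: here by hashing the textual form (a shortcut).
std::size_t hash_of(toy_filter& bf) {
  std::ostringstream ss;
  text_writer w{ss};
  inspect(w, bf);
  return std::hash<std::string>{}(ss.str());
}

// Serializes, deserializes, and compares hashes of the two objects.
void inspect_demo() {
  toy_filter a{42, 3, "1011"};
  std::stringstream ss;
  text_writer w{ss};
  bool wrote = inspect(w, a);
  assert(wrote);
  toy_filter b{};
  text_reader r{ss};
  bool read = inspect(r, b);
  assert(read);
  assert(b.seed == 42 && b.k == 3 && b.cells == "1011");
  assert(hash_of(a) == hash_of(b));
}
```

Adding a new output format then means writing a new inspector, not touching the filter types.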

amallia commented on July 23, 2024

Adding a Boost dependency just for serialization would be overkill. I would like to keep the dependencies as minimal as possible: CMake plus a C++11 compiler. Actually, I think we can bump the requirement to C++14, since most compilers have a solid implementation by now. C++17 would be fun, but it's too cutting edge and we don't really reap the benefits in this library.

Totally agree that Boost is overkill just for this, but if you consider all the other places where we could use it, it might make sense to have it. But let's try to proceed without it for now.
I don't agree with C++14, there are so many compilers that don't support it (Solaris and AIX are two examples). We might lose stakeholders :)

I will take a look at the framework you pointed out, but regarding error handling I think we could set ios_base::failbit, with something like:

stream.setstate(ios_base::failbit);

This way, a base bf could implement the extraction/insertion operators, and every type of bf would specialize a serialization/deserialization method. This may sound less elegant, but it is more pragmatic and postpones the introduction of the framework to the next generation of the library.
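
A small sketch of the failbit approach, assuming a hypothetical toy_filter with a textual format (k followed by a string of 0/1 cells). It is not the libbf API. The operator parses into a temporary, so a failed read leaves the target untouched.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical filter state; not the libbf class.
struct toy_filter {
  std::uint32_t k = 0;  // number of hash functions
  std::string cells;    // "0"/"1" characters, one per cell
};

// Extraction operator that reports malformed input through failbit.
std::istream& operator>>(std::istream& is, toy_filter& bf) {
  toy_filter tmp;
  if (is >> tmp.k >> tmp.cells) {
    const bool valid =
      tmp.k != 0 && tmp.cells.find_first_not_of("01") == std::string::npos;
    if (valid)
      bf = tmp;  // commit only after the whole record parsed
    else
      is.setstate(std::ios_base::failbit);
  }
  return is;  // if extraction itself failed, failbit is already set
}

// Parses one good and one malformed record.
void failbit_demo() {
  toy_filter f;
  std::istringstream good("3 1011");
  good >> f;
  assert(!good.fail() && f.k == 3 && f.cells == "1011");
  std::istringstream bad("3 10x1");
  bad >> f;
  assert(bad.fail());
  assert(f.cells == "1011");  // unchanged after the failed read
}
```

Callers who prefer exceptions can opt in per stream with is.exceptions(std::ios_base::failbit), in which case setstate throws std::ios_base::failure.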

EDIT: Looking more closely at the framework, I think it is not too complicated. We can probably go directly with that. Do you have any suggestion on how to serialize a hasher? I was thinking to serialize the values used to generate it (like k, seed and double_hashing).

mavam commented on July 23, 2024

I don't agree with C++14, there are so many compilers that don't support it (Solaris and AIX are two examples). We might lose stakeholders :)

Sticking with C++11 is fine by me if we're going the simple route via overloading the shift operators. If we went for something fancier, like the introspection concept I proposed, then C++11 is a bit bulky. I agree that starting with a simple approach is the right middle-ground to get started.

Do you have any suggestion on how to serialize a hasher? I was thinking to serialize the values used to generate it (like k, seed and double_hashing).

Exactly.

Regarding the API, we have some design options. The low-hanging fruit would be to serialize each T with an overload of this form:

template <class Char, class Traits>
std::basic_ostream<Char, Traits>& operator<<(std::basic_ostream<Char, Traits>& os, const T& x) {
  serialize(x); // implementation
  return os;
}

template <class Char, class Traits>
std::basic_istream<Char, Traits>& operator>>(std::basic_istream<Char, Traits>& is, T& x) {
  deserialize(x); // implementation
  return is;
}

This is technically the way to parse and print custom types, but we would use it for binary serialization by writing to and reading from the underlying stream buffer. A downside is that it then becomes possible to print a type to cout and get gibberish back. But that's the price we pay if we want an interface that works like this:

bloom_filter x;
std::ofstream file{...};
file << x;
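
A minimal sketch of binary serialization through the shift operators, with assumptions stated: toy_filter, put_u64, and get_u64 are hypothetical names, the operators are written for plain std::ostream/std::istream instead of the templated basic_ streams, and integers are encoded as fixed-width little-endian bytes so the output does not depend on the host byte order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <streambuf>
#include <string>

// Hypothetical filter state; not the libbf class.
struct toy_filter {
  std::uint64_t seed = 0;
  std::string cells;  // raw bytes of the bit vector
};

// Writes v as 8 little-endian bytes directly to the stream buffer.
inline bool put_u64(std::streambuf& sb, std::uint64_t v) {
  char buf[8];
  for (int i = 0; i < 8; ++i)
    buf[i] = static_cast<char>((v >> (8 * i)) & 0xff);
  return sb.sputn(buf, 8) == 8;
}

// Reads 8 little-endian bytes; returns false on short input.
inline bool get_u64(std::streambuf& sb, std::uint64_t& v) {
  char buf[8];
  if (sb.sgetn(buf, 8) != 8)
    return false;
  v = 0;
  for (int i = 0; i < 8; ++i)
    v |= static_cast<std::uint64_t>(static_cast<unsigned char>(buf[i])) << (8 * i);
  return true;
}

std::ostream& operator<<(std::ostream& os, const toy_filter& bf) {
  std::streambuf* sb = os.rdbuf();
  const std::streamsize n = static_cast<std::streamsize>(bf.cells.size());
  if (!sb || !put_u64(*sb, bf.seed) || !put_u64(*sb, bf.cells.size())
      || sb->sputn(bf.cells.data(), n) != n)
    os.setstate(std::ios_base::badbit);
  return os;
}

std::istream& operator>>(std::istream& is, toy_filter& bf) {
  std::streambuf* sb = is.rdbuf();
  toy_filter tmp;
  std::uint64_t n = 0;
  if (sb && get_u64(*sb, tmp.seed) && get_u64(*sb, n)) {
    tmp.cells.resize(static_cast<std::size_t>(n));  // real code should bound n
    const std::streamsize len = static_cast<std::streamsize>(n);
    if (sb->sgetn(&tmp.cells[0], len) == len) {
      bf = tmp;
      return is;
    }
  }
  is.setstate(std::ios_base::failbit);
  return is;
}

// Round-trips a filter whose cells contain non-printable bytes.
void binary_demo() {
  toy_filter a;
  a.seed = 42;
  a.cells = std::string("\x01\x00\xff", 3);
  std::stringstream ss;
  ss << a;
  toy_filter b;
  ss >> b;
  assert(!ss.fail() && b.seed == 42 && b.cells == a.cells);
}
```

For files, the stream would also need to be opened with std::ios::binary so that no newline translation corrupts the bytes.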

mristin commented on July 23, 2024

Hi @mavam
I'd like to give my vote to serialization as well. Serialization is really necessary in order to use libbf in production with big data -- hoping that the filter can reside in memory is an option only for a very limited set of use cases.

Since the last message was in 2017, do you have a design for how you would approach serialization now, at the end of 2018 😄? Maybe you could write it down here, and after a discussion, I might have some time to implement it. (Please recall from the previous messages on this issue that hash functions need to be serialized as well -- so it needs to be an approach that works for both the filter and the hash functions.)

I personally prefer the more readable and maintainable, though also more verbose, approach to serialization used by protocol buffers (https://developers.google.com/protocol-buffers/docs/cpptutorial):

  • bool SerializeToString(string* output) const serializes the message and stores the bytes in the given string. Note that the bytes are binary, not text; we only use the string class as a convenient container.
  • bool ParseFromString(const string& data) parses a message from the given string.
  • bool SerializeToOstream(ostream* output) const writes the message to the given C++ ostream.
  • bool ParseFromIstream(istream* input) parses a message from the given C++ istream.

This can be written as an interface, a pure abstract class that the filter classes and hash functions implement.
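
A sketch of such an interface, under assumptions: serializable and hasher_config are hypothetical names, the four method names are borrowed from the protocol buffers tutorial quoted above, and the payload is whitespace-separated text for brevity where protobuf would emit binary. The string variants are implemented once in the base class on top of the stream variants.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical interface for filters and hash functions alike.
class serializable {
public:
  virtual ~serializable() = default;
  virtual bool SerializeToOstream(std::ostream* output) const = 0;
  virtual bool ParseFromIstream(std::istream* input) = 0;

  // String variants delegate to the stream variants.
  bool SerializeToString(std::string* output) const {
    std::ostringstream ss;
    if (!SerializeToOstream(&ss))
      return false;
    *output = ss.str();
    return true;
  }
  bool ParseFromString(const std::string& data) {
    std::istringstream ss(data);
    return ParseFromIstream(&ss);
  }
};

// Hypothetical hasher parameters (k, seed, double_hashing), as discussed earlier.
class hasher_config : public serializable {
public:
  std::uint32_t k = 0;
  std::uint64_t seed = 0;
  bool double_hashing = false;

  bool SerializeToOstream(std::ostream* output) const override {
    *output << k << ' ' << seed << ' ' << double_hashing;
    return static_cast<bool>(*output);
  }
  bool ParseFromIstream(std::istream* input) override {
    hasher_config tmp;
    if (!(*input >> tmp.k >> tmp.seed >> tmp.double_hashing))
      return false;
    k = tmp.k;  // commit only after a complete parse
    seed = tmp.seed;
    double_hashing = tmp.double_hashing;
    return true;
  }
};

// Round-trips the hasher parameters through a string.
void interface_demo() {
  hasher_config a;
  a.k = 3;
  a.seed = 42;
  a.double_hashing = true;
  std::string bytes;
  bool wrote = a.SerializeToString(&bytes);
  assert(wrote);
  hasher_config b;
  bool parsed = b.ParseFromString(bytes);
  assert(parsed);
  assert(b.k == 3 && b.seed == 42 && b.double_hashing);
}
```

A filter class would implement the same two virtual functions and call into its hasher's implementation, so both are stored together.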
