Comments (9)
Hi Roy,
thanks for dropping a note. The ability to serialize Bloom filters is increasingly requested. Unfortunately, it doesn't suffice to simply expose the bit vector internals. You also need to remember which hash function and which seed you used in order to fully reconstruct the Bloom filter. In a topic branch I will add the proposed N3980 hash_append functionality. This will not only give users a choice among hash function implementations, but also improve hashing of custom types.
While doing so, I plan to improve support for serialization, which is currently lacking. It will very likely be via free functions, however, and not member functions. I'd like to stay API-compatible with Boost Serialization:
template <class Processor>
void serialize(Processor& proc, basic_bloom_filter& bf) {
// serialize members here
}
That said, I'm currently pretty backed up with other projects and cannot promise you when this will be available.
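To make the point about the bit vector not being enough more concrete, here is a minimal sketch of a free serialize function in the style shown above. Everything in it (toy_filter, its members, writer, reader, round_trip) is hypothetical and not libbf's API; it only illustrates that the seed and the number of hash functions have to travel with the bits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <vector>

// Hypothetical stand-in for a Bloom filter's state (not libbf's layout).
struct toy_filter {
  std::uint64_t seed = 0;          // seed of the hash function
  std::uint32_t k = 0;             // number of hash functions
  std::vector<std::uint8_t> bits;  // one byte per bit, for simplicity
};

// Processor that writes each field to a stream in host byte order.
struct writer {
  std::ostream& out;
  void operator()(std::uint64_t& x) { put(x); }
  void operator()(std::uint32_t& x) { put(x); }
  void operator()(std::vector<std::uint8_t>& xs) {
    std::uint64_t n = xs.size();
    put(n);  // length prefix
    out.write(reinterpret_cast<const char*>(xs.data()),
              static_cast<std::streamsize>(n));
  }
  template <class T>
  void put(const T& x) {
    out.write(reinterpret_cast<const char*>(&x), sizeof(T));
  }
};

// Processor that reads the fields back in the same order.
struct reader {
  std::istream& in;
  void operator()(std::uint64_t& x) { get(x); }
  void operator()(std::uint32_t& x) { get(x); }
  void operator()(std::vector<std::uint8_t>& xs) {
    std::uint64_t n = 0;
    get(n);
    xs.resize(static_cast<std::size_t>(n));
    in.read(reinterpret_cast<char*>(xs.data()),
            static_cast<std::streamsize>(n));
  }
  template <class T>
  void get(T& x) {
    in.read(reinterpret_cast<char*>(&x), sizeof(T));
  }
};

// One free function describes the state for both directions.
template <class Processor>
void serialize(Processor& proc, toy_filter& bf) {
  proc(bf.seed);
  proc(bf.k);
  proc(bf.bits);
}

// Usage: write a filter to an in-memory buffer and read it back.
inline toy_filter round_trip(toy_filter original) {
  std::stringstream buffer;
  writer w{buffer};
  serialize(w, original);
  toy_filter copy;
  reader r{buffer};
  serialize(r, copy);
  assert(buffer.good());
  return copy;
}
```

Note that the sketch writes integers in host byte order, so the output is only guaranteed to be readable on a machine with the same endianness; a portable format would have to fix the byte order.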
from libbf.
Thank you for the response.
I have tried to read N3980, but it is way beyond my knowledge.
From what I have seen, if you always use the same random number in the initialization, such portability is doable... as long as you stay on the same machine.
If you think it is a possibility, I'll test between different machines.
BTW, I'll try to read the doc again after some sleep!
This would be a very useful feature of this excellent library. Any news on that?
Thanks!
Unfortunately I'm lacking the cycles to pursue this myself at the moment, but I'm happy to supervise contributions.
@mavam do we want to use Boost::serialization here?
If not, let's clarify whether we want binary serialization or human-readable serialization with operator>> and operator<< overloads.
@mavam do we want to use Boost::serialization here?
Adding a Boost dependency just for serialization would be overkill. I would like to keep the dependencies as minimal as possible: CMake plus a C++11 compiler. Actually, I think we can bump the requirement to C++14, since most compilers have a solid implementation by now. C++17 would be fun, but it's too cutting edge and we don't really reap the benefits in this library.
If not, let's clarify whether we want binary serialization or human-readable serialization with operator>> and operator<< overloads.
Using shift operators is indeed the most common model:
std::istream& is = std::cin; // any input stream; std::istream has no default constructor
bloom_filter bf;
is >> bf; // throws exception on failure?
As hinted in the comment, the error handling is a bit awkward. So, let's take one step back and think about an introspection framework that we can then use to generate those overloads where needed. We have something really neat in CAF: http://actor-framework.readthedocs.io/en/stable/TypeInspection.html. A simple version of this (without annotations) would be a good fit, in my opinion. This would mean that all we need is to write one function per serializable Bloom filter BF (and all dependent types, like hashers, transitively):
template <class Inspector>
auto inspect(Inspector& f, BF& bf) {
return f(bf.x, bf.y, bf.z); // x, y, z represent the serializable state
}
Then, we can use this introspection API to support I/O stream serialization, simple string serialization, or whatever we want. The main advantage is that we can reuse the same mechanism for the hashable concept: a type only needs to provide an inspect function and becomes both serializable and hashable. (This is how I designed the concepts in VAST, FWIW.)
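As a rough illustration of how one inspect function could serve both purposes, here is a minimal sketch. It is not CAF's or VAST's actual API: toy_hasher, byte_sink, and fnv1a_inspector are hypothetical names, and the return-type deduction on inspect assumes C++14.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical hasher state; field names follow the thread (k, seed, double_hashing).
struct toy_hasher {
  std::uint32_t k = 0;
  std::uint64_t seed = 0;
  bool double_hashing = false;
};

// The single customization point: hand every serializable field to the inspector.
template <class Inspector>
auto inspect(Inspector& f, toy_hasher& h) {
  return f(h.k, h.seed, h.double_hashing);
}

// Inspector 1: appends the raw bytes of each field to a string (serialization).
struct byte_sink {
  std::string bytes;
  template <class... Ts>
  bool operator()(const Ts&... xs) {
    int unused[] = {0, (append(xs), 0)...};  // visit the fields left to right
    (void)unused;
    return true;
  }
  template <class T>
  void append(const T& x) {
    bytes.append(reinterpret_cast<const char*>(&x), sizeof(T));
  }
};

// Inspector 2: folds the bytes of each field into a 64-bit FNV-1a digest (hashing).
struct fnv1a_inspector {
  std::uint64_t state = 14695981039346656037ULL;  // FNV offset basis
  template <class... Ts>
  std::uint64_t operator()(const Ts&... xs) {
    int unused[] = {0, (mix(xs), 0)...};
    (void)unused;
    return state;
  }
  template <class T>
  void mix(const T& x) {
    auto p = reinterpret_cast<const unsigned char*>(&x);
    for (std::size_t i = 0; i < sizeof(T); ++i) {
      state ^= p[i];
      state *= 1099511628211ULL;  // FNV prime
    }
  }
};
```

The type itself only lists its fields once; whether that list is turned into bytes or into a digest is decided entirely by the inspector passed in.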
Adding a Boost dependency just for serialization would be overkill. I would like to keep the dependencies as minimal as possible: CMake plus a C++11 compiler. Actually, I think we can bump the requirement to C++14, since most compilers have a solid implementation by now. C++17 would be fun, but it's too cutting edge and we don't really reap the benefits in this library.
Totally agree that Boost is overkill just for this, but if you consider all the other places where we could use it, then it might make sense to have it. But let's try to proceed without it for now.
I don't agree with C++14: there are many compilers that don't support it (Solaris and AIX are two examples). We might lose stakeholders :)
I will take a look at the framework that you pointed out, but regarding the error handling I think we could set ios_base::failbit with something like:
stream.setstate(ios_base::failbit);
This way, a base Bloom filter could implement the extraction/insertion operators, and every concrete Bloom filter type would specialize a serialization/deserialization method. This may sound less elegant, but it is more pragmatic and postpones the introduction of the framework to the next generation of the library.
EDIT: looking more closely at the framework, I think it is not too complicated. We can probably just go directly with that. Do you have any suggestion on how to serialize a hasher? I was thinking of serializing the values used to generate it (like k, seed and double_hashing).
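For reference, the failbit idea could look roughly like the sketch below: the extraction operator reports bad input through the stream state instead of throwing. The hasher_params type and its text format are hypothetical, chosen only to mirror the values mentioned above (k, seed, double_hashing); they are not part of libbf.

```cpp
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>

// Hypothetical set of values from which a hasher could be rebuilt.
struct hasher_params {
  std::uint32_t k = 0;
  std::uint64_t seed = 0;
  bool double_hashing = false;
};

// Human-readable form: "k seed flag", separated by spaces.
inline std::ostream& operator<<(std::ostream& os, const hasher_params& p) {
  return os << p.k << ' ' << p.seed << ' ' << (p.double_hashing ? 1 : 0);
}

inline std::istream& operator>>(std::istream& is, hasher_params& p) {
  hasher_params tmp;
  int flag = 0;
  // If any field fails to parse, the stream sets failbit on its own.
  if (is >> tmp.k >> tmp.seed >> flag) {
    if (tmp.k == 0 || (flag != 0 && flag != 1)) {
      // Well-formed numbers but invalid values: signal failure explicitly.
      is.setstate(std::ios_base::failbit);
    } else {
      tmp.double_hashing = (flag == 1);
      p = tmp;  // commit only after everything was validated
    }
  }
  return is;
}
```

A caller then checks the stream as usual, e.g. `if (!(in >> params)) { /* handle error */ }`, and can still opt into exceptions via `in.exceptions(std::ios_base::failbit)`.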
I don't agree with C++14: there are many compilers that don't support it (Solaris and AIX are two examples). We might lose stakeholders :)
Sticking with C++11 is fine by me if we're going the simple route via overloading the shift operators. If we went for something fancier, like the introspection concept I proposed, then C++11 is a bit bulky. I agree that starting with a simple approach is the right middle-ground to get started.
Do you have any suggestion on how to serialize a hasher? I was thinking of serializing the values used to generate it (like k, seed and double_hashing).
Exactly.
Regarding the API, we have some design options. The low-hanging fruit would be to serialize each T with an overload of this form:
template <class Char, class Traits>
std::basic_ostream<Char, Traits>& operator<<(std::basic_ostream<Char, Traits>& os, const T& x) {
  serialize(os, x); // implementation: write x to the stream
  return os;
}
template <class Char, class Traits>
std::basic_istream<Char, Traits>& operator>>(std::basic_istream<Char, Traits>& is, T& x) {
  deserialize(is, x); // implementation: read x from the stream
  return is;
}
This is technically the way to parse and print custom types, but we would use it for binary serialization by writing to and reading from the underlying stream buffer. A downside would be that it's now possible to print a type to cout and get gibberish back. But that's the price we pay if we want an interface that works like this:
bloom_filter x;
std::ofstream file{...};
file << x;
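A minimal sketch of what "writing to and reading from the underlying stream buffer" could mean in practice follows. It is restricted to char-based streams for brevity, and toy_bits with its length-prefixed layout is a hypothetical stand-in, not libbf's bit vector or an agreed-upon format.

```cpp
#include <cstddef>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <vector>

// Hypothetical payload: the raw bytes backing a filter.
struct toy_bits {
  std::vector<std::uint8_t> bytes;
};

// Binary output: 8-byte length prefix (host byte order), then the payload.
inline std::ostream& operator<<(std::ostream& os, const toy_bits& x) {
  std::ostream::sentry guard{os};
  if (!guard) return os;
  std::uint64_t n = x.bytes.size();
  auto* buf = os.rdbuf();
  auto header = static_cast<std::streamsize>(sizeof(n));
  auto body = static_cast<std::streamsize>(n);
  if (buf->sputn(reinterpret_cast<const char*>(&n), header) != header ||
      buf->sputn(reinterpret_cast<const char*>(x.bytes.data()), body) != body)
    os.setstate(std::ios_base::badbit);  // short write
  return os;
}

// Binary input: reads the prefix, then exactly that many bytes.
inline std::istream& operator>>(std::istream& is, toy_bits& x) {
  std::istream::sentry guard{is, true};  // true: do not skip whitespace
  if (!guard) return is;
  std::uint64_t n = 0;
  auto* buf = is.rdbuf();
  auto header = static_cast<std::streamsize>(sizeof(n));
  if (buf->sgetn(reinterpret_cast<char*>(&n), header) != header) {
    is.setstate(std::ios_base::failbit);
    return is;
  }
  std::vector<std::uint8_t> tmp(static_cast<std::size_t>(n));
  auto body = static_cast<std::streamsize>(n);
  if (buf->sgetn(reinterpret_cast<char*>(tmp.data()), body) != body) {
    is.setstate(std::ios_base::failbit);  // truncated input
    return is;
  }
  x.bytes.swap(tmp);  // leave x untouched unless the whole read succeeded
  return is;
}
```

The sketch trusts the length prefix; real code would want to bound it before allocating, and files should be opened with std::ios::binary so that no newline translation corrupts the bytes.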
Hi @mavam
I'd like to give my vote to the serialization as well. Serialization is really necessary in order to use libbf in production with big data: hoping that the filter can reside in memory is an option only for a very limited set of use cases.
Since the last message was in 2017, do you have a design for how you would approach serialization now, at the end of 2018? Maybe you could write it down here, and after a discussion, I might have some time to implement it. (Please recall from the previous messages on this issue that hash functions need to be serialized as well, so it needs to be an approach that works for both the filter and the hash functions.)
I personally prefer the more readable and maintainable, though also more verbose, approach to serialization used by protocol buffers (https://developers.google.com/protocol-buffers/docs/cpptutorial):
bool SerializeToString(string* output) const: serializes the message and stores the bytes in the given string. Note that the bytes are binary, not text; we only use the string class as a convenient container.
bool ParseFromString(const string& data): parses a message from the given string.
bool SerializeToOstream(ostream* output) const: writes the message to the given C++ ostream.
bool ParseFromIstream(istream* input): parses a message from the given C++ istream.
These can be written as an interface, i.e., a pure abstract class that the filter classes and hash functions implement.
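A minimal sketch of such an interface is shown below. The class names serializable and hasher_state are hypothetical; only the four method names are taken from the protocol buffers tutorial. For brevity the stream variants get default implementations on top of the string variants, which deviates from a strictly pure abstract class.

```cpp
#include <cstdint>
#include <cstring>
#include <istream>
#include <iterator>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical interface that filters and hash functions would implement.
class serializable {
public:
  virtual ~serializable() = default;
  virtual bool SerializeToString(std::string* output) const = 0;
  virtual bool ParseFromString(const std::string& data) = 0;

  // Stream variants, expressed in terms of the string variants.
  bool SerializeToOstream(std::ostream* output) const {
    std::string bytes;
    if (!SerializeToString(&bytes)) return false;
    output->write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
    return output->good();
  }
  bool ParseFromIstream(std::istream* input) {
    // Consumes the rest of the stream.
    std::string bytes((std::istreambuf_iterator<char>(*input)),
                      std::istreambuf_iterator<char>());
    return ParseFromString(bytes);
  }
};

// Hypothetical hash function state implementing the interface.
class hasher_state final : public serializable {
public:
  std::uint32_t k = 0;
  std::uint64_t seed = 0;

  bool SerializeToString(std::string* output) const override {
    output->assign(reinterpret_cast<const char*>(&k), sizeof(k));
    output->append(reinterpret_cast<const char*>(&seed), sizeof(seed));
    return true;
  }
  bool ParseFromString(const std::string& data) override {
    if (data.size() != sizeof(k) + sizeof(seed)) return false;  // reject bad sizes
    std::memcpy(&k, data.data(), sizeof(k));
    std::memcpy(&seed, data.data() + sizeof(k), sizeof(seed));
    return true;
  }
};
```

A filter class could implement the same interface and embed the bytes of its hash function in its own output, which would cover the requirement that both get serialized together.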