andreasvc / roaringbitmap
Roaring Bitmap in Cython
Home Page: http://roaringbitmap.readthedocs.io
License: GNU General Public License v2.0
There is a bug when the step of the slice is not 1.
Example:
RoaringBitmap([0, 10])[::2]
# RoaringBitmap({0, 10})
[0, 10][::2]
# [0]
The problem is that here you take the intersection of the original roaring bitmap, i.e. {0, 10}, with range(0, 11, 2).
I doubt any generic method involving an intersection with a range will work, since the slice describes the indices and not the content.
What will work is the following:
def __getitem__(self, i):
    return self.__class__(list(self)[i])
But it might be less efficient in some cases.
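The point that a slice selects positions rather than values can be illustrated without the library. This is a minimal sketch, assuming only that the bitmap iterates its elements in ascending order. `slice_by_position` is a hypothetical helper, not part of roaringbitmap, and it supports only non-negative bounds and a positive step.

```python
from itertools import islice

def slice_by_position(sorted_values, start=None, stop=None, step=None):
    """Return the elements at positions start:stop:step of an ascending iterable.

    Hypothetical helper. Negative bounds and negative steps are not
    supported, because islice cannot count from the end of an iterator.
    """
    return list(islice(iter(sorted_values), start, stop, step))

values = [0, 10]

# Slicing by position: take every second element, starting at position 0.
by_position = slice_by_position(values, step=2)

# Intersecting with a range of values, which is what the report describes.
by_value = sorted(set(values) & set(range(0, 11, 2)))

print(by_position)  # [0]
print(by_value)     # [0, 10]
```

Unlike `list(self)[i]`, this stops iterating once `stop` is reached, which may matter for large bitmaps when only a short prefix is requested.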
In [9]: with open('bug.pickle','rb') as ifile:
   ...:     unp=Unpickler(ifile)
   ...:     d=unp.load()
   ...:
In [10]: len(d)
Out[10]: 544539
In [13]: 544539/4
Out[13]: 136134.75
In [16]: len(d[0:136135])
Out[16]: 136135
In [17]: len(d[136135:136135*2])
Out[17]: 128376
???
Sample data at: https://www.dropbox.com/s/ytwj4fiv786ic6b/bug.pickle?dl=0
I would like to access the RoaringBitmap class from Cython, so I could use it for static typing inside my Cython code.
The first problem I have is that the package only contains a .pyx file referring to the class.
For now I have been trying to import other classes from your package (just for testing), assuming it would work if RoaringBitmap had a dedicated header file.
My current solution is the following: I have a try_extern_cimport.pyx file which contains
from libc.stdint cimport uint64_t  # provides the uint64_t type used below
cdef extern from "roaringbitmap/src/bitcount.h":
    unsigned int bit_clz(uint64_t) nogil
Installing the package via a git clone command inside the current dir and running cythonize try_extern_cimport.pyx actually works, but my point is that this should work after a simple 'pip install'.
I think roaringbitmap.pyx needs a dedicated header file and ideally this header file could be accessible via
cdef extern from "roaringbitmap.h":
If I misunderstand something then sorry in advance, I am not even sure it is a practice to expose the header files of a cython package ...
The RoaringBitmap.clamp() method seems to be broken for certain ranges, e.g.
Setup:
>>> from roaringbitmap import RoaringBitmap
>>> b = RoaringBitmap([1,2,3])
These work as expected:
>>> b.clamp(0, 4)
RoaringBitmap({1, 2, 3})
>>> b.clamp(0, 65535)
RoaringBitmap({1, 2, 3})
These ranges are not behaving consistently with the above:
>>> b.clamp(0, 65536)
RoaringBitmap({})
>>> b.clamp(0, 65537)
RoaringBitmap({})
>>> b.clamp(0, 65538)
RoaringBitmap({1})
>>> b.clamp(0, 65539)
RoaringBitmap({1, 2})
And now we seem to 'wrap around' to expected ranges again:
>>> b.clamp(0, 65540)
RoaringBitmap({1, 2, 3})
If I understand the documentation correctly, all of the above should return a RoaringBitmap({1,2,3}).
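The expected results can be written down as a plain-Python reference. This is a sketch under the assumption that clamp(start, stop) keeps the values in the half-open range [start, stop); `clamp_reference` is a hypothetical name, not a library function. For the inputs in this report, an inclusive upper bound would give the same answers.

```python
def clamp_reference(values, start, stop):
    """Reference behaviour (assumed): keep the values v with start <= v < stop."""
    return {v for v in values if start <= v < stop}

values = {1, 2, 3}
stops = (4, 65535, 65536, 65537, 65538, 65539, 65540)

# Every stop value from the report should leave the bitmap unchanged.
results = {stop: clamp_reference(values, 0, stop) for stop in stops}
for stop, clamped in results.items():
    print(stop, sorted(clamped))
```

Note that 65536 is 2**16, the number of values covered by a single roaring container, so the failing stops sit just at and above a container boundary.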
Not sure if this is universally desirable but would make installation more convenient
Hi rbm,
Is it possible to calculate bulk Jaccard distances across a MultiRoaringBitmap where the query is not already within the MultiRoaringBitmap?
A straightforward way might be:
multi_rb = MultiRoaringBitmap(list_of_indices, filename='index')
rbm = RoaringBitmap([0,3,6])
# compare the external query bitmap against every bitmap in the collection
jacs = [rbm.jaccard_dist(b) for b in multi_rb]
Perhaps there's not much overhead working directly in python, but I figured there might be a cleverer/faster way to do this.
Thanks!
Lewis
If a nonempty bitmap is updated with intersection_update() using an empty bitmap, it remains nonempty:
In [1]: from roaringbitmap import RoaringBitmap
In [2]: r = RoaringBitmap({1})
In [3]: r.intersection_update(RoaringBitmap([]))
In [4]: r
Out[4]: RoaringBitmap({1})
The Jaccard index (also called the Jaccard coefficient) is a useful measure of similarity between 2 sets:
J(A,B) = |A∩B| / |A∪B|
And the distance is:
D(A,B) = 1 − J(A,B)
Counting the cardinalities and sums at the Cython level would presumably be quicker than making new bitmaps etc? Good work on .intersection_len() btw 😃
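The distance can be computed from three counts alone, since |A∪B| = |A| + |B| − |A∩B|, so no union bitmap has to be built. A minimal sketch using built-in sets as stand-ins for bitmaps; `jaccard_dist_from_counts` is a hypothetical name, and returning 0.0 for two empty sets is a convention chosen here, not something the source specifies.

```python
def jaccard_dist_from_counts(len_a, len_b, len_intersection):
    """Jaccard distance from cardinalities only (no union set is built)."""
    union = len_a + len_b - len_intersection
    if union == 0:
        return 0.0  # convention chosen here for two empty sets
    return 1.0 - len_intersection / union

a = {1, 2, 3}
b = {3, 4, 5}

# With a bitmap, len(a & b) would be replaced by an intersection-length call.
d = jaccard_dist_from_counts(len(a), len(b), len(a & b))
print(d)  # 0.8
```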
Is there a plan to add run length encoding to this library?
When running
venv/bin/pip install --upgrade local_dvcs_package_repos/roaringbitmap
a NameError occurs:
Running setup.py install for roaringbitmap ... Running command /home/play/baga_test_dev2/venv/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-15r5__hw-build/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-msngpqmq-record/install-record.txt --single-version-externally-managed --compile --install-headers /home/play/baga_test_dev2/venv/include/site/python3.5/roaringbitmap
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/tmp/pip-15r5__hw-build/setup.py", line 95, in <module>
    extra_link_args=extra_link_args,
NameError: name 'extra_link_args' is not defined
Possibly happening when USE_CYTHON == False during execution of setup.py?
This might not be the anticipated usage pattern but could come up when building new MultiRoaringBitmaps from existing ones (in Python3.5.1, latest roaringbitmap):
from roaringbitmap import MultiRoaringBitmap
from roaringbitmap import RoaringBitmap
from random import sample
for_multi = []
for i in range(5):
    for_multi += [RoaringBitmap(sample(range(99999),200))]
mrb = MultiRoaringBitmap(for_multi)
# True
len(mrb) == 5
# True
mrb[4] == for_multi[4]
# None
mrb[5]
# segmentation fault
mrb[-1]
# segmentation fault
list(mrb)
for n,rb in enumerate(mrb):
    print('This is bitmap number {} (id:{})'.format(n+1,id(rb)))
# ...
# This is bitmap number 16 (id:140270067169880)
# Segmentation fault (core dumped)
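For comparison, Python sequences normally raise IndexError for mrb[5] and return the last item for mrb[-1] when there are five items. A minimal sketch of that index handling; `normalize_index` is a hypothetical helper and not the library's code.

```python
def normalize_index(i, size):
    """Map a possibly negative index onto 0..size-1, or raise IndexError."""
    if i < 0:
        i += size  # -1 refers to the last item, as in lists
    if not 0 <= i < size:
        raise IndexError("index out of range")
    return i

print(normalize_index(4, 5))   # 4
print(normalize_index(-1, 5))  # 4
```

Raising IndexError also matters for iteration: a class that defines only `__getitem__` is iterated by calling it with 0, 1, 2, ... until IndexError is raised, so a missing bounds check would let a `for` loop run past the end.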
could be related:
from roaringbitmap import RoaringBitmap
from roaringbitmap import ImmutableRoaringBitmap
r = RoaringBitmap(range(5))
i = ImmutableRoaringBitmap(r)
# segmentation fault
r = RoaringBitmap(i)
i = ImmutableRoaringBitmap(range(5))
# segmentation fault
r = RoaringBitmap(i)
Found a bug when pickling/unpickling a dict with RoaringBitmap elements. Doesn't happen all the time, but does in the case attached.
dd.out - the dict that was pickled
*.pickle the pickled data
bug.out - ipython3 session showing the bug
When many set comparisons are to be carried out, processing the contents of one MultiRoaringBitmap with several CPU cores would be more memory efficient than creating several MultiRoaringBitmaps with some common content and processing each with one core. I know the GIL is released in the current implementation, so maybe invoking parallelism is possible from within Python? Alternatively, it seems Cython might have convenient facilities for parallel loops:
http://docs.cython.org/src/userguide/parallelism.html
I'll have a fiddle although it would be new coding territory for me.
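One way to try this from within Python is a thread pool over chunks of index pairs, sharing a single collection between workers. This is a sketch only: frozensets stand in for bitmaps, all names are hypothetical, and threads give a speed-up only if the underlying comparison really releases the GIL, which pure-Python set operations do not.

```python
from concurrent.futures import ThreadPoolExecutor

def jaccard_dist(a, b):
    """Jaccard distance between two sets (stand-in for a bitmap method)."""
    union = len(a | b)
    return 1.0 - len(a & b) / union if union else 0.0

def chunked(seq, size):
    """Yield consecutive slices of seq with at most size items each."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

def parallel_distances(bitmaps, pairs, workers=4, chunk_size=1000):
    """Compute distances for (i, j) index pairs, one chunk per task."""
    def work(chunk):
        return [jaccard_dist(bitmaps[i], bitmaps[j]) for i, j in chunk]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves the order of the input chunks.
        parts = pool.map(work, chunked(pairs, chunk_size))
        return [d for part in parts for d in part]

bitmaps = [frozenset(range(1, 4)), frozenset(range(3, 6)), frozenset(range(5, 8))]
distances = parallel_distances(bitmaps, [(0, 1), (0, 2)], workers=2, chunk_size=1)
print(distances)
```

All workers read the same `bitmaps` object, so no per-core copies of the collection are made.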
@andreasvc could you maybe consider using the LGPL instead of the GPL? This is a library after all, and the GPL makes it hard to reuse in Apache-licensed code...
Thank you for your kind consideration
from pickle import load
print("set1")
with open('rset', 'rb') as ifile:
    rset=load(ifile)
print("set2")
with open('orset', 'rb') as ifile:
    orset=load(ifile)
print("intersection")
rset.intersection_update(orset)
print("complete")
aborts in the intersection:
Python(75946,0x7fff7704b000) malloc: *** mach_vm_map(size=18446744073709436928) failed (error code=3)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Abort trap: 6
test files at https://www.dropbox.com/s/yvahy3ynkdzvgl3/roaringbitmap_bug.tgz?dl=0
reproduces on OSX and Ubuntu 14.04 both in python3
Currently it's possible to slice a MultiRoaringBitmap, but it returns a list of ImmutableRoaringBitmap, see this line
>>> from roaringbitmap import MultiRoaringBitmap
>>> mrb = MultiRoaringBitmap([range(i) for i in range(2, 10)])
>>> mrb[:3]
[ImmutableRoaringBitmap({0, 1}),
ImmutableRoaringBitmap({0, 1, 2}),
ImmutableRoaringBitmap({0, 1, 2, 3})]
It's actually possible to get a new MultiRoaringBitmap, by calling the constructor
>>> MultiRoaringBitmap(mrb[0:5])
<roaringbitmap.MultiRoaringBitmap at 0x7fa0736859a8>
but I think this makes a copy of the Bitmaps in question
Would it make sense for MultiRoaringBitmap.__getitem__ to return an instance of MultiRoaringBitmap when given a slice?
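The requested behaviour matches how built-in sequences work, where slicing a tuple gives a tuple. A minimal sketch with a hypothetical `BitmapCollection` class standing in for MultiRoaringBitmap; whether the real class could do this without copying its underlying buffer is not something this sketch addresses.

```python
from collections.abc import Sequence

class BitmapCollection(Sequence):
    """Hypothetical immutable collection whose slices keep the same type."""

    def __init__(self, items):
        self._items = tuple(items)

    def __len__(self):
        return len(self._items)

    def __getitem__(self, i):
        if isinstance(i, slice):
            # Slicing the tuple copies references only, not the elements.
            return type(self)(self._items[i])
        return self._items[i]

coll = BitmapCollection(frozenset(range(i)) for i in range(2, 10))
head = coll[:3]
print(type(head).__name__, len(head))  # BitmapCollection 3
```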
>>> r=RoaringBitmap([0])
>>> len(r)
1
>>> r.numelem()
2
Basically, this code triggers the if branch for some values of set1 and set2:
seen|=set1
seen|=set2
if len(seen - set1 - set2) != 0:
    print("What?", len(seen - set1 - set2))
Full example available here. I actually ran this through a testcase minimizer and it was unable to do much, which suggests this is a difficult-to-trigger bug.
xor:
>>> len(RoaringBitmap(range(0, 61440)) ^ RoaringBitmap(range(0, 61440)))
0
>>> len(RoaringBitmap(range(0, 61441)) ^ RoaringBitmap(range(0, 61441)))
65536
this also causes issues with flip_range.
difference:
>>> len(RoaringBitmap(range(0, 61440)) - RoaringBitmap(range(0, 61440)))
0
>>> len(RoaringBitmap(range(0, 61441)) - RoaringBitmap(range(0, 61441)))
61441
As reported in andreasvc/disco-dop#68 (comment)
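Both results above violate the identities A ^ A = ∅ and A − A = ∅. A small regression check can be written against any set-like constructor. This is a sketch: `check_self_identities` is a hypothetical helper, and it is run here against the built-in set so that it is self-contained; passing RoaringBitmap as `make_set` would be the actual test.

```python
def check_self_identities(make_set, sizes):
    """Return a list of (operation, size, result_length) for every violation."""
    failures = []
    for n in sizes:
        s = make_set(range(n))
        xor_len = len(s ^ s)   # must be 0 for any set
        diff_len = len(s - s)  # must be 0 for any set
        if xor_len != 0:
            failures.append(("xor", n, xor_len))
        if diff_len != 0:
            failures.append(("difference", n, diff_len))
    return failures

# The two sizes from the report, plus the container boundary 2**16.
failures = check_self_identities(set, [61440, 61441, 65536])
print(failures)  # [] for the built-in set
```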
Feature request for a method like MultiRoaringBitmap.jaccard_dist() but for pairwise intersections. I realise a Python loop through zip(indices1, indices2) with calls to MultiRoaringBitmap.intersection() with pairs of indices will work, but a MultiRoaringBitmap.intersection_pw()-like function looping at the C level would be quite a lot more 🚀.
FWIW I have a slightly niche scenario of wanting to measure distances between many sets of sets. I plan to get the intersection of each set of sets, then measure the intersection of intersections between each pair. I'm thinking the denominator in a conventional Jaccard distance wouldn't properly capture the content of each set of sets, so I might use the union of each collection. Either way, many fast intersections would be useful, and the implementation might just be a cut-down MultiRoaringBitmap.jaccard_dist()? (And potentially useful in other scenarios?)
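The Python-level loop mentioned above can be sketched as follows, with frozensets standing in for the bitmaps. `intersection_len_pairwise` is a hypothetical name that mirrors the requested feature; it is not an existing library method.

```python
def intersection_len_pairwise(bitmaps, indices1, indices2):
    """Return len(bitmaps[i] & bitmaps[j]) for each pair (i, j) from the two index lists."""
    if len(indices1) != len(indices2):
        raise ValueError("index sequences must have the same length")
    return [len(bitmaps[i] & bitmaps[j]) for i, j in zip(indices1, indices2)]

bitmaps = [frozenset(range(1, 4)), frozenset(range(3, 6)), frozenset(range(5, 8))]
counts = intersection_len_pairwise(bitmaps, [0, 0, 1], [1, 2, 2])
print(counts)  # [1, 0, 1]
```

Returning only the counts keeps the result small; a variant that returns the intersections themselves would replace `len(...)` with the intersection expression.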
hey there, i can reliably cause a segfault on my system with the following:
from roaringbitmap import RoaringBitmap
rbm = RoaringBitmap()
rbm.add(3995084765)
rbm.clamp(0,8388607)
other things work fine such as:
rbm = RoaringBitmap()
rbm.add(10)
rbm.clamp(0,5)
version i am running is:
$ python --version
Python 3.7.0
$ pip list
Package Version
------------- -------
pip 18.1
roaringbitmap 0.6
setuptools 40.6.2
wheel 0.32.3
(.py)
on macos 10.13.4.
New elements appear in bitmap after calling difference_update:
In [1]: from roaringbitmap import RoaringBitmap
In [2]: r = RoaringBitmap(range(131071))
In [3]: r.pop()
Out[3]: 131070
In [4]: r.pop()
Out[4]: 131069
In [5]: r.difference_update(RoaringBitmap([130752]))
In [6]: r.pop()
Out[6]: 131070
The following code throws ValueError:
from roaringbitmap import RoaringBitmap
bits = RoaringBitmap([60748, 28806, 54664, 28597, 58922, 75684, 56364, 67421, 52608,
                      55686, 10427, 48506, 64363, 14506, 73077, 59035, 70246, 19875,
                      73145, 40225, 58664, 6597, 65554, 73102, 26636, 74227, 59566,
                      19023])
# pop elements until the bitmap is empty
while bits:
    bits.pop()
from roaringbitmap import RoaringBitmap
from roaringbitmap import MultiRoaringBitmap
from array import array
a = RoaringBitmap(array('L',range(1,4)))
b = RoaringBitmap(array('L',range(3,6)))
c = RoaringBitmap(array('L',range(5,8)))
mrb = MultiRoaringBitmap([a,b,c])
# True
array('d', [0.8, 1.0]) == mrb.jaccard_dist(array('L',[1,1]),array('L',[2,3]))
array('d', [0.8, 1.0]) == mrb.jaccard_dist(array('Q',[1,1]),array('Q',[2,3]))
# these cause seg fault
d = mrb.jaccard_dist(array('H',[1,1]),array('H',[2,3]))
d = mrb.jaccard_dist(array('h',[1,1]),array('h',[2,3]))
d = mrb.jaccard_dist(array('I',[1,1]),array('I',[2,3]))
d = mrb.jaccard_dist(array('i',[1,1]),array('i',[2,3]))
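Until other typecodes are handled, a caller-side workaround is to convert index arrays before the call. This is a sketch: `as_index_array` is a hypothetical helper, and the choice of 'L' as the target typecode is an assumption based only on the report above, where 'L' and 'Q' arrays work.

```python
from array import array

def as_index_array(indices, typecode='L'):
    """Return indices as an array with the given typecode, converting if needed."""
    if isinstance(indices, array) and indices.typecode == typecode:
        return indices  # already in the expected layout, no copy
    # Raises OverflowError for negative values when the target is unsigned.
    return array(typecode, list(indices))

short_indices = array('H', [1, 1])
converted = as_index_array(short_indices)
print(converted.typecode, list(converted))  # L [1, 1]
```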