andreasvc / roaringbitmap

Roaring Bitmap in Cython

Home Page: http://roaringbitmap.readthedocs.io

License: GNU General Public License v2.0

Makefile 0.67% Python 20.02% C 5.10% Cython 74.21%
python roaring-bitmaps bitset cython datastructures

roaringbitmap's People

Contributors

andreasvc, andy-from-miso, daveuu, ezibenroc, sei-eschwartz

roaringbitmap's Issues

Bug in __getitem__ with slices

There is a bug when the step of the slice is not 1.

Example:

RoaringBitmap([0, 10])[::2]
# RoaringBitmap({0, 10})
[0, 10][::2]
# [0]

The problem is that here you take the intersection of the original roaring bitmap, i.e. {0, 10}, with range(0, 11, 2).
I doubt any generic method involving an intersection with a range will work, since the slice describes the indices and not the content.

What will work is the following:

def __getitem__(self, i):
    return self.__class__(list(self)[i])

But it might be less efficient in some cases.
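
The difference between the two behaviours can be reproduced with built-in types. The following is a minimal sketch that uses a sorted list as a stand-in for the bitmap; the helper names `slice_by_intersection` and `slice_by_index` are hypothetical and not part of roaringbitmap. For non-negative slice bounds, `itertools.islice` avoids materializing the whole list.

```python
from itertools import islice


def slice_by_intersection(values, start, stop, step):
    """Buggy approach: treat the slice as a range of *values*."""
    return sorted(set(values) & set(range(start, stop, step)))


def slice_by_index(values, start, stop, step):
    """Intended approach: treat the slice as a range of *positions*."""
    return list(islice(sorted(values), start, stop, step))


values = [0, 10]
# Intersecting with range(0, 11, 2) keeps both elements, since 0 and 10 are even.
by_value = slice_by_intersection(values, 0, 11, 2)
# Taking every second position keeps only the first element.
by_index = slice_by_index(values, None, None, 2)
```

Here `by_value` keeps both elements while `by_index` agrees with `[0, 10][::2]`, which is the list behaviour the report expects.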

access RoaringBitmap from cython for static typing

I would like to access the RoaringBitmap class from Cython, so that I can use it for static typing inside my Cython code.

The first problem I have is that the package only contains a .pyx file referring to the class.
For now I have been trying to import other declarations from your package (just for testing), assuming the same approach would work if RoaringBitmap had a dedicated header file.

My current solution is the following.
I have a try_extern_cimport.pyx file which contains:

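# uint64_t must be declared before it can be used in the extern block
from libc.stdint cimport uint64_t

cdef extern from "roaringbitmap/src/bitcount.h":
    unsigned int bit_clz(uint64_t) nogil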

Installing the package via git clone inside the current directory and running cythonize try_extern_cimport.pyx actually works, but my goal is for this to work after a simple 'pip install'.

I think roaringbitmap.pyx needs a dedicated header file and ideally this header file could be accessible via

cdef extern from "roaringbitmap.h":

If I misunderstand something then sorry in advance; I am not even sure it is common practice to expose the header files of a Cython package.

Strange .clamp() behaviour with some ranges

The RoaringBitmap.clamp() method seems to be broken for certain ranges:

e.g.

Setup:
>>> from roaringbitmap import RoaringBitmap
>>> b = RoaringBitmap([1,2,3])

These work as expected:
>>> b.clamp(0, 4)
RoaringBitmap({1, 2, 3})
>>> b.clamp(0, 65535)
RoaringBitmap({1, 2, 3})

These ranges are not behaving consistently with the above:
>>> b.clamp(0, 65536)
RoaringBitmap({})
>>> b.clamp(0, 65537)
RoaringBitmap({})
>>> b.clamp(0, 65538)
RoaringBitmap({1})
>>> b.clamp(0, 65539)
RoaringBitmap({1, 2})

And now we seem to 'wrap around' to expected ranges again:
>>> b.clamp(0, 65540)
RoaringBitmap({1, 2, 3})

If I understand the documentation correctly, all of the above should return a RoaringBitmap({1,2,3}).
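
The expectation stated above can be written down as a reference function over a built-in set. This is a minimal sketch, assuming that clamp(start, stop) is meant to keep the elements that fall inside range(start, stop); `clamp_reference` is a hypothetical name and not the library's implementation.

```python
def clamp_reference(values, start, stop):
    """Keep only the elements that fall inside range(start, stop).

    Assumption: the interval is half-open, like range(start, stop).
    """
    return {x for x in values if start <= x < stop}


b = {1, 2, 3}
stops = (4, 65535, 65536, 65537, 65538, 65539, 65540)
# Every stop value from the report is larger than max(b), so nothing is cut off.
results = {stop: clamp_reference(b, 0, stop) for stop in stops}
```

Under this reading every entry in `results` is `{1, 2, 3}`. The first failing stop value in the report is 65536, which equals 2**16; that may point at a block-boundary problem, but the report itself does not confirm the cause.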

`MultiRoaringBitmap.jaccard_dist` against a query coming from an external `RoaringBitmap`

Hi rbm,
Is it possible to calculate bulk Jaccard distances across a MultiRoaringBitmap where the query is not already within the MultiRoaringBitmap?

A straightforward way might be:

multi_rb = MultiRoaringBitmap(list_of_indices, filename='index')
rbm = RoaringBitmap([0,3,6])
jacs = [rbm.jaccard_dist(b) for b in multi_rb]

Perhaps there's not much overhead working directly in python, but I figured there might be a cleverer/faster way to do this.
Thanks!
Lewis

Bug in intersection_update

If a nonempty bitmap is updated via intersection_update() with an empty bitmap, it remains nonempty:

In [1]: from roaringbitmap import RoaringBitmap

In [2]: r = RoaringBitmap({1})

In [3]: r.intersection_update(RoaringBitmap([]))

In [4]: r
Out[4]: RoaringBitmap({1})

Feature request: Jaccard statistic

The Jaccard statistic (coefficient?) is a useful measure of similarity between 2 sets:
J(A,B) = |A∩B| / |A∪B|

And the distance is:
D(A,B) = 1 − J(A,B)

Counting the cardinalities and summing them at the Cython level would presumably be quicker than making new bitmaps etc? Good work on .intersection_len() btw 😃
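
The counting idea can be sketched in plain Python: since |A∪B| = |A| + |B| − |A∩B|, only the intersection size and the two set sizes are needed, and the union never has to be built. This is an illustration with built-in sets, not the library's implementation. The function name `jaccard_dist` below is a free-standing helper defined here, and the result for two empty sets is an assumption.

```python
def jaccard_dist(a, b):
    """Jaccard distance 1 - |A∩B| / |A∪B| computed from cardinalities only."""
    inter = len(a & b)
    union = len(a) + len(b) - inter  # inclusion-exclusion, no union object built
    if union == 0:
        return 0.0  # assumption: two empty sets count as identical
    return 1.0 - inter / union


# One external query compared against every member of a collection.
query = {1, 2, 3}
collection = [{3, 4, 5}, {1, 2, 3}, {7, 8}]
dists = [jaccard_dist(query, member) for member in collection]
```

With RoaringBitmap objects, the `len(a & b)` step could presumably be replaced by the .intersection_len() method mentioned above, which avoids creating the intermediate bitmap.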

convert Block to a struct

  • requires manual memory management and eliminating all Python data structures, particularly Python array objects.
  • makes it possible to release the GIL while performing operations on Blocks.
  • might be possible to make pickling more efficient.

NameError: name 'extra_link_args' is not defined (setup.py)

When running

venv/bin/pip install --upgrade local_dvcs_package_repos/roaringbitmap

a NameError occurs:

  Running setup.py install for roaringbitmap ...     Running command /home/play/baga_test_dev2/venv/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-15r5__hw-build/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-msngpqmq-record/install-record.txt --single-version-externally-managed --compile --install-headers /home/play/baga_test_dev2/venv/include/site/python3.5/roaringbitmap
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-15r5__hw-build/setup.py", line 95, in <module>
        extra_link_args=extra_link_args,
    NameError: name 'extra_link_args' is not defined

This possibly happens when USE_CYTHON == False during execution of setup.py here.
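
A common cause of this kind of NameError is a variable that is assigned in only one branch of a conditional and then used unconditionally. The sketch below shows the usual remedy, assigning defaults before the branch. It is not the project's actual setup.py: the function `extension_options`, its parameters, and the flag values are all hypothetical.

```python
def extension_options(use_cython, debug=False):
    """Build keyword arguments for an Extension without unbound names."""
    # Defaults are assigned unconditionally so every code path sees them.
    extra_compile_args = ['-O3']
    extra_link_args = []
    if use_cython and debug:
        # Hypothetical debug flags, overriding the defaults in this branch only.
        extra_compile_args = ['-O0', '-g']
        extra_link_args = ['-g']
    return {
        'extra_compile_args': extra_compile_args,
        'extra_link_args': extra_link_args,
    }


options_without_cython = extension_options(use_cython=False)
options_debug = extension_options(use_cython=True, debug=True)
```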

MultiRoaringBitmap iteration exceeds length causing segmentation fault

This might not be the anticipated usage pattern, but it could come up when building new MultiRoaringBitmaps from existing ones (in Python 3.5.1, latest roaringbitmap):

from roaringbitmap import MultiRoaringBitmap
from roaringbitmap import RoaringBitmap
from random import sample

for_multi = []
for i in range(5):
    for_multi += [RoaringBitmap(sample(range(99999),200))]

mrb = MultiRoaringBitmap(for_multi)

# True
len(mrb) == 5

# True
mrb[4] == for_multi[4]

# None
mrb[5]

# segmentation fault
mrb[-1]

# segmentation fault
list(mrb)

for n,rb in enumerate(mrb):
    print('This is bitmap number {} (id:{})'.format(n+1,id(rb)))

# ...
# This is bitmap number 16 (id:140270067169880)
# Segmentation fault (core dumped)

could be related:

from roaringbitmap import RoaringBitmap
from roaringbitmap import ImmutableRoaringBitmap

r = RoaringBitmap(range(5))
i = ImmutableRoaringBitmap(r)
# segmentation fault
r = RoaringBitmap(i)

i = ImmutableRoaringBitmap(range(5))
# segmentation fault
r = RoaringBitmap(i)
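
For the indexing cases above (`mrb[5]` returning None and `mrb[-1]` crashing), the behaviour Python sequences normally follow is to map negative indices onto the end and raise IndexError when out of range. When a class defines `__getitem__` but not `__iter__`, Python iterates by calling `__getitem__` with 0, 1, 2, ... until IndexError is raised, so a missing IndexError is one possible explanation for iteration running past the end. The sketch below is an illustration only; `normalize_index` and `BoundedSequence` are hypothetical names, not the library's code.

```python
def normalize_index(i, length):
    """Map a possibly negative index to 0..length-1 or raise IndexError."""
    if i < 0:
        i += length
    if not 0 <= i < length:
        raise IndexError('index out of range')
    return i


class BoundedSequence:
    """Tiny sequence that relies on __getitem__ alone for iteration."""

    def __init__(self, items):
        self._items = list(items)

    def __len__(self):
        return len(self._items)

    def __getitem__(self, i):
        return self._items[normalize_index(i, len(self._items))]


seq = BoundedSequence('abcde')
last = seq[-1]          # negative index maps to the final element
everything = list(seq)  # iteration stops at the IndexError raised for index 5
```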

pickle/unpickle bug

Found a bug when pickling/unpickling a dict with RoaringBitmap elements. Doesn't happen all the time, but does in the case attached.

dd.out - the dict that was pickled
*.pickle the pickled data
bug.out - ipython3 session showing the bug

bug.zip

Feature request: multi-threaded MultiRoaringBitmap e.g., jaccard_dist()

When many set comparisons are to be carried out, processing the contents of one MultiRoaringBitmap with several CPU cores would be more memory efficient than creating several MultiRoaringBitmaps with some common content and processing each with one core. I know the GIL is released in the current implementation, so maybe invoking parallelism is possible from within Python? Alternatively, it seems Cython might have convenient facilities for parallel loops:
http://docs.cython.org/src/userguide/parallelism.html
I'll have a fiddle although it would be new coding territory for me.

Usage as a library?

@andreasvc could you consider using the LGPL instead of the GPL? This is a library after all, and the GPL makes it hard to reuse in Apache-licensed code...
Thank you for your kind consideration

Intersection of 2 large sets causing aborts

from pickle import load

print("set1")
with open('rset', 'rb') as ifile:
    rset=load(ifile)

print("set2")
with open('orset', 'rb') as ifile:
    orset=load(ifile)

print("intersection")
rset.intersection_update(orset)
print("complete")

aborts in the intersection:

Python(75946,0x7fff7704b000) malloc: *** mach_vm_map(size=18446744073709436928) failed (error code=3)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Abort trap: 6

test files at https://www.dropbox.com/s/yvahy3ynkdzvgl3/roaringbitmap_bug.tgz?dl=0

Reproduces on both OS X and Ubuntu 14.04, under Python 3.

MultiRoaringBitmap slicing return type

Currently it is possible to slice a MultiRoaringBitmap, but it returns a list of ImmutableRoaringBitmap objects (see this line):

>>> from roaringbitmap import MultiRoaringBitmap
>>> mrb = MultiRoaringBitmap([range(i) for i in range(2, 10)])
>>> mrb[:3]
[ImmutableRoaringBitmap({0, 1}),
 ImmutableRoaringBitmap({0, 1, 2}),
 ImmutableRoaringBitmap({0, 1, 2, 3})]

It's actually possible to get a new MultiRoaringBitmap, by calling the constructor

>>> MultiRoaringBitmap(mrb[0:5])
<roaringbitmap.MultiRoaringBitmap at 0x7fa0736859a8>

but I think this makes a copy of the bitmaps in question.

Would it make sense for MultiRoaringBitmap.__getitem__ to return an instance of MultiRoaringBitmap?

Strange behavior using git version of roaring bitmap

Basically, this code triggers the if branch for some values of set1 and set2:

seen|=set1
seen|=set2

if len(seen - set1 - set2) != 0:
    print "What?", len(seen - set1 - set2)

Full example available here. I actually ran this through a testcase minimizer and it was unable to do much, which suggests this is a difficult-to-trigger bug.

xor, difference are incorrect on large bitmaps

xor:

>>> len(RoaringBitmap(range(0, 61440)) ^ RoaringBitmap(range(0, 61440)))
0
>>> len(RoaringBitmap(range(0, 61441)) ^ RoaringBitmap(range(0, 61441)))
65536

this also causes issues with flip_range.

difference:

>>> len(RoaringBitmap(range(0, 61440)) - RoaringBitmap(range(0, 61440)))
0
>>> len(RoaringBitmap(range(0, 61441)) - RoaringBitmap(range(0, 61441)))
61441
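
Both reports rely on the identities A ^ A = ∅ and A − A = ∅, which can be turned into a small size sweep. The sketch below is a hypothetical test helper (`self_cancel_failures` is not part of roaringbitmap) that works with any set-like class accepting an iterable; it is exercised here with the built-in `set` so it runs without the library installed.

```python
def self_cancel_failures(cls, sizes):
    """Return the sizes n for which range(n) fails A ^ A == {} or A - A == {}."""
    failures = []
    for n in sizes:
        a, b = cls(range(n)), cls(range(n))
        if len(a ^ b) != 0 or len(a - b) != 0:
            failures.append(n)
    return failures


# Sizes around the values in the report; note that 61440 == 2**16 - 2**12.
sizes = [61440, 61441, 65535, 65536, 65537]
failures = self_cancel_failures(set, sizes)
```

Passing `RoaringBitmap` instead of `set` would run the same sweep against the library; going by the output above, 61441 would be expected to appear in the returned list on the affected version.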

Compute intersection for pairs within a MultiRoaringBitmap

Feature request for a method like MultiRoaringBitmap.jaccard_dist() but for pairwise intersections. I realise a Python loop through zip(indices1, indices2) with calls to MultiRoaringBitmap.intersection() with pairs of indices will work, but a MultiRoaringBitmap.intersection_pw()-like function looping at the C-level would be quite a lot more 🚀.

FWIW I have a slightly niche scenario of wanting to measure distances between many sets of sets. I plan to get the intersection of each set of sets, then measure the intersection of intersections between each pair. I'm thinking the denominator in a conventional jaccard distance wouldn't properly capture content of each set of sets . . . so I might use the union of each collection. Either way many fast intersections would be useful and the implementation might just be a cut down MultiRoaringBitmap.jaccard_dist()? (and potentially useful in other scenarios?)
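
For reference, the Python-level loop described above looks roughly like the sketch below. It uses built-in sets and the `&` operator rather than any specific MultiRoaringBitmap method, and `intersection_len_pairwise` is a hypothetical name chosen for illustration, not an existing API.

```python
def intersection_len_pairwise(bitmaps, indices1, indices2):
    """Intersection size for each pair (bitmaps[i], bitmaps[j])."""
    return [len(bitmaps[i] & bitmaps[j]) for i, j in zip(indices1, indices2)]


bitmaps = [{1, 2, 3}, {3, 4, 5}, {5, 6, 7}]
# Pairs compared: (0, 1), (0, 2), (1, 2)
counts = intersection_len_pairwise(bitmaps, [0, 0, 1], [1, 2, 2])
```

A C-level version would do the same pairing without creating an intermediate set per pair, which is where the requested speedup would come from.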

some specific values can reliably segfault clamp()

hey there, i can reliably cause a segfault on my system with the following:

from roaringbitmap import RoaringBitmap

rbm = RoaringBitmap()
rbm.add(3995084765)
rbm.clamp(0,8388607)

other things work fine such as:

rbm = RoaringBitmap()
rbm.add(10)
rbm.clamp(0,5)

version i am running is:

$ python --version
Python 3.7.0
$ pip list 
Package       Version
------------- -------
pip           18.1   
roaringbitmap 0.6    
setuptools    40.6.2 
wheel         0.32.3 
(.py) 

on macos 10.13.4.

Bug in difference_update

New elements appear in bitmap after calling difference_update:

In [1]: from roaringbitmap import RoaringBitmap

In [2]: r = RoaringBitmap(range(131071))

In [3]: r.pop()
Out[3]: 131070

In [4]: r.pop()
Out[4]: 131069

In [5]: r.difference_update(RoaringBitmap([130752]))

In [6]: r.pop()
Out[6]: 131070

Nonempty RoaringBitmap throws ValueError on pop()

The following code throws a ValueError:

bits = RoaringBitmap([60748, 28806, 54664, 28597, 58922, 75684, 56364, 67421, 52608, 
                      55686, 10427, 48506, 64363, 14506, 73077, 59035, 70246, 19875, 
                      73145, 40225, 58664, 6597, 65554, 73102, 26636, 74227, 59566, 
                      19023])
while bits:
    bits.pop()

arrays with elements <4 bytes cause MultiRoaringBitmap.jaccard_dist() seg fault

from roaringbitmap import RoaringBitmap
from roaringbitmap import MultiRoaringBitmap
from array import array
a = RoaringBitmap(array('L',range(1,4)))
b = RoaringBitmap(array('L',range(3,6)))
c = RoaringBitmap(array('L',range(5,8)))
mrb = MultiRoaringBitmap([a,b,c])
# both of these comparisons are True
array('d', [0.8, 1.0]) == mrb.jaccard_dist(array('L',[1,1]),array('L',[2,3]))
array('d', [0.8, 1.0]) == mrb.jaccard_dist(array('Q',[1,1]),array('Q',[2,3]))
# these cause seg fault
d = mrb.jaccard_dist(array('H',[1,1]),array('H',[2,3]))
d = mrb.jaccard_dist(array('h',[1,1]),array('h',[2,3]))
d = mrb.jaccard_dist(array('I',[1,1]),array('I',[2,3]))
d = mrb.jaccard_dist(array('i',[1,1]),array('i',[2,3]))
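
Until the typecode is validated inside the library, a caller-side workaround is to convert index arrays before passing them in. The sketch below assumes, based only on the working calls above, that typecode 'L' is accepted; `as_unsigned_long` is a hypothetical helper name.

```python
from array import array


def as_unsigned_long(indices):
    """Return the indices as array('L'), copying only when necessary."""
    if isinstance(indices, array) and indices.typecode == 'L':
        return indices
    # Going through a list re-encodes each element at the new item size.
    # Negative values raise OverflowError, since 'L' is unsigned.
    return array('L', list(indices))


narrow = array('H', [1, 1])
converted = as_unsigned_long(narrow)
unchanged = as_unsigned_long(array('L', [2, 3]))
```

A call would then look like `mrb.jaccard_dist(as_unsigned_long(idx1), as_unsigned_long(idx2))`, with `idx1` and `idx2` standing for the caller's index arrays.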
