danielstutzbach / blist

A list-like type with better asymptotic performance and similar performance on small lists

License: Other

Python 57.09% C 42.91%

blist's Issues

sizeof unimplemented for iterators

They currently fall back to the default object implementation, returning a value that is too small.

sorteddict and btuple have meaningless slots, other collections have no slots at all

When subclassing sorteddict I defined a __slots__ attribute and realized that it was futile: although sorteddict has __slots__, they are meaningless because some superclasses (from _abcoll...) do not define them. Some other collections don't have __slots__ at all. It would be nice to be able to make use of __slots__ (sometimes there are plenty of collection instances, and that's my case).

Python 3.1.2 (release31-maint, Sep 26 2010, 13:51:01) 
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from blist import *
>>> bl = blist()
>>> bl.spam = 3
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: '_blist.blist' object has no attribute 'spam'
>>> 
>>> bt = btuple()
>>> bt.spam = 3
>>> btuple.__slots__
['_hash', '_blist']
>>> 
>>> sd = sorteddict()
>>> sd.spam = 3
>>> sorteddict.__slots__
['_sortedkeys', '_map']
>>> 
>>> sl = sortedlist()
>>> sl.spam = 3
>>> sl.__slots__
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'sortedlist' object has no attribute '__slots__'
>>> 
>>> s = sortedset()
>>> s.spam = 3
>>> sortedset.__slots__
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: type object 'sortedset' has no attribute '__slots__'
>>> 
>>> wsl = weaksortedlist()
>>> wsl.spam = 3
>>> wsl.__slots__
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'weaksortedlist' object has no attribute '__slots__'
>>> 
>>> ws = weaksortedset()
>>> ws.spam = 3
>>> ws.__slots__
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'weaksortedset' object has no attribute '__slots__'
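
The effect can be reproduced without blist. The following is a minimal sketch with hypothetical classes (PlainBase, SlottedBase and their subclasses are illustrative names, not part of blist): __slots__ in a subclass only removes the per-instance __dict__ if every class in the inheritance chain defines __slots__ as well.

```python
class PlainBase:
    """Base class without __slots__: its instances always carry a __dict__."""


class SlotsOverPlain(PlainBase):
    # No memory benefit here: the __dict__ is inherited from PlainBase.
    __slots__ = ("value",)


class SlottedBase:
    # Empty __slots__ keeps the whole chain free of a __dict__.
    __slots__ = ()


class SlotsOverSlotted(SlottedBase):
    __slots__ = ("value",)


def accepts_arbitrary_attribute(obj):
    """Return True if an undeclared attribute can be set on obj."""
    try:
        obj.spam = 3
    except AttributeError:
        return False
    return True


print(accepts_arbitrary_attribute(SlotsOverPlain()))    # True
print(accepts_arbitrary_attribute(SlotsOverSlotted()))  # False
```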

Limited iteration

B+ trees are often used to answer the question "which items satisfy p >= start and p < stop?" I reckon this should be possible to answer with the blist implementation, but it does not seem to be possible via the exposed APIs.

The use case is building a database indexing engine on top of blist.
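
For illustration, here is a sketch of such a range query on a plain sorted Python list, using the standard bisect module. The function name items_in_range is hypothetical, and nothing here is blist API; a sorted blist type would presumably need an equivalent built on its own bisect methods.

```python
from bisect import bisect_left


def items_in_range(sorted_items, start, stop):
    """Return the items p with start <= p < stop from a sorted sequence."""
    lo = bisect_left(sorted_items, start)  # first index whose item is >= start
    hi = bisect_left(sorted_items, stop)   # first index whose item is >= stop
    return sorted_items[lo:hi]


index = [1, 3, 5, 7, 9, 11]
print(items_in_range(index, 3, 8))  # [3, 5, 7]
```

Both lookups are binary searches, so the cost is dominated by the size of the returned slice.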

sortedset should support deleting by index or range of indexes

sortedlist should support a peek() operation

sortedlist should support a peek() operation so the next value to be pop()ed can be previewed before popping

blist.blist subclasses can't have custom operators

I assume this is categorized as list compatibility.

class CustomList(list):
  def __div__(self, count):
    return 5
>>> l= CustomList()
>>> l/1
5

class CustomBlist(blist.blist):
  def __div__(self, count):
    return 4
>>> l= CustomBlist()
>>> l/1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for /: 'CustomBlist' and 'int'

start, stop parameters are missing from a few functions in the sortedlist family.

O(log n) add and remove for sorted*

A C implementation of sorted* could give O(log n) add and remove methods. It remains to be seen if that would really boost performance, as the number of comparisons is already O(log n) and the tree can't be that tall anyway -- max height is now 16 in the C source.

Speed up contains, compare, etc. operations by making an O(1) copy first

Right now, the blist has to perform a lot of checking after each comparison operation to make sure the comparison function did not permute the list. We could save a lot of time by making an O(1) copy first.

I'm not sure if that would pay off for small lists (where the root is also a leaf), but it might.

For long lists, it should be a clear win. We should probably be careful to avoid starting the cache in the root which would use O(n) memory.

Make sure that deepcopy is efficient

It should copy the internal blist structure, making it very fast when copy-on-write is being heavily used. Observe how list() handles it:

>>> import copy
>>> class blah:
...  pass
...
>>> x = blah()
>>> y = [x,x]
>>> z = copy.deepcopy(y)
>>> id(z[0]), id(z[1])
(2146287308, 2146287308)
>>> id(x)
2146287244
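
The sharing shown above comes from the memo dictionary that copy.deepcopy passes around. As a rough sketch (Box is a hypothetical container standing in for a blist-like type, not the actual implementation), a custom __deepcopy__ preserves that sharing as long as it forwards memo to every nested copy:

```python
import copy


class Blah:
    pass


class Box:
    """Hypothetical container; a stand-in for a blist-like type."""

    def __init__(self, items):
        self.items = list(items)

    def __deepcopy__(self, memo):
        new = Box([])
        memo[id(self)] = new  # register early so self-references resolve
        # Forwarding memo ensures an object referenced twice is copied once.
        new.items = [copy.deepcopy(item, memo) for item in self.items]
        return new


x = Blah()
z = copy.deepcopy(Box([x, x]))
print(z.items[0] is z.items[1])  # True: still one shared object
print(z.items[0] is x)           # False: but it is a copy
```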

less performant than standard list

Sorting can be slower than with the standard Python list (based on an example of timsort performance on nearly sorted data):

>>> import blist
>>> import random
>>> l = [i for i in range(1048576)]
>>> for i in range(3):
...     r1 = random.randint(0, len(l) - 1)
...     r2 = random.randint(0, len(l) - 1)
...     l[r1], l[r2] = l[r2], l[r1]
...

>>> bl = blist.blist(l)
>>> def t(l):
    import time
    before = time.time()
    l.sort()
    return time.time() - before

>>> t(l)
0.04699993133544922
>>> t(bl)
0.28100013732910156

AttributeError when indexing ItemsView of a sorteddict

In [11]: d = blist.sorteddict()

In [12]: d[1] = 2

In [13]: d.items()
Out[13]: ItemsView(sorteddict({1: 2}))

In [14]: d.items()[0]

AttributeError Traceback (most recent call last)
in ()
----> 1 d.items()[0]

/usr/local/lib/python3.3/dist-packages/blist-1.3.4-py3.3-linux-x86_64.egg/_sorteddict.py in __getitem__(self, index)
30 return self._from_iterable((key, self._mapping[key])
31 for key in keys)
---> 32 key = self._mapping.sortedkeys[index]
33 return (key, self._mapping[key])
34 def index(self, item):

AttributeError: 'sorteddict' object has no attribute 'sortedkeys'

This bug has also been reported on Stack Overflow: http://stackoverflow.com/questions/11285312/indexing-the-valuesview-of-a-sorteddict

Add a sortedlist type

For read and delete operations, works just like a list.

For insert operations, works just like a set.

The constructor should take an optional "key" function.

sudo pip install btree fails

Ubuntu 12.04

$ sudo pip install btree
Downloading/unpacking btree
Running setup.py egg_info for package btree
Installing collected packages: btree
Running setup.py install for btree
btree.c:35:19: fatal error: btree.h: No such file or directory
compilation terminated.
building 'btree' extension
gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -I/usr/include/python2.7 -c btree.c -o build/temp.linux-i686-2.7/btree.o
Traceback (most recent call last):
File "", line 1, in
File "/home/username/build/btree/setup.py", line 7, in
paver.tasks.main()
File "paver-minilib.zip/paver/tasks.py", line 621, in main
File "paver-minilib.zip/paver/tasks.py", line 604, in _launch_pavement
File "paver-minilib.zip/paver/tasks.py", line 569, in _process_commands
File "paver-minilib.zip/paver/setuputils.py", line 146, in call
File "/usr/lib/python2.7/distutils/dist.py", line 972, in run_command
cmd_obj.run()
File "/usr/lib/python2.7/dist-packages/setuptools/command/install.py", line 53, in run
return _install.run(self)
File "/usr/lib/python2.7/distutils/command/install.py", line 601, in run
self.run_command('build')
File "/usr/lib/python2.7/distutils/cmd.py", line 326, in run_command
self.distribution.run_command(command)
File "/usr/lib/python2.7/distutils/dist.py", line 972, in run_command
cmd_obj.run()
File "/usr/lib/python2.7/distutils/command/build.py", line 128, in run
self.run_command(cmd_name)
File "/usr/lib/python2.7/distutils/cmd.py", line 326, in run_command
self.distribution.run_command(command)
File "/usr/lib/python2.7/distutils/dist.py", line 972, in run_command
cmd_obj.run()
File "/usr/lib/python2.7/dist-packages/setuptools/command/build_ext.py", line 46, in run
_build_ext.run(self)
File "/usr/lib/python2.7/distutils/command/build_ext.py", line 339, in run
self.build_extensions()
File "/usr/lib/python2.7/distutils/command/build_ext.py", line 448, in build_extensions
self.build_extension(ext)
File "/usr/lib/python2.7/dist-packages/setuptools/command/build_ext.py", line 182, in build_extension
_build_ext.build_extension(self,ext)
File "/usr/lib/python2.7/distutils/command/build_ext.py", line 498, in build_extension
depends=ext.depends)
File "/usr/lib/python2.7/distutils/ccompiler.py", line 572, in compile
self._compile(obj, src, ext, cc_args, extra_postargs, pp_opts)
File "/usr/lib/python2.7/distutils/unixccompiler.py", line 180, in _compile
raise CompileError, msg
distutils.errors.CompileError: command 'gcc' failed with exit status 1
Complete output from command /usr/bin/python -c "import setuptools;file='/home/username/build/btree/setup.py';exec(compile(open(file).read().replace('\r\n', '\n'), file, 'exec'))" install --single-version-externally-managed --record /tmp/pip-NmMbGC-record/install-record.txt:
btree.c:35:19: fatal error: btree.h: No such file or directory
compilation terminated.
running install
running build
running build_ext
building 'btree' extension
gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -I/usr/include/python2.7 -c btree.c -o build/temp.linux-i686-2.7/btree.o
Traceback (most recent call last):
File "", line 1, in
File "/home/username/build/btree/setup.py", line 7, in
paver.tasks.main()
File "paver-minilib.zip/paver/tasks.py", line 621, in main
File "paver-minilib.zip/paver/tasks.py", line 604, in _launch_pavement
File "paver-minilib.zip/paver/tasks.py", line 569, in _process_commands
File "paver-minilib.zip/paver/setuputils.py", line 146, in call
File "/usr/lib/python2.7/distutils/dist.py", line 972, in run_command
cmd_obj.run()
File "/usr/lib/python2.7/dist-packages/setuptools/command/install.py", line 53, in run
return _install.run(self)
File "/usr/lib/python2.7/distutils/command/install.py", line 601, in run
self.run_command('build')
File "/usr/lib/python2.7/distutils/cmd.py", line 326, in run_command
self.distribution.run_command(command)
File "/usr/lib/python2.7/distutils/dist.py", line 972, in run_command
cmd_obj.run()
File "/usr/lib/python2.7/distutils/command/build.py", line 128, in run
self.run_command(cmd_name)
File "/usr/lib/python2.7/distutils/cmd.py", line 326, in run_command
self.distribution.run_command(command)
File "/usr/lib/python2.7/distutils/dist.py", line 972, in run_command
cmd_obj.run()
File "/usr/lib/python2.7/dist-packages/setuptools/command/build_ext.py", line 46, in run
_build_ext.run(self)
File "/usr/lib/python2.7/distutils/command/build_ext.py", line 339, in run
self.build_extensions()
File "/usr/lib/python2.7/distutils/command/build_ext.py", line 448, in build_extensions
self.build_extension(ext)
File "/usr/lib/python2.7/dist-packages/setuptools/command/build_ext.py", line 182, in build_extension
_build_ext.build_extension(self,ext)
File "/usr/lib/python2.7/distutils/command/build_ext.py", line 498, in build_extension
depends=ext.depends)
File "/usr/lib/python2.7/distutils/ccompiler.py", line 572, in compile
self._compile(obj, src, ext, cc_args, extra_postargs, pp_opts)
File "/usr/lib/python2.7/distutils/unixccompiler.py", line 180, in _compile
raise CompileError, msg
distutils.errors.CompileError: command 'gcc' failed with exit status 1

Command /usr/bin/python -c "import setuptools;file='/home/username/build/btree/setup.py';exec(compile(open(file).read().replace('\r\n', '\n'), file, 'exec'))" install --single-version-externally-managed --record /tmp/pip-NmMbGC-record/install-record.txt failed with error code 1
Storing complete log in /home/username/.pip/pip.log

sortedlist needs an insort_left-like method

Although sortedlist provides methods to find the index at which a new element would be inserted, either before or after any equal elements (bisect_left and bisect_right), those indexes cannot be used to actually insert the element at the desired position, because the 'add' method only inserts after equal elements. For now, achieving this requires access to protected attributes and methods, which is not very convenient.
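
For comparison, the standard bisect module offers both variants for plain lists. The sketch below uses only the standard library (no blist API is assumed) and relies on 1 == 1.0 to make the two insertion positions visible:

```python
from bisect import insort_left, insort_right

left = [1, 2]
insort_left(left, 1.0)    # 1.0 == 1, so it is placed before the existing 1

right = [1, 2]
insort_right(right, 1.0)  # placed after the existing 1

print([type(v).__name__ for v in left])   # ['float', 'int', 'int']
print([type(v).__name__ for v in right])  # ['int', 'float', 'int']
```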

.extend must check the new length will be <= PY_SSIZE_T_MAX

The following should throw an exception:

x = blist(range(10))
while True:
    x.extend(x)
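
A sketch of the kind of check meant here, written in Python for readability. The function name and the choice of MemoryError are assumptions; sys.maxsize is the Python-level counterpart of PY_SSIZE_T_MAX. The comparison is written as a subtraction so that the same logic would not overflow if ported to C.

```python
import sys


def checked_new_length(current_len, extra_len, limit=sys.maxsize):
    """Return current_len + extra_len, or raise if it would exceed limit."""
    # Compare via subtraction: in C, current_len + extra_len could overflow
    # before the comparison is made.
    if extra_len > limit - current_len:
        raise MemoryError("resulting list would exceed the maximum size")
    return current_len + extra_len


print(checked_new_length(10, 10))  # 20
```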

Make very small lists memory-efficient

Right now, a full node is allocated regardless of the list size. For very small lists (where the root is a leaf), we should mimic the behavior of Python's built-in list.

Add .clear() and .copy() methods, to match those added to Python's list in 3.2

Write documentation for all types and methods

Dynamically allocate space in the root node.

Right now, small lists allocate enough space for LIMIT children, even if only one or two are actually needed. For small lists (where the root is a leaf), the memory footprint is terrible compared to an array-based list. The root needs to be able to dynamically choose the number of children. However, it needs to be very careful because hitherto a root node has been treated as a subclass of a regular node. With this change, routines cannot assume that any node has space for LIMIT children.

Case insensitive sortedset

Would be great to see a case insensitive version of sortedset

Doing this now for sorting is easy because you have the key parameter (sortedlist(foo, key=lambda x: x.lower()))

However in those cases you also want to de-dupe in a case insensitive way, which it doesn't do.

If there's an easy way to accomplish this now, I'd love to hear it. The best I've come up with so far is:

class isortedset(sortedset):
    def __contains__(self, key):
        return key.lower() in (n.lower() for n in self)

but this breaks the nice time complexity, because every membership test now scans the whole set.
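
One possible direction, sketched with the standard library only: keep a dict keyed by the lowercased string for de-duplication and membership, plus a sorted list of those lowercased keys for ordering. CaseInsensitiveSortedSet is a hypothetical class, not part of blist, and the plain list used here makes insertion O(n); a blist-backed structure would presumably do better.

```python
from bisect import insort


class CaseInsensitiveSortedSet:
    """Hypothetical sketch: sorted, de-duplicated ignoring case."""

    def __init__(self, iterable=()):
        self._by_key = {}  # lowercased key -> first original spelling seen
        self._keys = []    # lowercased keys, kept sorted
        for item in iterable:
            self.add(item)

    def add(self, item):
        key = item.lower()
        if key not in self._by_key:  # O(1) duplicate check
            self._by_key[key] = item
            insort(self._keys, key)

    def __contains__(self, item):
        return item.lower() in self._by_key  # O(1) instead of a full scan

    def __iter__(self):
        return (self._by_key[key] for key in self._keys)

    def __len__(self):
        return len(self._keys)


names = CaseInsensitiveSortedSet(["b", "A", "a", "B", "c"])
print(list(names))  # ['A', 'b', 'c']
```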

Support Version Numbers, as described in PEP396

I have scripts to check that my servers have the correct installation of modules. Part of that check is to confirm the installed version of the modules is the same as what was tested against. It alerts me if new versions have been released so I can inspect the changes.

The blist module does not appear to share its own version number, making such checks more problematic.

PEP396 describes a standard mechanism for providing version numbers in Python.

Please consider adhering to PEP396 for future releases.
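
The kind of check this would enable looks roughly like the following sketch. The helper name is hypothetical, and a stand-in module object is used so the example runs without blist installed; PEP 396 itself only specifies the __version__ attribute.

```python
import types


def installed_version_matches(module, expected):
    """Return True if module exposes __version__ and it equals expected."""
    found = getattr(module, "__version__", None)
    return found is not None and found == expected


# Stand-in module for demonstration purposes only.
fake_module = types.ModuleType("fake_module")
fake_module.__version__ = "1.3.4"

print(installed_version_matches(fake_module, "1.3.4"))        # True
print(installed_version_matches(types.ModuleType("x"), "1"))  # False
```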

Segmentation Fault

I am trying to use a blist of size 4**20 and get a segmentation fault on the first read or write to the blist.
Using latest version from git.
Running on Mac OSX on x86_64 with latest Xcode/gcc install.
Commands to replicate error:

[abhik:/Downloads]$ git clone https://github.com/DanielStutzbach/blist.git
Cloning into blist...
remote: Counting objects: 1169, done.
remote: Compressing objects: 100% (523/523), done.
remote: Total 1169 (delta 673), reused 1111 (delta 639)
Receiving objects: 100% (1169/1169), 400.80 KiB | 502 KiB/s, done.
Resolving deltas: 100% (673/673), done.
[abhik:/Downloads/blist]$ python setup.py build_ext --inplace
Downloading http://pypi.python.org/packages/source/d/distribute/distribute-0.6.12.tar.gz
Extracting in /var/folders/Tv/Tv70U9x4EJ8RVIsTgph+fE+++TI/-Tmp-/tmprwLwFu
Now working in /var/folders/Tv/Tv70U9x4EJ8RVIsTgph+fE+++TI/-Tmp-/tmprwLwFu/distribute-0.6.12
Building a Distribute egg in /Users/abhik/Downloads/blist
/Users/abhik/Downloads/blist/distribute-0.6.12-py2.7.egg
running build_ext
building '_blist' extension
creating build
creating build/temp.macosx-10.6-x86_64-2.7
/usr/bin/gcc-4.2 -fno-strict-aliasing -fno-common -dynamic -pipe -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -DBLIST_FLOAT_RADIX_SORT=1 -I/opt/local/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c _blist.c -o build/temp.macosx-10.6-x86_64-2.7/_blist.o
creating build/lib.macosx-10.6-x86_64-2.7
/usr/bin/gcc-4.2 -L/opt/local/lib -bundle -undefined dynamic_lookup -L/opt/local/lib build/temp.macosx-10.6-x86_64-2.7/_blist.o -o build/lib.macosx-10.6-x86_64-2.7/_blist.so
copying build/lib.macosx-10.6-x86_64-2.7/_blist.so ->
[abhik:~/Downloads/blist]$ python
Python 2.7.1 (r271:86832, Mar 8 2011, 16:24:29)
[GCC 4.2.1 (Apple Inc. build 5664)] on darwin
Type "help", "copyright", "credits" or "license" for more information.

>>> import blist
>>> x = blist.blist([0]) * 4**20
>>> x[10]
Segmentation fault

Thanks!

Provide a sortedfrozenset type

Replace "leaf" with "height"

Height == 0 indicates a leaf
Height == 1 indicates the parent of a leaf
etc.

"height" uses the same memory footprint as "leaf" and requires the same number of CPU instructions to test for leafiness.

"height" will speed up a few operations slightly. Specifically, the extend operation would become O(|log n - log k|), an improvement over O(log n + log k). Many other operations internally use "extend" and would get a small constant-factor speedup.

Finally, "height" will allow for additional error-checking in debug builds, since we can check the following invariant: the height of any non-leaf node's children must equal the node's height minus one.
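
The invariant can be sketched in Python as follows. Node is a hypothetical simplified structure, not the C struct used by blist; for a leaf (height 0) the children are user items and are not inspected.

```python
class Node:
    """Hypothetical simplified tree node with an explicit height."""

    def __init__(self, height, children=()):
        self.height = height
        self.children = list(children)


def heights_are_consistent(node):
    """Check that every non-leaf's children have height == node.height - 1."""
    if node.height == 0:
        return True  # leaf: its children are user objects, nothing to check
    return all(
        child.height == node.height - 1 and heights_are_consistent(child)
        for child in node.children
    )


good = Node(1, [Node(0, [1, 2]), Node(0, [3])])
bad = Node(2, [Node(0, [1, 2])])  # child should have height 1
print(heights_are_consistent(good), heights_are_consistent(bad))  # True False
```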

Changing an existing key in a sorteddict

sorteddict.__setitem__ works by

self._sortedkeys.add(key)
self._map[key] = value

where _sortedkeys is a sortedset and _map is a dict. In the case where key is already in the sorteddict and you are merely changing its value, an operation that could be performed in constant time takes O(log n) time. Code like

if key not in self._map:
    self._sortedkeys.add(key)
self._map[key] = value

should improve the asymptotic performance in this case.
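
A self-contained sketch of the idea, where the sorted key structure is only touched for keys that are not yet present. TinySortedDict is a hypothetical class that uses a plain sorted list in place of sortedset; it is not the blist implementation.

```python
from bisect import insort


class TinySortedDict:
    """Hypothetical sketch: dict plus a sorted list of its keys."""

    def __init__(self):
        self._map = {}
        self._sortedkeys = []  # plain list standing in for a sortedset

    def __setitem__(self, key, value):
        if key not in self._map:           # O(1) hash lookup
            insort(self._sortedkeys, key)  # only new keys pay for insertion
        self._map[key] = value             # updating a value stays O(1)

    def __getitem__(self, key):
        return self._map[key]

    def keys(self):
        return list(self._sortedkeys)


d = TinySortedDict()
d[3] = "first"
d[1] = "second"
d[3] = "updated"  # existing key: the sorted list is not touched
print(d.keys(), d[3])  # [1, 3] updated
```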

Consider renaming classes to be consistent with PEP 8

The Style Guide for Python Code, PEP 8, says:

Almost without exception, class names use the CapWords convention.
Classes for internal use have a leading underscore in addition.

sorteddict.__repr__ should be sorted

Since one use case for sorted collections like sorteddict is to get reproducible behavior independent of hash randomization or the insertion order, it would be useful if sorteddict.__repr__ sorted the keys as well.

Test case:

class Collider(object):
  def __init__(self, x):
    self.x = x
  def __repr__(self):
    return 'Collider(%r)' % self.x
  def __eq__(self, other):
    return self.__class__ == other.__class__ and self.x == other.x
  def __cmp__(self, other):
    if self.__class__ != other.__class__:
      return NotImplemented
    return cmp(self.x, other.x)
  def __hash__(self):
    return 42

>>> blist.sorteddict({Collider(1): 1, Collider(2): 2})
sorteddict({Collider(1): 1, Collider(2): 2})
>>> blist.sorteddict({Collider(2): 2, Collider(1): 1})
sorteddict({Collider(2): 2, Collider(1): 1})
# expected: sorteddict({Collider(1): 1, Collider(2): 2})

The blist type should register as a MutableSequence

Allow popping from end of sortedlist

Currently you can only call sortedlist.pop() to get the first element of a list. I would like to be able to pop from the end as well as the beginning of a list.

Add a btuple type

A btuple is a read-only version of the blist and uses the same data structures and principally the same algorithms. It can be created efficiently from an existing blist and taking a slice of a btuple is fast (and returns a new btuple--not a blist). Unlike a blist, a btuple is hashable, allowing it to be used as a key for a dictionary.

bug in _sorteddict.KeysView

Example:

import blist
d = blist.sorteddict((1,1),(2,2))
e = blist.sorteddict((1,1),(3,3))
d.keys() & e.keys()

fails with AttributeError. The culprit is the following method in _sorteddict.KeysView:

def _from_iterable(cls, it):
    return sortedset(key=self._mapping._sortedkeys.key)

First, this method is probably meant to be @classmethod. Second, the call to sortedset needs it as the first argument. Third, _mapping is only available from an instance, which we don't have here.

After the (trivial) fix to address the first two issues, the code I showed above works fine.
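
The shape of that fix can be illustrated with the standard collections.abc.KeysView. This is a sketch, not the blist code: SortedKeysView is a hypothetical name, and a sorted list of unique keys stands in for the sortedset that the real method would build.

```python
from collections.abc import KeysView


class SortedKeysView(KeysView):
    """Hypothetical view whose set operations return sorted results."""

    @classmethod
    def _from_iterable(cls, it):
        # The iterable is now actually consumed. A sorted list of unique
        # keys stands in for sortedset(it).
        return sorted(set(it))


d = {1: 1, 2: 2}
e = {3: 3, 1: 1}
print(SortedKeysView(d) & SortedKeysView(e))  # [1]
print(SortedKeysView(d) | SortedKeysView(e))  # [1, 2, 3]
```

Because the method is a classmethod, it has no access to a particular mapping, which is exactly why a custom sort key cannot be carried over this way.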

However, if a sorteddict instance uses a custom sort key, it is lost. I don't know how to fix this. Possibly it shouldn't be fixed: what sort key should be used if the two sorteddicts have different sort keys? Using the one from the LHS is quite arbitrary, even if we could retrieve it. I really don't know what the right approach is here.

Add a sorteddict type

Analogous to the sortedset type. It works like a dict but .keys always returns the keys in sorted order. Under the hood, it stores both a hash table (to make lookups O(1)) and a blist.

blist.sorteddict should be lazy about modifying the sorted list

In some cases, adding and deleting the elements then calling .sort() may be more efficient than inserting as we go.

Python 3.3.1 Release For Windows / How To Build

Hi,

I think we need a MS Windows Installer for Python 3.3, which is the current release of Python. I'd like to know how to build the module on Windows, as well.

Thanks.

Appending to a list during iteration behaves differently than list()

Modifying a list during iteration is undefined by the Python documentation and generally inadvisable. Some people (including the 2to3 tool) rely on CPython's behavior for appending during iteration.

x = list(range(10))
for i in x:
    if i > 100: break
    x.append(i+1)
print(len(x))

A regular python list prints 939. With a blist, it prints 39 (as of blist 1.0.2).
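
CPython's list behavior can be modeled as index-based iteration that re-reads the current length on every step, so items appended during the loop are visited too. The generator below is a simplified model of that behavior (the name iterate_like_list is hypothetical), not the actual C implementation:

```python
def iterate_like_list(seq):
    """Index-based iteration that re-checks len(seq) on every step."""
    i = 0
    while i < len(seq):
        yield seq[i]
        i += 1


x = list(range(10))
for i in iterate_like_list(x):
    if i > 100:
        break
    x.append(i + 1)
print(len(x))  # 939, matching the built-in list
```

An iterator that snapshots the length, or caches a position inside a tree node, would stop earlier, which is consistent with the shorter result reported for blist.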

Issues on non-ix86 architectures

The three patches in my fork of blist address the following issues; a pull request has been made.

fastcall-ix86-only: this one actually just fixes some warnings: the
fastcall convention for GCC really only makes sense on ix86 (IA-32),
as on all our other supported platforms (x86_64, ppc, ppc64) the
default convention is already to put the first few arguments in
registers

remove-unsafe-tests: disable some tests that use out-of-range
integers: on non-ix86, they cause segmentation faults

use-upstream-macros: for some reason, Py_RETURN_TRUE and
Py_RETURN_FALSE are redefined, after they are already used in the file
(but with the same definition as the original). This patch removes
them

Make blist-nodes circular arrays

For a constant-factor performance improvement, each blist node could grow an integer "start" field indicating an offset to the first child. The children would be wrapped, so that for a blist with a LIMIT of 8, a start of 4, and 7 children, their order would be: 4, 5, 6, 7, 0, 1, 2.

That would dramatically decrease the cost of inserting near the beginning of a node, especially if LIMIT is large. For FIFO operation, blist should be able to soundly beat the list and compete head-to-head with the deque.
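
The index arithmetic behind this proposal is small. The following sketch uses hypothetical helper names and the LIMIT of 8 from the example above; it maps logical child positions to physical slots and shows how inserting at the front only moves "start":

```python
LIMIT = 8  # node capacity, taken from the example above


def physical_index(start, logical, limit=LIMIT):
    """Map the logical child position to its slot in the circular array."""
    return (start + logical) % limit


def start_after_prepend(start, limit=LIMIT):
    """Inserting at the front only moves start back by one slot."""
    return (start - 1) % limit


print([physical_index(4, i) for i in range(7)])  # [4, 5, 6, 7, 0, 1, 2]
print(start_after_prepend(0))                    # 7
```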

sortedlist should support an extend() operation

As far as I can tell, there is no way to concatenate two sortedlists. Neither the list-style extend() nor the set-style union() is present.

Add weakkeysorteddict, weakvaluesorteddict, weakkeyordereddict, weakvalueordereddict, and weakorderedset

First of all, blist is amazing. It has saved me a lot of time.

There are still some classes that are not implemented, for example weakkeyordereddict: like sorteddict, but instead of taking a key parameter, it uses the insertion order to determine the order.

Feel free to use my code if it helps with implementing the "ordered" data structures: (It would be amazing to have this internally within blist)
http://stackoverflow.com/questions/7828444/indexable-weak-ordered-set-in-python

If this happens (I will be ecstatic), it might also be worth exposing all of blist's dictionaries under one factory class as in:
http://stackoverflow.com/questions/6602816/why-arent-python-dicts-unified

blist indexing is not working

The code:

import blist
d = blist.sorteddict({3: "first", 7: "second"})
print(d.keys()[1])
print(d.values()[1])
print(d.items()[1])

gives the error

Traceback (most recent call last):
  File "test.py", line 4, in <module>
    print(d.values()[1])
  File "/opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/_sorteddict.py", line 53, in __getitem__
    key = self._mapping.sortedkeys[index]
AttributeError: 'sorteddict' object has no attribute 'sortedkeys'

When this is fixed by adding the missing underscore to the name, another error pops up for the attribute "key". When that is fixed, it seems that the ItemsView isn't indexing properly.

class MyClass(object):
    def __radd__(self, other):
        return 7

If I create a new MyClass object, I get:

>>> c = MyClass()
>>> () + MyClass()
7
>>> [] + MyClass()
7
>>> blist.blist() + MyClass()
7
>>> blist.btuple() + MyClass()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.6/dist-packages/blist-1.3.4-py2.6-linux-i686.egg/_btuple.py", line 47, in __add__
    raise TypeError
TypeError

This happens because the __add__ method raises TypeError instead of returning NotImplemented, so the other object's __radd__ is not called.
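
The difference can be shown with two hypothetical stand-in classes (neither is the btuple code): one raises TypeError directly, the other returns NotImplemented so that Python falls back to the right-hand operand's __radd__.

```python
class MyClass:
    def __radd__(self, other):
        return 7


class RaisingTuple:
    """Hypothetical stand-in that raises, as in the report above."""

    def __add__(self, other):
        if not isinstance(other, RaisingTuple):
            raise TypeError
        return RaisingTuple()


class CooperativeTuple:
    """Hypothetical stand-in that defers to the other operand."""

    def __add__(self, other):
        if not isinstance(other, CooperativeTuple):
            return NotImplemented  # lets Python try other.__radd__
        return CooperativeTuple()


print(CooperativeTuple() + MyClass())  # 7
```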

For certain sequences of operations, getitem is amortized Theta(log n) instead of Theta(1)

L.insert(0, item)
del L[0]
for i in range(len(L)//2):
    L[i]
