danielstutzbach / blist
A list-like type with better asymptotic performance and similar performance on small lists
License: Other
They currently fall back to the default object implementation, returning a value that is too small.
When subclassing sorteddict, I defined a __slots__ attribute and realized that it was futile: although sorteddict has __slots__, it is meaningless because some superclasses (from _abcoll ...) do not. Some other collections don't have __slots__ at all. It would be nice to be able to make use of __slots__ (sometimes there are plenty of collection instances, and that's my case).
Python 3.1.2 (release31-maint, Sep 26 2010, 13:51:01)
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from blist import *
>>> bl = blist()
>>> bl.spam = 3
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: '_blist.blist' object has no attribute 'spam'
>>>
>>> bt = btuple()
>>> bt.spam = 3
>>> btuple.__slots__
['_hash', '_blist']
>>>
>>> sd = sorteddict()
>>> sd.spam = 3
>>> sorteddict.__slots__
['_sortedkeys', '_map']
>>>
>>> sl = sortedlist()
>>> sl.spam = 3
>>> sl.__slots__
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'sortedlist' object has no attribute '__slots__'
>>>
>>> s = sortedset()
>>> s.spam = 3
>>> sortedset.__slots__
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: type object 'sortedset' has no attribute '__slots__'
>>>
>>> wsl = weaksortedlist()
>>> wsl.spam = 3
>>> wsl.__slots__
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'weaksortedlist' object has no attribute '__slots__'
>>>
>>> ws = weaksortedset()
>>> ws.spam = 3
>>> ws.__slots__
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'weaksortedset' object has no attribute '__slots__'
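The behaviour in the session above can be reproduced without blist. The following is a minimal sketch, using hypothetical class names, of why __slots__ on a subclass has no effect when any base class lacks it: the base class still contributes a per-instance __dict__.

```python
class PlainBase:
    """Hypothetical base class that does not define __slots__."""


class SlottedChild(PlainBase):
    # __slots__ here cannot remove the __dict__ inherited from PlainBase.
    __slots__ = ("value",)


class SlottedBase:
    """Hypothetical base class that does define (empty) __slots__."""
    __slots__ = ()


class FullySlotted(SlottedBase):
    __slots__ = ("value",)


child = SlottedChild()
child.spam = 3  # allowed: PlainBase provides a per-instance __dict__

full = FullySlotted()
try:
    full.spam = 3  # rejected: no class in the hierarchy provides __dict__
except AttributeError:
    blocked = True
else:
    blocked = False
```

Under this reading, __slots__ would only save memory for the blist collections if every class in their inheritance chain declared it.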
B+ trees are often used to answer the question "which items satisfy p >= start and p < stop?" I reckon this should be possible with the blist implementation, but it does not seem to be possible via the exposed APIs.
The use case is using blist to build a database indexing engine.
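As an illustration of the query being asked for, here is a minimal sketch on a plain sorted Python list using the standard bisect module. The function name range_query is an assumption for this example and is not part of blist.

```python
from bisect import bisect_left


def range_query(sorted_items, start, stop):
    """Return the items p with start <= p < stop from an already sorted list."""
    lo = bisect_left(sorted_items, start)  # first index with item >= start
    hi = bisect_left(sorted_items, stop)   # first index with item >= stop
    return sorted_items[lo:hi]


index = [1, 3, 3, 5, 8, 13]
result = range_query(index, 3, 8)  # [3, 3, 5]
```

Both boundary searches take O(log n) comparisons; the cost of the slice depends on the container.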
sortedlist should support a peek() operation so the next value to be pop()ed can be previewed before popping
I assume this is categorized as list compatibility.
class CustomList(list):
    def __div__(self, count):
        return 5
>>> l = CustomList()
>>> l/1
5
class CustomBlist(blist.blist):
    def __div__(self, count):
        return 4
>>> l = CustomBlist()
>>> l/1
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for /: 'CustomBlist' and 'int'
A C implementation of sorted* could give O(log n) add and remove methods. It remains to be seen if that would really boost performance, as the number of comparisons is already O(log n) and the tree can't be that tall anyway -- max height is now 16 in the C source.
Right now, the blist has to perform a lot of checking after each comparison operation to make sure the comparison function did not permute the list. We could save a lot of time by making an O(1) copy first.
I'm not sure if that would pay off for small lists (where the root is also a leaf), but it might.
For long lists, it should be a clear win. We should probably be careful to avoid starting the cache in the root which would use O(n) memory.
It should copy the internal blist structure, making it very fast when copy-on-write is being heavily used. Observe how list() handles it:
>>> import copy
>>> class blah:
...     pass
...
>>> x = blah()
>>> y = [x,x]
>>> z = copy.deepcopy(y)
>>> id(z[0]), id(z[1])
(2146287308, 2146287308)
>>> id(x)
2146287244
Sorting performance can be inferior to that of the standard Python list (based on an example of timsort performance):
>>> import blist
>>> l = [i for i in range(1048576)]
>>> for i in range(3):
...     import random
...     r1 = random.randint(0,len(l)-1)
...     r2 = random.randint(0,len(l)-1)
...     l[r1],l[r2] = l[r2],l[r1]
...
>>> bl = blist.blist(l)
>>> def t(l):
...     import time
...     before = time.time()
...     l.sort()
...     return time.time() - before
...
>>> t(l)
0.04699993133544922
>>> t(bl)
0.28100013732910156
In [11]: d = blist.sorteddict()
In [12]: d[1] = 2
In [13]: d.items()
Out[13]: ItemsView(sorteddict({1: 2}))
AttributeError Traceback (most recent call last)
in ()
----> 1 d.items()[0]
/usr/local/lib/python3.3/dist-packages/blist-1.3.4-py3.3-linux-x86_64.egg/_sorteddict.py in getitem(self, index)
30 return self._from_iterable((key, self._mapping[key])
31 for key in keys)
---> 32 key = self._mapping.sortedkeys[index]
33 return (key, self._mapping[key])
34 def index(self, item):
AttributeError: 'sorteddict' object has no attribute 'sortedkeys'
This bug has also been reported on Stack Overflow: http://stackoverflow.com/questions/11285312/indexing-the-valuesview-of-a-sorteddict
For read and delete operations, works just like a list.
For insert operations, works just like a set.
The constructor should take an optional "key" function.
Ubuntu 12.04
$ sudo pip install btree
Downloading/unpacking btree
Running setup.py egg_info for package btree
Installing collected packages: btree
Running setup.py install for btree
btree.c:35:19: fatal error: btree.h: No such file or directory
compilation terminated.
building 'btree' extension
gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -I/usr/include/python2.7 -c btree.c -o build/temp.linux-i686-2.7/btree.o
Traceback (most recent call last):
File "", line 1, in
File "/home/username/build/btree/setup.py", line 7, in
paver.tasks.main()
File "paver-minilib.zip/paver/tasks.py", line 621, in main
File "paver-minilib.zip/paver/tasks.py", line 604, in _launch_pavement
File "paver-minilib.zip/paver/tasks.py", line 569, in _process_commands
File "paver-minilib.zip/paver/setuputils.py", line 146, in call
File "/usr/lib/python2.7/distutils/dist.py", line 972, in run_command
cmd_obj.run()
File "/usr/lib/python2.7/dist-packages/setuptools/command/install.py", line 53, in run
return _install.run(self)
File "/usr/lib/python2.7/distutils/command/install.py", line 601, in run
self.run_command('build')
File "/usr/lib/python2.7/distutils/cmd.py", line 326, in run_command
self.distribution.run_command(command)
File "/usr/lib/python2.7/distutils/dist.py", line 972, in run_command
cmd_obj.run()
File "/usr/lib/python2.7/distutils/command/build.py", line 128, in run
self.run_command(cmd_name)
File "/usr/lib/python2.7/distutils/cmd.py", line 326, in run_command
self.distribution.run_command(command)
File "/usr/lib/python2.7/distutils/dist.py", line 972, in run_command
cmd_obj.run()
File "/usr/lib/python2.7/dist-packages/setuptools/command/build_ext.py", line 46, in run
_build_ext.run(self)
File "/usr/lib/python2.7/distutils/command/build_ext.py", line 339, in run
self.build_extensions()
File "/usr/lib/python2.7/distutils/command/build_ext.py", line 448, in build_extensions
self.build_extension(ext)
File "/usr/lib/python2.7/dist-packages/setuptools/command/build_ext.py", line 182, in build_extension
_build_ext.build_extension(self,ext)
File "/usr/lib/python2.7/distutils/command/build_ext.py", line 498, in build_extension
depends=ext.depends)
File "/usr/lib/python2.7/distutils/ccompiler.py", line 572, in compile
self._compile(obj, src, ext, cc_args, extra_postargs, pp_opts)
File "/usr/lib/python2.7/distutils/unixccompiler.py", line 180, in _compile
raise CompileError, msg
distutils.errors.CompileError: command 'gcc' failed with exit status 1
Command /usr/bin/python -c "import setuptools;file='/home/username/build/btree/setup.py';exec(compile(open(file).read().replace('\r\n', '\n'), file, 'exec'))" install --single-version-externally-managed --record /tmp/pip-NmMbGC-record/install-record.txt failed with error code 1
Storing complete log in /home/username/.pip/pip.log
Although sortedlist provides methods to obtain the appropriate index for inserting a new element in pre-order and post-order (a.k.a. bisect_left and bisect_right), it is impossible to use those methods to insert a new element at the desired ordered position, because the 'add' method only supports post-order. To achieve this at present, you need to access protected attributes and methods, which is not very convenient.
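To make the request concrete, here is a minimal sketch of the desired behaviour on a plain Python list. The helper insert_sorted and its side parameter are hypothetical and only illustrate the difference between inserting before ("left") and after ("right") elements whose keys compare equal.

```python
from bisect import bisect_left, bisect_right


def insert_sorted(items, item, key=lambda x: x, side="right"):
    """Insert item into a key-sorted list, before or after equal keys."""
    # Recomputing the keys is O(n); a real sorted container would avoid this.
    keys = [key(x) for x in items]
    find = bisect_left if side == "left" else bisect_right
    items.insert(find(keys, key(item)), item)


words = ["a", "b", "c"]
insert_sorted(words, "B", key=str.lower, side="left")   # goes before "b"
insert_sorted(words, "A", key=str.lower, side="right")  # goes after "a"
```

After both calls, words is ["a", "A", "B", "b", "c"].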
The following should throw an exception:
x = blist(range(10))
while True:
    x.extend(x)
Right now, a full node is allocated regardless of the list size. For very small lists (where the root is a leaf), we should mimic the behavior of Python's built-in list.
Right now, small lists allocate enough space for LIMIT children, even if only one or two are actually needed. For small lists (where the root is a leaf), the memory footprint is terrible compared to an array-based list. The root needs to be able to dynamically choose the number of children. However, it needs to be very careful because hitherto a root node has been treated as a subclass of a regular node. With this change, routines cannot assume that any node has space for LIMIT children.
It would be great to see a case-insensitive version of sortedset.
Doing this for sorting is easy now because you have the key parameter (sortedlist(foo, key=lambda x: x.lower())).
However, in those cases you also want to de-dupe in a case-insensitive way, which it doesn't do.
If there's an easy way to accomplish this now, I'd love to hear it. The best I've come up with so far is:
class isortedset(sortedset):
    def __contains__(self, key):
        return key.lower() in (n.lower() for n in self)
but this breaks the nice time complexity.
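One possible direction, sketched here with the standard library only, is to keep a dict from the lowercased key to the original spelling alongside a sorted list of lowercased keys. The class name CaseInsensitiveSortedSet is hypothetical; this is not an existing blist type, and the plain list used here makes insertion O(n), which a blist-backed sortedlist might improve on.

```python
from bisect import insort


class CaseInsensitiveSortedSet:
    """Hypothetical sketch: sorted strings, de-duplicated ignoring case."""

    def __init__(self, iterable=()):
        self._folded = []    # lowercased keys, kept in sorted order
        self._original = {}  # lowercased key -> first spelling seen
        for item in iterable:
            self.add(item)

    def add(self, item):
        folded = item.lower()
        if folded not in self._original:  # O(1) duplicate check
            insort(self._folded, folded)  # O(n) on a plain list
            self._original[folded] = item

    def __contains__(self, item):
        return item.lower() in self._original  # O(1), no linear scan

    def __iter__(self):
        return (self._original[k] for k in self._folded)

    def __len__(self):
        return len(self._folded)


names = CaseInsensitiveSortedSet(["banana", "Apple", "BANANA", "cherry"])
```

Membership tests use the dict, so they no longer scan the whole collection as the __contains__ override above does.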
I have scripts to check that my servers have the correct installation of modules. Part of that check is to confirm the installed version of the modules is the same as what was tested against. It alerts me if new versions have been released so I can inspect the changes.
The blist module does not appear to share its own version number, making such checks more problematic.
PEP396 describes a standard mechanism for providing version numbers in Python.
Please consider adhering to PEP396 for future releases.
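A minimal sketch of the kind of check described above, assuming the module exposes a module-level __version__ string as PEP 396 proposes. The helper names and the fake module are hypothetical and used only for illustration.

```python
import types


def installed_version(module):
    """Return the module's __version__ string, or None if it has none."""
    return getattr(module, "__version__", None)


def check_version(module, expected):
    """True when the module reports exactly the version that was tested."""
    return installed_version(module) == expected


# Stand-in module for the example; a real check would pass an imported module.
fake = types.ModuleType("fake_pkg")
fake.__version__ = "1.3.4"
matches = check_version(fake, "1.3.4")
```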
Trying to use a blist of size 4**20 and I get segmentation fault upon first read/write to blist.
Using latest version from git.
Running on Mac OSX on x86_64 with latest Xcode/gcc install.
Commands to replicate error:
[abhik:~/Downloads]$ git clone https://github.com/DanielStutzbach/blist.git
Cloning into blist...
remote: Counting objects: 1169, done.
remote: Compressing objects: 100% (523/523), done.
remote: Total 1169 (delta 673), reused 1111 (delta 639)
Receiving objects: 100% (1169/1169), 400.80 KiB | 502 KiB/s, done.
Resolving deltas: 100% (673/673), done.
[abhik:~/Downloads/blist]$ python setup.py build_ext --inplace
Downloading http://pypi.python.org/packages/source/d/distribute/distribute-0.6.12.tar.gz
Extracting in /var/folders/Tv/Tv70U9x4EJ8RVIsTgph+fE+++TI/-Tmp-/tmprwLwFu
Now working in /var/folders/Tv/Tv70U9x4EJ8RVIsTgph+fE+++TI/-Tmp-/tmprwLwFu/distribute-0.6.12
Building a Distribute egg in /Users/abhik/Downloads/blist
/Users/abhik/Downloads/blist/distribute-0.6.12-py2.7.egg
running build_ext
building '_blist' extension
creating build
creating build/temp.macosx-10.6-x86_64-2.7
/usr/bin/gcc-4.2 -fno-strict-aliasing -fno-common -dynamic -pipe -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -DBLIST_FLOAT_RADIX_SORT=1 -I/opt/local/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c _blist.c -o build/temp.macosx-10.6-x86_64-2.7/_blist.o
creating build/lib.macosx-10.6-x86_64-2.7
/usr/bin/gcc-4.2 -L/opt/local/lib -bundle -undefined dynamic_lookup -L/opt/local/lib build/temp.macosx-10.6-x86_64-2.7/_blist.o -o build/lib.macosx-10.6-x86_64-2.7/_blist.so
copying build/lib.macosx-10.6-x86_64-2.7/_blist.so ->
[abhik:~/Downloads/blist]$ python
Python 2.7.1 (r271:86832, Mar 8 2011, 16:24:29)
[GCC 4.2.1 (Apple Inc. build 5664)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import blist
>>> x = blist.blist([0]) * 4**20
>>> x[10]
Segmentation fault
Thanks!
Height == 0 indicates a leaf
Height == 1 indicates the parent of a leaf
etc.
"height" uses the same memory footprint as "leaf" and requires the same number of CPU instructions to test for leafiness.
"height" will speed up a few operations slightly. Specifically, the extend operation would become O(|log n - log k|), an improvement over O(log n + log k). Many other operations internally use "extend" and would get a small constant-factor speedup.
Finally, "height" will allow for additional error-checking in debug builds, since we can check the following invariant: the height of any non-leaf node's children must equal the node's height minus one.
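The debug-build invariant mentioned above could look roughly like the following sketch. The Node class is a hypothetical stand-in for the C node structure, not the actual blist implementation.

```python
class Node:
    """Hypothetical tree node: height 0 is a leaf, 1 is a parent of leaves."""

    def __init__(self, height, children=()):
        self.height = height
        self.children = list(children)


def check_heights(node):
    """Check that every child of a non-leaf node has height == node.height - 1."""
    if node.height == 0:
        return True  # a leaf holds user items, so there is nothing to check
    return all(
        child.height == node.height - 1 and check_heights(child)
        for child in node.children
    )


good = Node(2, [Node(1, [Node(0), Node(0)]), Node(1, [Node(0)])])
bad = Node(2, [Node(1, [Node(0)]), Node(0)])  # a leaf directly under height 2
```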
sorteddict.__setitem__ works by
self._sortedkeys.add(key)
self._map[key] = value
where _sortedkeys is a sortedset and _map is a dict. In the case where key is already in the sorteddict and you are merely changing its value, an operation that could be performed in constant time takes O(log n) time. Code like
if key not in self._map:
    self._sortedkeys.add(key)
self._map[key] = value
should improve the asymptotic performance in this case.
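The proposed guard can be sketched with a plain sorted list standing in for the sortedset. TinySortedDict and its key_insertions counter are hypothetical names, added only to show that updating an existing key skips the sorted-structure insertion.

```python
from bisect import insort


class TinySortedDict:
    """Hypothetical sketch of the guarded __setitem__ described above."""

    def __init__(self):
        self._sortedkeys = []    # stand-in for the sortedset of keys
        self._map = {}
        self.key_insertions = 0  # counts how often the sorted keys were touched

    def __setitem__(self, key, value):
        if key not in self._map:  # O(1) membership test on the dict
            insort(self._sortedkeys, key)
            self.key_insertions += 1
        self._map[key] = value

    def __getitem__(self, key):
        return self._map[key]

    def keys(self):
        return list(self._sortedkeys)


d = TinySortedDict()
d[2] = "b"
d[1] = "a"
d[2] = "B"  # existing key: only the dict is updated
```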
The Style Guide for Python Code, PEP 8, says:
Almost without exception, class names use the CapWords convention.
Classes for internal use have a leading underscore in addition.
Since one use case for sorted collections like sorteddict is to get reproducible behavior independent of hash randomization or the insertion order, it would be useful if sorteddict.__repr__ sorted the keys as well.
Test case:
class Collider(object):
    def __init__(self, x):
        self.x = x
    def __repr__(self):
        return 'Collider(%r)' % self.x
    def __eq__(self, other):
        return self.__class__ == other.__class__ and self.x == other.x
    def __cmp__(self, other):
        if self.__class__ != other.__class__:
            return NotImplemented
        return cmp(self.x, other.x)
    def __hash__(self):
        return 42
>>> blist.sorteddict({Collider(1): 1, Collider(2): 2})
sorteddict({Collider(1): 1, Collider(2): 2})
>>> blist.sorteddict({Collider(2): 2, Collider(1): 1})
sorteddict({Collider(2): 2, Collider(1): 1})
# expected: sorteddict({Collider(1): 1, Collider(2): 2})
Currently you can only call sortedlist.pop() to get the first element of a list. I would like to be able to pop from the end as well as the beginning of a list.
A btuple is a read-only version of the blist and uses the same data structures and principally the same algorithms. It can be created efficiently from an existing blist and taking a slice of a btuple is fast (and returns a new btuple--not a blist). Unlike a blist, a btuple is hashable, allowing it to be used as a key for a dictionary.
Example:
import blist
d = blist.sorteddict((1,1),(2,2))
e = blist.sorteddict((1,1),(3,3))
d.keys() & e.keys()
fails with AttributeError. The culprit is the following method in _sorteddict.KeysView:
def _from_iterable(cls, it):
    return sortedset(key=self._mapping._sortedkeys.key)
First, this method is probably meant to be a @classmethod. Second, the call to sortedset needs the iterable it as the first argument. Third, _mapping is only available from an instance, which we don't have here.
After the (trivial) fix to address the first two issues, the code I showed above works fine.
However, if the sorteddict instance uses a custom sort key, it is lost. I don't know how to fix this. Possibly it shouldn't be fixed: what sort key should be used if the two sorteddicts have different sort keys? Using the one from the LHS is quite arbitrary, even if we could retrieve it. I really don't know what the right approach is here.
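For reference, the standard library's set mixins call _from_iterable to build the result of operators such as & and |. The sketch below shows the first two fixes on a hypothetical SortedKeysView built on collections.abc.KeysView, with a sorted list standing in for sortedset(it). It does not attempt to solve the custom sort key question.

```python
from collections.abc import KeysView


class SortedKeysView(KeysView):
    """Hypothetical view illustrating a classmethod _from_iterable."""

    @classmethod
    def _from_iterable(cls, it):
        # Stand-in for sortedset(it): de-duplicate, then sort.
        # No instance is available here, so no per-mapping key can be used.
        return sorted(set(it))


d = {1: 1, 2: 2}
e = {1: 1, 3: 3}
common = SortedKeysView(d) & SortedKeysView(e)  # built via _from_iterable
either = SortedKeysView(d) | SortedKeysView(e)
```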
Analogous to the sortedset type. It works like a dict but .keys always returns the keys in sorted order. Under the hood, it stores both a hash table (to make lookups O(1)) and a blist.
In some cases, adding and deleting the elements then calling .sort() may be more efficient than inserting as we go.
Hi,
I think we need a MS Windows Installer for Python 3.3, which is the current release of Python. I'd like to know how to build the module on Windows, as well.
Thanks.
Modifying a list during iteration is undefined by the Python documentation and generally inadvisable. Some people (including the 2to3 tool) rely on CPython's behavior for appending during iteration.
x = list(range(10))
for i in x:
    if i > 100: break
    x.append(i+1)
print len(x)
A regular python list prints 939. With a blist, it prints 39 (as of blist 1.0.2).
The three patches in my fork of blist address the following issues; a pull request has been made.
fastcall-ix86-only: this one actually just fixes some warnings: the
fastcall convention for GCC really only makes sense on ix86 (IA-32),
as on all our other supported platforms (x86_64, ppc, ppc64) the
default convention is already to put the first few arguments in
registers
remove-unsafe-tests: disable some tests that use out-of-range
integers: on non-ix86, they cause segmentation faults
use-upstream-macros: for some reason, Py_RETURN_TRUE and
Py_RETURN_FALSE are redefined, after they are already used in the file
(but with the same definition as the original). This patch removes
them
For a constant-factor performance improvement, each blist node could grow an integer "start" field indicating an offset to the first child. The children would be wrapped, so that for a blist with a LIMIT of 8, a start of 4, and 7 children, their order would be: 4, 5, 6, 7, 0, 1, 2.
That would dramatically decrease the cost of inserting near the beginning of a node, especially if LIMIT is large. For FIFO operation, blist should be able to soundly beat the list and compete head-to-head with the deque.
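The wrapped layout described above amounts to modular index arithmetic. The following is a minimal sketch with hypothetical helper names; it models only the index mapping, not the node itself.

```python
LIMIT = 8  # maximum number of children per node, as in the example above


def physical_order(start, num_children, limit=LIMIT):
    """Physical slots used by logical children 0..num_children-1."""
    return [(start + i) % limit for i in range(num_children)]


def physical_slot(start, logical_index, limit=LIMIT):
    """Map a logical child index to its slot in the wrapped array."""
    return (start + logical_index) % limit


def start_after_prepend(start, limit=LIMIT):
    """Inserting at the front only moves start back by one slot."""
    return (start - 1) % limit


order = physical_order(4, 7)  # [4, 5, 6, 7, 0, 1, 2]
```

Prepending then changes one field instead of shifting every child, which is where the saving near the beginning of a node would come from.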
As far as I can tell, there is no way to concatenate two sortedlists. Neither the list-style extend() nor the set-style union() is present.
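As a point of comparison, two already sorted sequences can be combined in linear time with the standard library's heapq.merge. This is only a sketch of the requested operation on plain lists; the function name concat_sorted is hypothetical and not a blist API.

```python
from heapq import merge


def concat_sorted(a, b, key=None):
    """Merge two sorted iterables into one sorted list."""
    return list(merge(a, b, key=key))


combined = concat_sorted([1, 4, 9], [2, 4, 10])
```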
First of all, blist is amazing. It has saved me a lot of time.
There are still some classes that are not implemented, for example weakkeyordereddict: like sorteddict, but instead of taking a key parameter, it uses the insertion order to determine the order.
Feel free to use my code if it helps with implementing the "ordered" data structures: (It would be amazing to have this internally within blist)
http://stackoverflow.com/questions/7828444/indexable-weak-ordered-set-in-python
If this happens (I will be ecstatic), it might also be worth exposing all of blist's dictionaries under one factory class as in:
http://stackoverflow.com/questions/6602816/why-arent-python-dicts-unified
The code:
import blist
d = blist.sorteddict({3: "first", 7: "second"})
print(d.keys()[1])
print(d.values()[1])
print(d.items()[1])
gives the error
Traceback (most recent call last):
File "test.py", line 4, in <module>
print(d.values()[1])
File "/opt/local/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/_sorteddict.py", line 53, in __getitem__
key = self._mapping.sortedkeys[index]
AttributeError: 'sorteddict' object has no attribute 'sortedkeys'
When this is fixed by adding the missing underscore to the name, another error pops up for the attribute "key". When that is fixed, it seems that the ItemsView isn't indexing properly.
PEP396 describes how to specify version numbers in packages.
Having such version numbers would allow programmatic checks that the blist version is what is expected by the app.
Unlike the rest of the blist API, you cannot pass a custom key function to the sorteddict constructor using a keyword argument. It fails because it takes key to be an item in the dict: sorteddict(key=str.lower)
The example of correct usage, sorteddict(str.lower), is mentioned in the docs. But the docs do not say that the key must always be positional and never specified as a keyword. I would suggest adding a note to that effect.
sortedset describes index as taking linear time; it should describe it as logarithmic time, right?
I'd like to know the index at which an item would be inserted - similar to bisect_left - I can't find a way to do this without getting ValueError.
I want to add some functionality using __radd__.
class MyClass(object):
    def __radd__(self, other):
        return 7
If I create a new MyClass object, I get:
>>> c = MyClass()
>>> () + MyClass()
7
>>> [] + MyClass()
7
>>> blist.blist() + MyClass()
7
>>> blist.btuple() + MyClass()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.6/dist-packages/blist-1.3.4-py2.6-linux-i686.egg/_btuple.py", line 47, in __add__
raise TypeError
TypeError
This happens because the __add__ method raises TypeError instead of returning NotImplemented, so the other object's __radd__ is not called.
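The difference can be shown without blist. In this sketch the class names are hypothetical: StrictSeq mimics an __add__ that raises, while PoliteSeq returns NotImplemented so that Python falls back to the right-hand operand's __radd__.

```python
class StrictSeq:
    def __add__(self, other):
        # Raising here stops the operation before __radd__ is ever tried.
        raise TypeError("unsupported operand")


class PoliteSeq:
    def __add__(self, other):
        if not isinstance(other, PoliteSeq):
            return NotImplemented  # lets Python try other.__radd__(self)
        return PoliteSeq()


class Seven:
    def __radd__(self, other):
        return 7


polite_result = PoliteSeq() + Seven()  # falls back to Seven.__radd__

try:
    StrictSeq() + Seven()
except TypeError:
    strict_failed = True
else:
    strict_failed = False
```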
L.insert(0, item)
del L[0]
for i in range(len(L)//2):
    L[i]