ascv / hyperloglog
Fast HyperLogLog for Python.
License: MIT License
I'm running on Mac OS X 10.10.1 and had an issue importing the package after installing:
Python 2.7.8 (default, Dec 10 2014, 14:56:51)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.56)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import HLL
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: dlopen(/usr/local/lib/python2.7/site-packages/HLL.so, 2): Symbol not found: _PyInt_AS_LONG
Referenced from: /usr/local/lib/python2.7/site-packages/HLL.so
Expected in: flat namespace
in /usr/local/lib/python2.7/site-packages/HLL.so
Output from installation:
λ pip install HLL
Downloading/unpacking HLL
Downloading HLL-0.833.tar.gz
Running setup.py (path:/private/var/folders/km/n8n4rfqs3_sbcrt7zd8bvyww0000gn/T/pip_build_jmalina/HLL/setup.py) egg_info for package HLL
Installing collected packages: HLL
Running setup.py install for HLL
building 'HLL' extension
clang -fno-strict-aliasing -fno-common -dynamic -I/usr/local/include -I/usr/local/opt/sqlite/include -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.10.sdk -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.10.sdk/System/Library/Frameworks/Tk.framework/Versions/8.5/Headers -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/usr/local/Cellar/python/2.7.8_2/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c hll.c -o build/temp.macosx-10.10-x86_64-2.7/hll.o
hll.c:198:26: warning: implicit declaration of function 'PyByteArray_AsString' is invalid in C99 [-Wimplicit-function-declaration]
char *hllRegisters = PyByteArray_AsString(hllByteArray);
^
hll.c:198:11: warning: incompatible integer to pointer conversion initializing 'char *' with an expression of type 'int' [-Wint-conversion]
char *hllRegisters = PyByteArray_AsString(hllByteArray);
^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
hll.c:218:17: warning: implicit declaration of function 'PyByteArray_FromStringAndSize' is invalid in C99 [-Wimplicit-function-declaration]
registers = PyByteArray_FromStringAndSize(self->registers, self->size);
^
hll.c:218:15: warning: incompatible integer to pointer conversion assigning to 'PyObject *' (aka 'struct _object *') from 'int' [-Wint-conversion]
registers = PyByteArray_FromStringAndSize(self->registers, self->size);
^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
hll.c:234:15: warning: comparison of unsigned expression < 0 is always false [-Wtautological-compare]
if (index < 0) {
~~~~~ ^ ~
hll.c:252:14: warning: comparison of unsigned expression < 0 is always false [-Wtautological-compare]
if (rank < 0) {
~~~~ ^ ~
6 warnings generated.
clang -fno-strict-aliasing -fno-common -dynamic -I/usr/local/include -I/usr/local/opt/sqlite/include -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.10.sdk -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.10.sdk/System/Library/Frameworks/Tk.framework/Versions/8.5/Headers -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/usr/local/Cellar/python/2.7.8_2/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c murmur3.c -o build/temp.macosx-10.10-x86_64-2.7/murmur3.o
murmur3.c:21:15: warning: duplicate 'inline' declaration specifier [-Wduplicate-decl-specifier]
static inline FORCE_INLINE uint32_t rotl32 ( uint32_t x, int8_t r )
^
murmur3.c:16:53: note: expanded from macro 'FORCE_INLINE'
#define FORCE_INLINE __attribute__((always_inline)) inline
^
murmur3.c:26:15: warning: duplicate 'inline' declaration specifier [-Wduplicate-decl-specifier]
static inline FORCE_INLINE uint64_t rotl64 ( uint64_t x, int8_t r )
^
murmur3.c:16:53: note: expanded from macro 'FORCE_INLINE'
#define FORCE_INLINE __attribute__((always_inline)) inline
^
murmur3.c:45:15: warning: duplicate 'inline' declaration specifier [-Wduplicate-decl-specifier]
static inline FORCE_INLINE uint32_t fmix32 ( uint32_t h )
^
murmur3.c:16:53: note: expanded from macro 'FORCE_INLINE'
#define FORCE_INLINE __attribute__((always_inline)) inline
^
murmur3.c:58:15: warning: duplicate 'inline' declaration specifier [-Wduplicate-decl-specifier]
static inline FORCE_INLINE uint64_t fmix64 ( uint64_t k )
^
murmur3.c:16:53: note: expanded from macro 'FORCE_INLINE'
#define FORCE_INLINE __attribute__((always_inline)) inline
^
murmur3.c:26:37: warning: unused function 'rotl64' [-Wunused-function]
static inline FORCE_INLINE uint64_t rotl64 ( uint64_t x, int8_t r )
^
murmur3.c:58:37: warning: unused function 'fmix64' [-Wunused-function]
static inline FORCE_INLINE uint64_t fmix64 ( uint64_t k )
^
6 warnings generated.
clang -bundle -undefined dynamic_lookup -L/usr/local/lib -L/usr/local/opt/sqlite/lib -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.10.sdk build/temp.macosx-10.10-x86_64-2.7/hll.o build/temp.macosx-10.10-x86_64-2.7/murmur3.o -o build/lib.macosx-10.10-x86_64-2.7/HLL.so
Successfully installed HLL
Cleaning up...
This will allow the user to set the initial array of ranks and merge cardinality estimators.
What kind of accuracy should I expect? Even using k=16 I get very poor results (error ~45%) after adding 1024 items:
$ cat hll.py
from HLL import HyperLogLog
hll = HyperLogLog(16)
for x in range(1024):
    hll.add(str(x))
print hll.cardinality()
$ python hll.py
1481.65524266
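For reference, here is what the HyperLogLog paper's error bound predicts at this precision (a pure-stdlib computation; that `HyperLogLog(16)` means m = 2^16 registers is an assumption):

```python
import math

# HyperLogLog's relative standard error is 1.04 / sqrt(m), where m is
# the number of registers. Assuming k=16 means m = 2**16:
expected_error = 1.04 / math.sqrt(2 ** 16)  # about 0.004, i.e. ~0.4%

# A ~45% error is therefore far outside spec. With only 1024 items and
# 65536 registers the estimator should also be in the small-range
# (linear counting) regime, so a miss this large suggests that the
# small-range correction is buggy or not being applied.
```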
First of all, this library is great. Thank you!
I wanted to suggest two features found in some other HLL libraries that (as far as I can tell) are missing here. It is possible to compute an estimated cardinality of the intersection between two HLLs and an estimated similarity between two HLLs.
Here is a Go library that focuses on these two features: https://github.com/axiomhq/hyperminhash
It would be awesome to have C implementations of these operations exposed to python as part of the HyperLogLog object.
hll1.intersection(hll2)
hll1.similarity(hll2)
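Until something like hyperminhash lands, both can be approximated from the existing API by inclusion-exclusion: |A ∩ B| ≈ |A| + |B| − |A ∪ B|, where |A ∪ B| comes from merging the two sketches into a fresh one. A pure-Python sketch of the arithmetic (exact set sizes stand in for the three `cardinality()` calls so the snippet runs without the extension):

```python
# Inclusion-exclusion on three cardinality estimates. With this library
# the inputs would come from hll1.cardinality(), hll2.cardinality() and
# a fresh sketch that hll1 and hll2 were merge()d into. These are
# estimates, so the result can be noisy for near-disjoint sets.
def intersection_estimate(card_a, card_b, card_union):
    return max(card_a + card_b - card_union, 0.0)

def similarity_estimate(card_a, card_b, card_union):
    # Jaccard similarity: |A intersect B| / |A union B|
    if card_union == 0:
        return 0.0
    return intersection_estimate(card_a, card_b, card_union) / card_union

# Sanity check with exact set sizes standing in for the estimates:
a, b = set(range(150)), set(range(100, 250))
print(intersection_estimate(len(a), len(b), len(a | b)))  # 50
print(similarity_estimate(len(a), len(b), len(a | b)))    # 0.2
```

Note that inclusion-exclusion error grows quickly for small intersections, which is exactly the weakness hyperminhash-style sketches address.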
Hi,
I want to store the HLL result into a DB. Is there a way to get a concise integer or string form of the HLL instance?
Thanks
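Pending a dedicated export method, one common pattern is to pickle the sketch into a bytes blob and store that in a BLOB column; another is to persist the `registers()` bytearray directly. A sketch of the blob round-trip (a plain dict stands in for the HyperLogLog object so the example is self-contained; the library's own pickle support is assumed, and the table and key names are made up):

```python
import pickle
import sqlite3

# Stand-in for a sketch: with the real library you would pickle the
# HyperLogLog object itself (assuming its pickle support works; some
# issues below report bugs in it).
sketch = {"p": 16, "registers": bytes(2 ** 16)}
blob = pickle.dumps(sketch)  # opaque bytes, suitable for a BLOB column

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sketches (name TEXT PRIMARY KEY, data BLOB)")
conn.execute("INSERT INTO sketches VALUES (?, ?)", ("daily_uniques", blob))
(row,) = conn.execute("SELECT data FROM sketches WHERE name = ?",
                      ("daily_uniques",))
restored = pickle.loads(row[0])  # round-trips to an equal object
```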
import random

hll1 = HyperLogLog(16)
hll2 = HyperLogLog(16)
for i in range(100000):
    hll1.add(str(random.random()))
for i in range(100000):
    hll2.add(str(random.random()))
for i in range(100000):
    r = str(random.random())
    hll1.add(r)
    hll2.add(r)
print(hll1.cardinality())
print(hll2.cardinality())
hll1.merge(hll2)
print(hll1.cardinality())
gives
201675.90068912748
201259.4273446178
201675.90068912748
the last one should be ~300k.
After reading the leading comments in this file, I think the illustration in line 48 was wrong: the byte after b1 should be b2 rather than b3.
/* ========================== Dense representation ========================= */
/*
* Since register values will never exceed 64 we store them using only 6 bits.
* This encoding is diagrammed below:
*
* b0 b1 b3 b4
* / / / /
* +-------------------+---------+---------+
* |0000 0011|1111 0011|0110 1110|1111 1011|
* +-------------------+---------+---------+
* |_____||_____| |_____||_____| |_____|
* | | | | |
* offset m1 m2 m3 m4
*
* b = bytes, m = registers
*
And I fixed this little issue in this PR: #41
Actually, that's barely an issue :P
Thanks for the great work! I'm still learning the HLL algorithm.
I have a use case where I need to serialize a HLL and then deserialize it later. I used registers() and set_registers() for this. However, due to the nature of the serialization, the registers were converted from a bytearray to a bytes. Not realising this, I attempted to call set_registers with this value, and my program promptly segfaulted due to this code:
registers = PyByteArray_AsString((PyObject*) regs);
self->use_cache = 0;
int i;
for (i = 0; i < self->size; i++) {
    self->registers[i] = registers[i];
}
because the result of PyByteArray_AsString is not checked.
The expected behaviour when passed an inappropriate type would be a TypeError or other exception. Better yet would be to accept a bytes as input to this function.
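Until the C side validates its input, a thin Python-side wrapper avoids the crash by coercing to bytearray first. `safe_set_registers` is a hypothetical helper, and the stand-in class below only mimics the method name so the snippet runs without the extension:

```python
def safe_set_registers(hll, regs):
    # set_registers() reportedly segfaults on `bytes`; coerce first.
    if not isinstance(regs, bytearray):
        regs = bytearray(regs)
    hll.set_registers(regs)

# Stand-in with the same method name as the real sketch, so the wrapper
# can be exercised without the C extension installed:
class FakeHLL:
    def set_registers(self, regs):
        assert isinstance(regs, bytearray)
        self.registers = regs

hll = FakeHLL()
safe_set_registers(hll, b"\x01\x02\x03")  # bytes is coerced, no crash
```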
In line 240 of hll.c, there is a check that index is within size.
if (index > self->size) {
    char * msg = "Index greater than the number of registers.";
    PyErr_SetString(PyExc_IndexError, msg);
    return NULL;
}
Should that not be index >= self->size?
The small range correction is returning estimates that are abnormally off for a linear count.
The HLL struct is constructed in C++ and recorded in a data stream. I made a Python script that reads this data and decodes the cardinality of the HLL data. Is there any interface to do this?
The method hasn't been updated to handle sparse vs dense representations.
from HLL import HyperLogLog
import pickle
import os
import psutil
process = psutil.Process(os.getpid())
def print_memory_usage():
    B = process.memory_info().rss
    MB = B / (1 << 20)
    print(f'{MB} MB')

for i in range(10000):
    hll = HyperLogLog(14)
    pickle.dumps(hll)
    if i % 1000 == 0:
        print_memory_usage()
46.69140625 MB
78.109375 MB
109.5625 MB
140.7578125 MB
172.32421875 MB
203.77734375 MB
235.23046875 MB
266.42578125 MB
297.87890625 MB
329.33203125 MB
Without pickle.dumps
46.63671875 MB
46.63671875 MB
46.63671875 MB
46.63671875 MB
46.63671875 MB
46.63671875 MB
46.63671875 MB
46.63671875 MB
46.63671875 MB
46.63671875 MB
Python version: 3.7.3
Add program description, list of commands, example usage, and an explanation of the algorithm.
Hi,
First of all, thank you very much for this implementation. While playing with the library, I found out that serializing and deserializing a HyperLogLog object and then merging it into another leads to a big drop in accuracy. Here is the code to reproduce:
Python: 3.9.16
HLL: 2.0.3
from HLL import HyperLogLog
import random
import pickle
random.seed(0)
def test_union_precision(serde=False):
    union_count = 1000
    candidate_values = [str(i) for i in range(100_000)]
    picked_values = set()
    agg_hll = HyperLogLog(p=8, seed=0)
    for _ in range(union_count):
        hll = HyperLogLog(p=8, seed=0)
        values = random.sample(candidate_values, k=random.randint(0, 100))
        picked_values.update(values)
        for v in values:
            hll.add(v)
        if serde:
            hll = pickle.loads(pickle.dumps(hll))
        agg_hll.merge(hll)
    deviation = agg_hll.cardinality() / len(picked_values)
    return deviation

print(test_union_precision(serde=False), test_union_precision(serde=True))
gives
1.048 0.130
I have seen the previously resolved issues, and indeed my registers are all the same before and after serialization/deserialization, so I suspect the error is somewhere else, but I am not familiar enough with the codebase to find it.
Thank you in advance for your help.
I'm getting estimates which are way off. Here's some simple code to show the problem:
import HLL
from random import choice
#Sample 5000 numbers from [0..1000], i.e. ~20% uniques
possible = range(1000)
sample = [choice(possible) for _ in range(5000)] #sample with repetition
print "True cardinality: {}".format(len(set(sample)))
hll = HLL.HyperLogLog(14)
for s in sample:
    hll.add(str(s))
print "Estimated cardinality: {}".format(hll.cardinality())
which outputs:
True cardinality: 987
Estimated cardinality: 1431.81903039
Am I missing something? This seems to be basic functionality. Otherwise, the library works great. Super fast compared to https://github.com/svpcom/hyperloglog (which does give correct estimates)
I want to check whether values form an index (e.g. 1, 2, 3, ...), so I need to know the expected inaccuracy percentage. According to the HLL specification, the accepted counting error is calculated by the formula 1.04 / math.sqrt(m), where m is the number of registers (2^k).
I set k to 12, so m = 2^12 = 4096:
import math
inaccuracy_percent = 100 * (1.04 / math.sqrt(2**12)) # 1.625%
But in practice, the percentage is sometimes 2 times greater. For example, I have 1M index values (potentially 50M). I add them to the algorithm in chunks of 100,000 values with a follow-up merge:
from HLL import HyperLogLog

values = [range((0 + i) * 100000, (1 + i) * 100000) for i in xrange(10)]
rows_counter = 0
result = None
for chunk_values in values:
    rows_counter += len(chunk_values)
    hll = HyperLogLog(12)  # k=12
    [hll.add(str(x)) for x in chunk_values]
    if result:
        hll.merge(result)
    inaccuracy_percent = abs(1 - hll.cardinality() / rows_counter) * 100
    print(rows_counter, float('%.2f' % inaccuracy_percent), 0 <= inaccuracy_percent <= 1.625)
    result = hll

if 0 <= inaccuracy_percent <= 1.625:
    # Values look like an index
    pass
else:
    # Values are not an index
    pass
Output:
(100000, 1.44, True)
(200000, 0.57, True)
(300000, 2.36, False)
(400000, 0.99, True)
(500000, 1.57, True)
(600000, 2.41, False)
(700000, 1.13, True)
(800000, 3.02, False)
(900000, 2.32, False)
(1000000, 2.55, False)
Why is the percentage sometimes 2 times greater than 1.625? Or how can I calculate the exact inaccuracy percentage?
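For what it's worth, 1.04/√m in the HLL paper is the standard deviation of the relative error, not a hard ceiling: under a normal approximation roughly a third of independent runs land outside ±1σ, which matches the True/False mix above. A quick computation of the 1σ and 2σ thresholds for k=12 (pure stdlib; the coverage percentages are standard normal-approximation figures, not something this library guarantees):

```python
import math

m = 2 ** 12                  # 4096 registers at k=12
sigma = 1.04 / math.sqrt(m)  # relative standard error, 0.01625

# ~68% of runs fall within 1 sigma and ~95% within 2 sigma, so observed
# errors up to ~3.25% are unsurprising; 2 sigma is the more useful
# acceptance threshold for an "is this an index?" test.
one_sigma_pct = 100 * sigma      # 1.625
two_sigma_pct = 100 * 2 * sigma  # 3.25
```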
In hll.add(data), if the HLL internal register was altered we could return True, otherwise False.
My use case is a high-throughput stream processing application, where I need to look up and update multiple HLLs in an external store. The data in the stream is quite repetitive, so the HLL may not be updated most of the time.
If the HLL is not updated, then we don't need to write it to the external store. Currently, to detect whether the HLL has changed, we have to store the registers, add data to the HLL, and then compare the registers again with the old value. This is an expensive operation, and it is not needed if hll.add(data) itself returns whether any register has been updated.
Redis' PFADD command also returns 1 or 0 depending on whether the HLL internal register was altered or not.
I can send a PR if interested.
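As a stopgap, the compare-registers workaround can at least be packaged as a helper. `add_changed` is hypothetical, `registers()` is assumed to return a copyable snapshot, and the per-call copy here is exactly the cost the proposed return value would eliminate:

```python
def add_changed(hll, data):
    """Add `data` and report whether any register actually changed."""
    before = bytes(hll.registers())
    hll.add(data)
    return bytes(hll.registers()) != before
```

With this, a stream processor can skip the external-store write whenever `add_changed` returns False, mirroring PFADD's 1/0 contract.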
I compiled and installed your library, and the result seems biased when k > 13. I tried with different seeds, and the estimator consistently returns 1.4x the real count...
Do you have the same issue?
Best,
After serializing and deserializing using pickle, it looks like the registers in the new sketch are shifted by 9.
from HLL import HyperLogLog
import pickle

hll = HyperLogLog(p=12, sparse=False)
for x in range(1_000_000):
    hll.add(str(x))
pickled_hll = pickle.loads(pickle.dumps(hll))
for i in range(hll.size()):
    print(hll.get_register(i), pickled_hll.get_register(i))
Results in:
8 0
10 0
9 0
9 0
10 0
9 0
10 0
8 0
8 0
10 8
8 10
12 9
9 9
6 10
8 9
9 10
9 8
10 8
[...]
This results in some large miscounts in cardinality when merging deserialized sketches:
from HLL import HyperLogLog
import pickle

sets = [
    range(0, 1_000_000),
    range(500_000, 1_500_000),
    range(1_000_000, 2_000_000)
]

def count_cardinality(sets):
    all_items = set()
    for s in sets:
        all_items.update(s)
    return len(all_items)

def count_cardinality_hll_merged(sets):
    all_items = HyperLogLog(sparse=False)
    for s in sets:
        hll = HyperLogLog(sparse=False)
        for item in s:
            hll.add(str(item))
        all_items.merge(hll)
    return all_items.cardinality()

def count_cardinality_hll_pickled_and_merged(sets):
    all_items = HyperLogLog(sparse=False)
    for s in sets:
        hll = HyperLogLog(sparse=False)
        for item in s:
            hll.add(str(item))
        all_items.merge(pickle.loads(pickle.dumps(hll)))
    return all_items.cardinality()

print(count_cardinality(sets), count_cardinality_hll_merged(sets), count_cardinality_hll_pickled_and_merged(sets))
gives:
2000000 1987453 802337