ascv / hyperloglog
Fast HyperLogLog for Python.
License: MIT License
I'm running on Mac OS X 10.10.1 and had an issue importing the package after installing:
Python 2.7.8 (default, Dec 10 2014, 14:56:51)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.56)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import HLL
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: dlopen(/usr/local/lib/python2.7/site-packages/HLL.so, 2): Symbol not found: _PyInt_AS_LONG
Referenced from: /usr/local/lib/python2.7/site-packages/HLL.so
Expected in: flat namespace
in /usr/local/lib/python2.7/site-packages/HLL.so
Output from installation:
λ pip install HLL
Downloading/unpacking HLL
Downloading HLL-0.833.tar.gz
Running setup.py (path:/private/var/folders/km/n8n4rfqs3_sbcrt7zd8bvyww0000gn/T/pip_build_jmalina/HLL/setup.py) egg_info for package HLL
Installing collected packages: HLL
Running setup.py install for HLL
building 'HLL' extension
clang -fno-strict-aliasing -fno-common -dynamic -I/usr/local/include -I/usr/local/opt/sqlite/include -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.10.sdk -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.10.sdk/System/Library/Frameworks/Tk.framework/Versions/8.5/Headers -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/usr/local/Cellar/python/2.7.8_2/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c hll.c -o build/temp.macosx-10.10-x86_64-2.7/hll.o
hll.c:198:26: warning: implicit declaration of function 'PyByteArray_AsString' is invalid in C99 [-Wimplicit-function-declaration]
char *hllRegisters = PyByteArray_AsString(hllByteArray);
^
hll.c:198:11: warning: incompatible integer to pointer conversion initializing 'char *' with an expression of type 'int' [-Wint-conversion]
char *hllRegisters = PyByteArray_AsString(hllByteArray);
^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
hll.c:218:17: warning: implicit declaration of function 'PyByteArray_FromStringAndSize' is invalid in C99 [-Wimplicit-function-declaration]
registers = PyByteArray_FromStringAndSize(self->registers, self->size);
^
hll.c:218:15: warning: incompatible integer to pointer conversion assigning to 'PyObject *' (aka 'struct _object *') from 'int' [-Wint-conversion]
registers = PyByteArray_FromStringAndSize(self->registers, self->size);
^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
hll.c:234:15: warning: comparison of unsigned expression < 0 is always false [-Wtautological-compare]
if (index < 0) {
~~~~~ ^ ~
hll.c:252:14: warning: comparison of unsigned expression < 0 is always false [-Wtautological-compare]
if (rank < 0) {
~~~~ ^ ~
6 warnings generated.
clang -fno-strict-aliasing -fno-common -dynamic -I/usr/local/include -I/usr/local/opt/sqlite/include -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.10.sdk -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.10.sdk/System/Library/Frameworks/Tk.framework/Versions/8.5/Headers -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/usr/local/Cellar/python/2.7.8_2/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c murmur3.c -o build/temp.macosx-10.10-x86_64-2.7/murmur3.o
murmur3.c:21:15: warning: duplicate 'inline' declaration specifier [-Wduplicate-decl-specifier]
static inline FORCE_INLINE uint32_t rotl32 ( uint32_t x, int8_t r )
^
murmur3.c:16:53: note: expanded from macro 'FORCE_INLINE'
#define FORCE_INLINE __attribute__((always_inline)) inline
^
murmur3.c:26:15: warning: duplicate 'inline' declaration specifier [-Wduplicate-decl-specifier]
static inline FORCE_INLINE uint64_t rotl64 ( uint64_t x, int8_t r )
^
murmur3.c:16:53: note: expanded from macro 'FORCE_INLINE'
#define FORCE_INLINE __attribute__((always_inline)) inline
^
murmur3.c:45:15: warning: duplicate 'inline' declaration specifier [-Wduplicate-decl-specifier]
static inline FORCE_INLINE uint32_t fmix32 ( uint32_t h )
^
murmur3.c:16:53: note: expanded from macro 'FORCE_INLINE'
#define FORCE_INLINE __attribute__((always_inline)) inline
^
murmur3.c:58:15: warning: duplicate 'inline' declaration specifier [-Wduplicate-decl-specifier]
static inline FORCE_INLINE uint64_t fmix64 ( uint64_t k )
^
murmur3.c:16:53: note: expanded from macro 'FORCE_INLINE'
#define FORCE_INLINE __attribute__((always_inline)) inline
^
murmur3.c:26:37: warning: unused function 'rotl64' [-Wunused-function]
static inline FORCE_INLINE uint64_t rotl64 ( uint64_t x, int8_t r )
^
murmur3.c:58:37: warning: unused function 'fmix64' [-Wunused-function]
static inline FORCE_INLINE uint64_t fmix64 ( uint64_t k )
^
6 warnings generated.
clang -bundle -undefined dynamic_lookup -L/usr/local/lib -L/usr/local/opt/sqlite/lib -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.10.sdk build/temp.macosx-10.10-x86_64-2.7/hll.o build/temp.macosx-10.10-x86_64-2.7/murmur3.o -o build/lib.macosx-10.10-x86_64-2.7/HLL.so
Successfully installed HLL
Cleaning up...
This will allow the user to set the initial array of ranks and merge cardinality estimators.
What kind of accuracy should I expect? Even using k=16 I get very poor results (error ~45%) after adding 1024 items:
$ cat hll.py
from HLL import HyperLogLog
hll = HyperLogLog(16)
for x in range(1024):
    hll.add(str(x))
print hll.cardinality()
$ python hll.py
1481.65524266
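For reference, here is what the HyperLogLog paper's error bound predicts at this precision (a pure-stdlib computation; that `HyperLogLog(16)` means m = 2^16 registers is an assumption):

```python
import math

# HyperLogLog's relative standard error is 1.04 / sqrt(m), where m is
# the number of registers. Assuming k=16 means m = 2**16:
expected_error = 1.04 / math.sqrt(2 ** 16)  # about 0.004, i.e. ~0.4%

# A ~45% error is therefore far outside spec. With only 1024 items and
# 65536 registers the estimator should also be in the small-range
# (linear counting) regime, so a miss this large suggests that the
# small-range correction is buggy or not being applied.
```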
First of all, this library is great. Thank you!
I wanted to suggest two features found in some other HLL libraries that (as far as I can tell) are missing here. It is possible to compute an estimated cardinality of the intersection between two HLLs and an estimated similarity between two HLLs.
Here is a Go library that focuses on these two features: https://github.com/axiomhq/hyperminhash
It would be awesome to have C implementations of these operations exposed to python as part of the HyperLogLog object.
hll1.intersection(hll2)
hll1.similarity(hll2)
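Until something like hyperminhash lands, both can be approximated from the existing API by inclusion-exclusion: |A ∩ B| ≈ |A| + |B| − |A ∪ B|, where |A ∪ B| comes from merging the two sketches into a fresh one. A pure-Python sketch of the arithmetic (exact set sizes stand in for the three `cardinality()` calls so the snippet runs without the extension):

```python
# Inclusion-exclusion on three cardinality estimates. With this library
# the inputs would come from hll1.cardinality(), hll2.cardinality() and
# a fresh sketch that hll1 and hll2 were merge()d into. These are
# estimates, so the result can be noisy for near-disjoint sets.
def intersection_estimate(card_a, card_b, card_union):
    return max(card_a + card_b - card_union, 0.0)

def similarity_estimate(card_a, card_b, card_union):
    # Jaccard similarity: |A intersect B| / |A union B|
    if card_union == 0:
        return 0.0
    return intersection_estimate(card_a, card_b, card_union) / card_union

# Sanity check with exact set sizes standing in for the estimates:
a, b = set(range(150)), set(range(100, 250))
print(intersection_estimate(len(a), len(b), len(a | b)))  # 50
print(similarity_estimate(len(a), len(b), len(a | b)))    # 0.2
```

Note that inclusion-exclusion error grows quickly for small intersections, which is exactly the weakness hyperminhash-style sketches address.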
Hi,
I want to store the HLL result into a DB. Is there a way to get a concise integer or string form of the HLL instance?
Thanks
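Pending a dedicated export method, one common pattern is to pickle the sketch into a bytes blob and store that in a BLOB column; another is to persist the `registers()` bytearray directly. A sketch of the blob round-trip (a plain dict stands in for the HyperLogLog object so the example is self-contained; the library's own pickle support is assumed, and the table and key names are made up):

```python
import pickle
import sqlite3

# Stand-in for a sketch: with the real library you would pickle the
# HyperLogLog object itself (assuming its pickle support works; some
# issues below report bugs in it).
sketch = {"p": 16, "registers": bytes(2 ** 16)}
blob = pickle.dumps(sketch)  # opaque bytes, suitable for a BLOB column

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sketches (name TEXT PRIMARY KEY, data BLOB)")
conn.execute("INSERT INTO sketches VALUES (?, ?)", ("daily_uniques", blob))
(row,) = conn.execute("SELECT data FROM sketches WHERE name = ?",
                      ("daily_uniques",))
restored = pickle.loads(row[0])  # round-trips to an equal object
```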
import random

hll1 = HyperLogLog(16)
hll2 = HyperLogLog(16)
for i in range(100000):
    hll1.add(str(random.random()))
for i in range(100000):
    hll2.add(str(random.random()))
for i in range(100000):
    r = str(random.random())
    hll1.add(r)
    hll2.add(r)
print(hll1.cardinality())
print(hll2.cardinality())
hll1.merge(hll2)
print(hll1.cardinality())
gives
201675.90068912748
201259.4273446178
201675.90068912748
the last one should be ~300k.
After reading the leading comments in this file, I think the illustration in line 48 was wrong: the byte after b1 should be b2 rather than b3.
/* ========================== Dense representation ========================= */
/*
* Since register values will never exceed 64 we store them using only 6 bits.
* This encoding is diagrammed below:
*
* b0 b1 b3 b4
* / / / /
* +-------------------+---------+---------+
* |0000 0011|1111 0011|0110 1110|1111 1011|
* +-------------------+---------+---------+
* |_____||_____| |_____||_____| |_____|
* | | | | |
* offset m1 m2 m3 m4
*
* b = bytes, m = registers
*
And I fixed this little issue in this PR: #41
Actually, that's barely an issue :P
Thanks for the great work! I'm still learning the HLL algorithm.
I have a use case where I need to serialize a HLL and then deserialize it later. I used registers() and set_registers() for this. However, due to the nature of the serialization, the registers were converted from a bytearray to a bytes. Not realising this, I attempted to call set_registers with this value, and my program promptly segfaulted due to this code:
registers = PyByteArray_AsString((PyObject*) regs);
self->use_cache = 0;
int i;
for (i = 0; i < self->size; i++) {
    self->registers[i] = registers[i];
}
because the result of PyByteArray_AsString is not checked.
The expected behaviour when passed an inappropriate type would be a TypeError or other exception. Better yet would be to accept a bytes as input to this function.
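Until the C side validates its input, a thin Python-side wrapper avoids the crash by coercing to bytearray first. `safe_set_registers` is a hypothetical helper, and the stand-in class below only mimics the method name so the snippet runs without the extension:

```python
def safe_set_registers(hll, regs):
    # set_registers() reportedly segfaults on `bytes`; coerce first.
    if not isinstance(regs, bytearray):
        regs = bytearray(regs)
    hll.set_registers(regs)

# Stand-in with the same method name as the real sketch, so the wrapper
# can be exercised without the C extension installed:
class FakeHLL:
    def set_registers(self, regs):
        assert isinstance(regs, bytearray)
        self.registers = regs

hll = FakeHLL()
safe_set_registers(hll, b"\x01\x02\x03")  # bytes is coerced, no crash
```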
In line 240 of hll.c, there is a check that index is within size.
if (index > self->size) {
    char * msg = "Index greater than the number of registers.";
    PyErr_SetString(PyExc_IndexError, msg);
    return NULL;
}
Should that not be index >= self->size?
The small range correction is returning estimates that are abnormally off for a linear count.
The HLL struct is constructed in C++ and recorded in a data stream. I made a Python script that reads this data and decodes the cardinality of the HLL data. Is there any interface to do this?
The method hasn't been updated to handle sparse vs dense representations.
from HLL import HyperLogLog
import pickle
import os
import psutil
process = psutil.Process(os.getpid())
def print_memory_usage():
    B = process.memory_info().rss
    MB = B / (1 << 20)
    print(f'{MB} MB')

for i in range(10000):
    hll = HyperLogLog(14)
    pickle.dumps(hll)
    if i % 1000 == 0:
        print_memory_usage()
46.69140625 MB
78.109375 MB
109.5625 MB
140.7578125 MB
172.32421875 MB
203.77734375 MB
235.23046875 MB
266.42578125 MB
297.87890625 MB
329.33203125 MB
Without pickle.dumps
46.63671875 MB
46.63671875 MB
46.63671875 MB
46.63671875 MB
46.63671875 MB
46.63671875 MB
46.63671875 MB
46.63671875 MB
46.63671875 MB
46.63671875 MB
Python version: 3.7.3
Add program description, list of commands, example usage, and an explanation of the algorithm.
Hi,
First of all, thank you very much for this implementation. While playing with the library, I found out that serializing and deserializing a HyperLogLog object and then merging it into another leads to a big drop in accuracy. Here is the code to reproduce:
Python: 3.9.16
HLL: 2.0.3
from HLL import HyperLogLog
import random
import pickle
random.seed(0)
def test_union_precision(serde=False):
    union_count = 1000
    candidate_values = [str(i) for i in range(100_000)]
    picked_values = set()
    agg_hll = HyperLogLog(p=8, seed=0)
    for _ in range(union_count):
        hll = HyperLogLog(p=8, seed=0)
        values = random.sample(candidate_values, k=random.randint(0, 100))
        picked_values.update(values)
        for v in values:
            hll.add(v)
        if serde:
            hll = pickle.loads(pickle.dumps(hll))
        agg_hll.merge(hll)
    deviation = agg_hll.cardinality() / len(picked_values)
    return deviation

print(test_union_precision(serde=False), test_union_precision(serde=True))
gives
1.048 0.130
I have seen the previously resolved issues, and indeed my registers are all the same before and after serialization/deserialization, so I suspect the error is somewhere else, but I am not familiar enough with the codebase to find it.
Thank you in advance for your help.
I'm getting estimates which are way off. Here's some simple code to show the problem:
import HLL
from random import choice
#Sample 5000 numbers from [0..1000], i.e. ~20% uniques
possible = range(1000)
sample = [choice(possible) for _ in range(5000)] #sample with repetition
print "True cardinality: {}".format(len(set(sample)))
hll = HLL.HyperLogLog(14)
for s in sample:
    hll.add(str(s))
print "Estimated cardinality: {}".format(hll.cardinality())
which outputs:
True cardinality: 987
Estimated cardinality: 1431.81903039
Am I missing something? This seems to be basic functionality. Otherwise, the library works great. Super fast compared to https://github.com/svpcom/hyperloglog (which does give correct estimates)
I want to check whether values form an index (e.g. 1, 2, 3, ...), so I need to know the expected inaccuracy percentage. According to the HLL specification, the accepted counting error is calculated by the formula 1.04 / math.sqrt(m), where m is the number of registers (2^k).
I set k to 12, so m = 2^12 = 4096:
import math
inaccuracy_percent = 100 * (1.04 / math.sqrt(2**12)) # 1.625%
But in practice, the percentage is sometimes 2 times greater. For example, I have 1M index values (potentially 50M). I add them to the algorithm in chunks of 100,000 values with a follow-up merge:
from HLL import HyperLogLog

values = [range((0 + i) * 100000, (1 + i) * 100000) for i in xrange(10)]
rows_counter = 0
result = None
for chunk_values in values:
    rows_counter += len(chunk_values)
    hll = HyperLogLog(12)  # k=12
    [hll.add(str(x)) for x in chunk_values]
    if result:
        hll.merge(result)
    inaccuracy_percent = abs(1 - hll.cardinality() / rows_counter) * 100
    print(rows_counter, float('%.2f' % inaccuracy_percent), 0 <= inaccuracy_percent <= 1.625)
    result = hll

if 0 <= inaccuracy_percent <= 1.625:
    # Values look like an index
    pass
else:
    # Values are not an index
    pass
Output:
(100000, 1.44, True)
(200000, 0.57, True)
(300000, 2.36, False)
(400000, 0.99, True)
(500000, 1.57, True)
(600000, 2.41, False)
(700000, 1.13, True)
(800000, 3.02, False)
(900000, 2.32, False)
(1000000, 2.55, False)
Why is the percentage sometimes 2 times greater than 1.625? Or how can I calculate the exact inaccuracy percentage?
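For what it's worth, 1.04/√m in the HLL paper is the standard deviation of the relative error, not a hard ceiling: under a normal approximation roughly a third of independent runs land outside ±1σ, which matches the True/False mix above. A quick computation of the 1σ and 2σ thresholds for k=12 (pure stdlib; the coverage percentages are standard normal-approximation figures, not something this library guarantees):

```python
import math

m = 2 ** 12                  # 4096 registers at k=12
sigma = 1.04 / math.sqrt(m)  # relative standard error, 0.01625

# ~68% of runs fall within 1 sigma and ~95% within 2 sigma, so observed
# errors up to ~3.25% are unsurprising; 2 sigma is the more useful
# acceptance threshold for an "is this an index?" test.
one_sigma_pct = 100 * sigma      # 1.625
two_sigma_pct = 100 * 2 * sigma  # 3.25
```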
In hll.add(data), if the HLL internal register was altered we could return True, otherwise False.
My use case is a high-throughput stream processing application, where I need to look up and update multiple HLLs in an external store. The data in the stream is quite repetitive, so the HLL may not be updated most of the time.
If the HLL is not updated, then we don't need to write it to the external store. Currently, to detect whether the HLL has changed, we have to store the registers, add data to the HLL, and then compare the registers again with the old value. This is an expensive operation, and it is not needed if hll.add(data) itself returns whether any register has been updated.
Redis' PFADD command also returns 1 or 0 depending on whether the HLL internal register was altered or not.
I can send a PR if interested.
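As a stopgap, the compare-registers workaround can at least be packaged as a helper. `add_changed` is hypothetical, `registers()` is assumed to return a copyable snapshot, and the per-call copy here is exactly the cost the proposed return value would eliminate:

```python
def add_changed(hll, data):
    """Add `data` and report whether any register actually changed."""
    before = bytes(hll.registers())
    hll.add(data)
    return bytes(hll.registers()) != before
```

With this, a stream processor can skip the external-store write whenever `add_changed` returns False, mirroring PFADD's 1/0 contract.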
I compiled and installed your library, and the result seems biased when k > 13. I tried with different seeds, and the estimator consistently returns 1.4x the real count...
Do you have the same issue?
Best,
After serializing and deserializing using pickle, it looks like the registers in the new sketch are shifted by 9.
from HLL import HyperLogLog
import pickle

hll = HyperLogLog(p=12, sparse=False)
for x in range(1_000_000):
    hll.add(str(x))
pickled_hll = pickle.loads(pickle.dumps(hll))
for i in range(hll.size()):
    print(hll.get_register(i), pickled_hll.get_register(i))
Results in:
8 0
10 0
9 0
9 0
10 0
9 0
10 0
8 0
8 0
10 8
8 10
12 9
9 9
6 10
8 9
9 10
9 8
10 8
[...]
This results in some large miscounts in cardinality when merging deserialized sketches:
from HLL import HyperLogLog
import pickle

sets = [
    range(0, 1_000_000),
    range(500_000, 1_500_000),
    range(1_000_000, 2_000_000)
]

def count_cardinality(sets):
    all_items = set()
    for s in sets:
        all_items.update(s)
    return len(all_items)

def count_cardinality_hll_merged(sets):
    all_items = HyperLogLog(sparse=False)
    for s in sets:
        hll = HyperLogLog(sparse=False)
        for item in s:
            hll.add(str(item))
        all_items.merge(hll)
    return all_items.cardinality()

def count_cardinality_hll_pickled_and_merged(sets):
    all_items = HyperLogLog(sparse=False)
    for s in sets:
        hll = HyperLogLog(sparse=False)
        for item in s:
            hll.add(str(item))
        all_items.merge(pickle.loads(pickle.dumps(hll)))
    return all_items.cardinality()

print(count_cardinality(sets), count_cardinality_hll_merged(sets), count_cardinality_hll_pickled_and_merged(sets))
gives:
2000000 1987453 802337