GithubHelp home page GithubHelp logo

isartmontane / mysql-hyperloglog Goto Github PK

View Code? Open in Web Editor NEW

This project forked from amirtuval/mysql-hyperloglog

0.0 1.0 0.0 393 KB

A mysql plugin to to use the hyperloglog algorithm

CMake 4.14% C++ 71.60% C 21.53% Ruby 2.73%

mysql-hyperloglog's Introduction

mysql-hyperloglog

A MySQL plugin for the HyperLogLog Algorithm.
When installed, the plugin adds several aggregate functions to mysql, that allows you to estimate the distinct count of large datasets.
You can use it as a replacement for count(distinct). However, the real power of the plugin is by storing the hyperloglog result in an aggregate table, which allows you to get an estimate of the distinct count from the aggregate.

Note: An estimated count is just like it sounds - an estimate, and therefore, not 100% accurate. You can expect several percentages of difference between the actual count and the estimated count, so it might not be suitable for all use cases.
Hyperloglog trades space for accuracy, so you can increase the accuracy by allowing HLL to store more data. More on that later.

Thanks

The excellent cpp-hyperloglog project is used for the actual HyperLogLog implementation.
Thanks, Hideaki Ohno.

Usage

The plugin includes 4 aggregate functions:

HLL_CREATE - Given a list of values, this function will return the HLL string computed from these values.
HLL_COMPUTE - Given a list of values, this function will return an integer representing the estimated distinct count of these values.
HLL_MERGE - Given a list of HLL strings, this function will return the HLL string that is the combination of all the hll strings.
HLL_MERGE_COMPUTE - Given a list of HLL strings, this function will return an integer representing the estimated distinct count of these values.

HyperLogLog stores its data as a byte vector. An HLL string is the base64 representation of the byte vector.
In its default implementation, the plugin uses hyperloglog with a 10 bit width, resulting in a 1024 bytes vector. Base64 string of that is ~1400 chars long.

Here's a more detailed example.

NOTE: Each of the above functions can accept any number of arguments. Each value will be treated as an additional value added.
So you can issue the following statement, for example:

mysql> select hll_compute(1,2,4,3,4,2,1);
+----------------------------+
| hll_compute(1,2,4,3,4,2,1) |
+----------------------------+
|                          4 |
+----------------------------+
1 row in set (0.01 sec)

Installation

Compilation

The project uses CMake as a build tool, and the code is platform independent, so it should compile fine on most platforms (linux, windows, mac).
There might be some issues around compiling cmake that are beyond the scope of this guide to resolve. Please consult the documentation/relevant forums for it. For example, CMake has some issues with compiling 64 bit projects on windows.

Linux

Tested on ubuntu precise (12.04) 64 bit, but should work pretty much the same on most linux distros.

Prerequisites: Make sure you have the cmake, build-essential and libmysqlclient-dev packages installed.

Run the following from the project's root directory

git submodule update --init
cmake .
make

NOTE: You may need to tell cmake where to find mysql header files. The deault is "/usr/include/mysql", but if they are located in another directory on your machine, add "-DMYSQL_INCLUDE_DIR={DIR}" to the cmake command line.

After a successful compilation, you will have the libmysqlhll.so binary under the libmysqlhll/ directory.

Windows

Tested on Windows 7 with Visual Studio 2010 proffessional, but should work pretty much the same on most windows machines.

Prerequisites: Make sure you have CMake installed.

Run the following from the project's root directory, in a Visual Studio command prompt (open either 32 or 64 bit command prompt, depending on your chosen target CPU):

git submodule update --init
cmake .
msbuild mysqlhll.sln /p:Configuration=Release

NOTE: You may need to tell cmake where to find mysql header files. The deault is "C:\Program Files\MySQL\MySQL Server 5.6\include", but if they are located in another directory on your machine, add "-DMYSQL_INCLUDE_DIR={DIR}" to the cmake command line.

After that, mysqlhll.dll file will be located under libmysqlhll\Release.

MySQL Installation

Once the binary is compiled (.so on linux, .dll on windows), you have to copy it to the MySQL plugins dir.
If you are not sure where that is, you can check by running this command

show variables like '%plugin%';

in mysql client (On linux, it is usually /usr/lib/mysql/plugin/).

After that login to mysql as root, and run the functions installation script. Replace .so with .dll if you are on windows.

Customization

If you would like to change the bit width of the HyperLogLog algorithm, you can change it by editing the constants.hpp file before compiling the project.
The default is 12, resulting in 2**12(4096) bytes of storage. This can go as high as 16 to get better accuracy with more storage, or you can lower it to save storage.

mysql-hyperloglog's People

Contributors

amirtuval avatar isartmontane avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.