GithubHelp home page GithubHelp logo

edoliberty / streaming-quantiles Goto Github PK

View Code? Open in Web Editor NEW
52.0 7.0 23.0 235 KB

Implements the Karnin-Lang-Liberty (KLL) algorithm in python

License: Apache License 2.0

Python 39.03% Gnuplot 0.09% Jupyter Notebook 60.88%

streaming-quantiles's Introduction

streaming-quantiles

This code implements the simplest algorithm described in the paper Optimal Quantile Approximation in Streams (FOCS 2016) by Zohar Karnin, Kevin Lang and myself (Edo Liberty).

  • It is distributed free and with no warranty of any kind.
  • It should NOT be considered ready for production use.
  • It is written as an academic tool for readers of the paper to reproduce our results.
  • It is not optimized in any way.

Usage

Main class in kll.py

$ python kll.py -h

usage: kll.py [-h] [-k K] [-t {string,int,float}]
	optional arguments:
      -h, --help            show this help message and exit
      -k K                  controls the number of elements in the sketch which is
                            at most 3k+log2(n). n is the length of the stream.
      -t {string,int,float}
                            defines the type of stream items, default="string"

For convenience, the class StreamMaker is also included.

$ python streamMaker.py -h

usage: streamMaker.py [-h] [-n N] [-o {sorted,zoomin,zoomout,sqrt,random}]
optional arguments:
    -h, --help            show this help message and exit
	-n N                  the number of generated elements
	-o {sorted,zoomin,zoomout,sqrt,random}
                   		  the order of the streamed integers.

Example

Try the following:

python streamMaker.py -n 1000 -o zoomin | python kll.py -k 32 -t int

Or, if you want to save the files and look at the input-output do this:

python streamMaker.py -n 1000 -o zoomin > numbers.txt
cat numbers.txt | python kll.py -k 32 -t int > cdf.csv

You should get an approximate CDF of the input stream. The file cdf.csv should look something like this:

0.016000	7
0.032000	23
0.048000	39
0.064000	55
0.080000	72
.
.
.
0.936000	923
0.952000	940
0.968000	956
0.984000	973
1.000000	989 

You can also try something (admittedly odd) like this:

man grep | python kll.py

This will give approximate quantiles of the lines in the grep man page where the order of lines is lexicographic.

Plotting

This requires having gnuplot installed. On macs you can install like this:

brew install gnuplot

Assume you created a cdf.csv file like this:

python streamMaker.py -n 1000 -o random > numbers.txt
cat numbers.txt | python kll.py -k 32 -t int > cdf.csv

You can than plot the approximate quantiles the sketch holds. A single dot is plotted for each item stored by the sketch.

cat cdf.csv | gnuplot plot.gp > plot.eps; open plot.eps

You can then increase k and see that the plot aligns better along a straight diagonal line. Since the streams created by the stream maker are consecutive numbers 1,...,n a straight diagonal is the optimal result.

cat numbers.txt | python kll.py -k 16 -t int | gnuplot plot.gp > plot16.eps; open plot16.eps
cat numbers.txt | python kll.py -k 32 -t int | gnuplot plot.gp > plot32.eps; open plot32.eps
cat numbers.txt | python kll.py -k 64 -t int | gnuplot plot.gp > plot64.eps; open plot64.eps

Keep in mind that larger values of k also increase the size of the sketch.

streaming-quantiles's People

Contributors

edoliberty avatar leerho avatar mvsusp avatar pavelvesely avatar tdunning avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar

streaming-quantiles's Issues

Sampler?

Dear Author Liberty,
Thank you so much for providing the KLL sketch codes. In the "Optimal Quantile Approximation in Streams" and "Streaming Quantiles Algorithms" they both talked about replace the log(n) stack of compactors of capacity 2 with a sampler. I have serveral questions regarding the sampler.
If without sampler, will KLL becomes a deterministic algorithm?
In this implementation, sampler are not included?

With Many Thanks!

How to set k?

Hi,
I am a little bit confused about this.
If I want to set the ε to 0.01, which means that | R̃(x) − R(x)| ≤ 0.01 * n. What value should k be set?
Is there some certain connection between k and ε? I am looking forward to your reply.

What is the license for this software?

The important code files have a file header that says that they are Apache licensed, but the README says
this:

It should NOT be used for any commercial purposes.

Such a restriction is incompatible with the Apache license. If you want the code to be licensed as
Apache, it would help your readers if you would clarify this statement. Perhaps you meant that it
should be read as "this code should not be considered production ready"?

In addition, if you want the Apache license to apply and for this to be recognized by the github
automated scanners, it would help all around if you would insert a LICENSE containing the
full and literal text of the Apache.

My own stake in this is that I would like to use this code for testing and benchmarking of the t-digest,
but since I have a strict Apache license on the entire t-digest distro, I can't afford the ambiguity here.

I will produce a pull request with these changes that you can accept for deny according to your preference.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.