mymizan / gzint

This project forked from pirate/gzint


:scissors: A python3 library for efficiently storing massive integers (stands for gzipped-integer).

Home Page: https://pypi.python.org/pypi/gzint/



gzint: A library for storing huge integers efficiently

This python library helps store massive integers by using a gzipped-string representation in memory. It makes storing and comparing huge integers fast and lightweight, while gracefully falling back to normal integer operations when math is needed. It works as a drop-in replacement for int.

Quickstart:

pip3 install gzint
>>> from gzint import HugeInt

>>> normal_int = 10**1000000       # huge, but compressible (lots of 0's)
>>> huge_int = HugeInt(normal_int)

# HugeInts are useful when you need to store lots of large numbers without running out of memory.
# Notice how the memory footprint of a normal int is much larger than the equivalent HugeInt:
>>> normal_int.__sizeof__()
442948                      # almost half a megabyte!
>>> huge_int._value.__sizeof__()
1025                        # only about 1 kB

# HugeInts and normal ints are interchangeably comparable, and have the same hashes:
>>> HugeInt(5) == 5
True
>>> HugeInt(5) + 5
10
>>> HugeInt(5) + HugeInt(5)
10
>>> 5 in {HugeInt(5), 6, 7}
True

Theory:

This library is not magic; I have not somehow figured out how to break the pigeonhole principle. It simply exploits the fact that most large numbers we work with in real life are not 100% random: they either contain repeating patterns (like lots of 0's) or can be represented compactly using notations like scientific notation, factorial notation, Knuth's up-arrow notation, etc.

Do not bother trying to use this library if you're actually working with random data; it will only make your ints bigger.

The alpha implementation works by compressing repeating patterns in the base-10 representation of ints, which works very well for large numbers with lots of 0's, or long runs of any repeated digit. I'm working on adding other compression schemes, and automatically picking the one with the most memory savings (which may require adding threading to compress the int in several different ways concurrently).
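The effect is easy to see with the standard library alone. Here is a sketch using `zlib` (the same DEFLATE implementation gzip uses) to compare the compressed size of a highly repetitive number against random digits of similar length; the helper name is made up for illustration:

```python
import random
import sys
import zlib

# Python 3.11+ caps int-to-str conversion length by default;
# raise the cap so we can stringify a 100,001-digit number.
if hasattr(sys, 'set_int_max_str_digits'):
    sys.set_int_max_str_digits(1_000_000)

def deflated_size(n):
    """Bytes needed for the zlib-compressed base-10 digits of n."""
    return len(zlib.compress(str(n).encode()))

compressible = 10 ** 100_000               # a 1 followed by 100,000 zeros
random.seed(42)
random_ish = random.getrandbits(332_192)   # also ~100,000 decimal digits

print(deflated_size(compressible))   # a few hundred bytes: huge savings
print(deflated_size(random_ish))     # tens of kB: near the entropy limit, no real help
```

Random decimal digits carry about 3.3 bits of entropy each, so no compressor can shrink them much below ~0.42 bytes per digit; the repetitive number, by contrast, collapses to almost nothing.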

Another possible option is to try and compress all the ints used across an entire program, by storing some state every time a HugeInt is created, and seeing if patterns exist globally that can be compressed together.

Docs:

HugeInt is a type that helps store very large, but compressible, numbers in memory in Python >= 3.5. It sacrifices CPU time during initialization and math operations in exchange for fast comparisons and at-rest memory efficiency.

HugeInt implements the `int` interface, so you can almost always treat it like a normal Python `int`.

HugeInt provides the following methods that differ from int:

 - HugeInt.__init__:   Initialize a HugeInt from an `int` or str representation
 - HugeInt.__eq__:     Efficiently compare a `HugeInt` with another `HugeInt` or `int`
 - HugeInt.__str__:    Get the full `str` representation of the `HugeInt`
 - HugeInt.__repr__:   Get a short representation of the `HugeInt` suitable for console display
 - HugeInt.__hash__:   Get the `__hash__` of the uncompressed `int`
 - HugeInt.to_int:     Get the `int` representation of the `HugeInt`

Because HugeInt stores a compressed representation of the number, fast, direct math operations are not possible. For the following operations, the number is decompressed, the operation is performed using the equivalent `int` method, and the result is re-compressed and returned as a HugeInt (which can be very slow):

__abs__, __add__, __and__, __ceil__, __floor__, __floordiv__, __int__, __invert__, __le__, __lshift__, __lt__, __mod__, __mul__, __neg__, __or__, __pos__, __pow__, __radd__, __rand__, __rfloordiv__, __rlshift__, __rmod__, __rmul__, __ror__, __round__, __rpow__, __rrshift__, __rshift__, __rsub__, __rtruediv__, __rxor__, __sub__, __truediv__, __trunc__, __xor__
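A minimal sketch of this design (illustrative only: `TinyHugeInt` and its internals are invented here, not gzint's actual code). Equality and hashing can avoid math on the big number, while arithmetic pays the full decompress/recompress round-trip:

```python
import zlib

class TinyHugeInt:
    """Toy sketch of the HugeInt design; not gzint's real implementation."""

    def __init__(self, value):
        # O(n) pass over the base-10 digits, paid once at init time
        self._value = zlib.compress(str(int(value)).encode())

    def to_int(self):
        return int(zlib.decompress(self._value))

    def __eq__(self, other):
        # zlib output is deterministic for a fixed level, so equal numbers
        # always produce identical compressed bytes
        if isinstance(other, TinyHugeInt):
            return self._value == other._value
        return self._value == zlib.compress(str(int(other)).encode())

    def __hash__(self):
        # same hash as the plain int, so set/dict lookups interoperate
        return hash(self.to_int())

    def __add__(self, other):
        # the slow path: decompress, do int math, recompress the result
        other_int = other.to_int() if isinstance(other, TinyHugeInt) else int(other)
        return TinyHugeInt(self.to_int() + other_int)

    __radd__ = __add__

print(TinyHugeInt(5) == 5)                     # True
print(TinyHugeInt(5) + 5 == TinyHugeInt(10))   # True
print(5 in {TinyHugeInt(5), 6, 7})             # True
```

The set-membership check works because `__hash__` matches the plain int's hash and `__eq__` handles mixed comparisons, mirroring the interoperability shown in the quickstart.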

Example Use Case:

Read a file full of huge numbers, and check to see which ones occur more than once (in O(n) time).

numbers_seen = set()

with open('big_data.txt', 'r') as f:
    for line in f:
        compressed_int = HugeInt(line.strip())
        if compressed_int in numbers_seen:
            print('Found a familiar number:', repr(compressed_int))
        numbers_seen.add(compressed_int)

if 1000 in numbers_seen:
    print('Saw 1000')

if HugeInt(10**1000000) in numbers_seen:
    print('Saw 10^1,000,000')

Why HugeInt is slow to init:

You may notice that initializing big HugeInts takes some time. This is because HugeInt uses the gzip DEFLATE algorithm, which must make an O(n) pass over the number, where n is the number of digits in base 10. Due to this up-front cost, it's recommended to avoid HugeInts in applications where you will need to re-initialize many HugeInts, or perform many math operations on HugeInts in memory.

Right now, only `__eq__` (`==`) and `__hash__` (used for `in` membership tests) are optimized to work directly on the compressed number; other operations fall back to decompressing to an `int`, using the slower `int` math methods, then recompressing the returned value.
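The init cost is easy to measure with a sketch of the same stringify-then-deflate pass (timings are machine-dependent, and `init_cost` is an illustrative helper, not gzint's API):

```python
import sys
import time
import zlib

# Python 3.11+ caps int-to-str conversion length by default; raise it for this demo
if hasattr(sys, 'set_int_max_str_digits'):
    sys.set_int_max_str_digits(1_000_000)

def init_cost(num_digits):
    """Time one HugeInt-style init (stringify + deflate) for a number
    with `num_digits` base-10 digits; returns (seconds, compressed bytes)."""
    n = 10 ** (num_digits - 1)
    start = time.perf_counter()
    compressed = zlib.compress(str(n).encode())
    return time.perf_counter() - start, len(compressed)

for digits in (1_000, 10_000, 100_000):
    seconds, size = init_cost(digits)
    print(f'{digits:>7} digits: {seconds * 1000:7.3f} ms, {size} bytes at rest')
```

The elapsed time grows with the digit count (the O(n) pass), while the at-rest size stays tiny for these highly repetitive numbers.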

Development:

git clone https://github.com/pirate/gzint.git       # python3.5 is the only dependency (brew install python3)
cd gzint
python3.5 setup.py test                             # optional, check that tests are passing
python3.5 setup.py install
# all code is inside gzint/main.py

TODOs:

  1. Implement more compression methods and automatically pick the best one (using threads to compress in parallel?)
    • gzipped hex, binary, octal, or other base representations of the number
    • scientific notation
    • Knuth's up-arrow notation
    • factorial notation
    • prime factor notation
    • other polynomial representations
  2. Fall back to storing the int uncompressed if compression ends up making it bigger
  3. Speed up the compression & decompression
  4. See if more math operations can be performed directly on compressed HugeInts without uncompressing first
  5. Use a cached_property to prevent decompressing the same HugeInt repeatedly during int operations (allow expiry eventually with timeout to get GC benefits...?)
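TODO 2 is straightforward to prototype: compare sizes and keep whichever representation is smaller. A sketch (the function and tags below are made up for illustration, not gzint's API):

```python
import zlib

def best_encoding(n):
    """Sketch of TODO 2: keep whichever representation is smaller.
    Returns a (tag, payload) pair describing which form was stored."""
    raw = str(n).encode()
    deflated = zlib.compress(raw)
    if len(deflated) < len(raw):
        return ('zlib', deflated)
    return ('raw', raw)   # compression would only add overhead

print(best_encoding(10 ** 4000)[0])   # 'zlib': long runs of zeros compress well
print(best_encoding(123_456_789)[0])  # 'raw': short, incompressible input
```

Small or random inputs fall back to the raw digits, since the zlib header and checksum alone can exceed the savings.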

gzint's People

Contributors: pirate, tirkarthi
