python-xz's Introduction

python-xz

Pure Python implementation of the XZ file format with random access support

Leveraging the lzma module for fast (de)compression



📖 Documentation | 📃 Changelog


An XZ file can be composed of several streams and blocks. This allows for fast random access when reading, but it is not supported by Python's built-in lzma module (which would read all previous blocks for nothing).

                   lzma            lzmaffi              python-xz
module type        builtin         cffi (C extension)   pure Python

📄 read
random access      ❌ no [1]       ✔️ yes [2]           ✔️ yes [2]
several blocks     ✔️ yes          ✔️✔️ yes [3]         ✔️✔️ yes [3]
several streams    ✔️ yes          ✔️ yes               ✔️✔️ yes [4]
stream padding     ❌ no [5]       ✔️ yes               ✔️ yes

📝 write
w mode             ✔️ yes          ✔️ yes               ✔️ yes
x mode             ✔️ yes          ❌ no                ✔️ yes
a mode             ✔️ new stream   ✔️ new stream        ⏳ planned
r+/w+/… modes      ❌ no           ❌ no                ✔️ yes
several blocks     ❌ no           ❌ no                ✔️ yes
several streams    ❌ no [6]       ❌ no [6]            ✔️ yes
stream padding     ❌ no           ❌ no                ⏳ planned
Notes:

[1] Reading from a position will read the file from the very beginning
[2] Reading from a position will read the file from the beginning of the block
[3] Block positions available with the block_boundaries attribute
[4] Stream positions available with the stream_boundaries attribute
[5] Related issue
[6] Possible by manually closing and re-opening in append mode

Install

Install python-xz with pip:

$ python -m pip install python-xz

An unofficial package for conda is also available, see issue #5 for more information.

Usage

The API is similar to lzma: you can use either xz.open or xz.XZFile.

Read mode

>>> with xz.open('example.xz') as fin:
...     fin.read(18)
...     fin.stream_boundaries  # 2 streams
...     fin.block_boundaries   # 4 blocks in first stream, 2 blocks in second stream
...     fin.seek(1000)
...     fin.read(31)
...
b'Hello, world! \xf0\x9f\x91\x8b'
[0, 2000]
[0, 500, 1000, 1500, 2000, 3000]
1000
b'\xe2\x9c\xa8 Random access is fast! \xf0\x9f\x9a\x80'

Opening in text mode works as well, but note that seek arguments and boundaries are still expressed in bytes (just like with lzma.open).

>>> with xz.open('example.xz', 'rt') as fin:
...     fin.read(15)
...     fin.stream_boundaries
...     fin.block_boundaries
...     fin.seek(1000)
...     fin.read(26)
...
'Hello, world! 👋'
[0, 2000]
[0, 500, 1000, 1500, 2000, 3000]
1000
'✨ Random access is fast! 🚀'

Write mode

Writing is only supported from the end of the file. It is, however, possible to truncate the file first. Note that truncating is only supported at block boundaries.

>>> with xz.open('test.xz', 'w') as fout:
...     fout.write(b'Hello, world!\n')
...     fout.write(b'This sentence is still in the previous block\n')
...     fout.change_block()
...     fout.write(b'But this one is in its own!\n')
...
14
45
28

Advanced usage:

  • Modes like r+/w+/x+ allow opening for both reading and writing at the same time; however, in the current implementation, a block with writing in progress is automatically closed when reading data from it (a sketch follows this list).
  • The check, preset and filters arguments to xz.open and xz.XZFile configure the default values for new streams and blocks.
  • Change block with the change_block method (the preset and filters attributes can be changed beforehand to apply to the new block).
  • Change stream with the change_stream method (the check attribute can be changed beforehand to apply to the new stream).
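
As an illustration, here is a minimal sketch of the read-write modes, assuming the test.xz file created in the write example above and standard io semantics for truncate:

import io
import xz

with xz.open('test.xz', 'r+') as f:
    f.read(14)               # b'Hello, world!\n'
    f.truncate(59)           # 59 = 14 + 45: a block boundary of this file
    f.seek(0, io.SEEK_END)   # writing is only supported from the end of file
    f.write(b'A new last block\n')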

FAQ

How does random access work?

XZ files are made of a number of streams, and each stream is composed of a number of blocks. This can be seen with xz --list:

$ xz --list file.xz
Strms  Blocks   Compressed Uncompressed  Ratio  Check   Filename
    1      13     16.8 MiB    297.9 MiB  0.056  CRC64   file.xz

To read data from the middle of the 10th block, we decompress the 10th block from its start until we reach the middle (and drop that decompressed data), then return the decompressed data from that point.

Choosing a good block size is a tradeoff between seek time during random access and compression ratio.
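
Finding the block that contains a given uncompressed offset is then just a binary search over the block start offsets. A minimal sketch of the idea, using the block_boundaries values from the usage example above:

import bisect

def containing_block(block_boundaries, offset):
    # block_boundaries holds the uncompressed start offset of each block
    return bisect.bisect_right(block_boundaries, offset) - 1

# offset 1234 falls in the third block, which starts at offset 1000
assert containing_block([0, 500, 1000, 1500, 2000, 3000], 1234) == 2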

How can I create XZ files optimized for random-access?

You can open the file for writing and use the change_block method to create several blocks.
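
For example, here is a minimal sketch that writes one block per 16 MiB of input data; the file names and the block size are arbitrary:

import xz

BLOCK_SIZE = 16 * 1024 * 1024  # arbitrary uncompressed block size

with open('file', 'rb') as fin, xz.open('file.xz', 'w') as fout:
    while True:
        chunk = fin.read(BLOCK_SIZE)
        if not chunk:
            break
        if fout.tell():  # start a new block before each chunk but the first
            fout.change_block()
        fout.write(chunk)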

Other tools can create XZ files with several blocks as well:

  • XZ Utils needs to be called with flags:
$ xz -T0 file                          # threading mode
$ xz --block-size 16M file             # same size for all blocks
$ xz --block-list 16M,32M,8M,42M file  # specific size for each block
  • PIXZ creates files with several blocks by default:
$ pixz file

Python version support

As a general rule, python-xz supports and is tested against all Python versions that are both released and still officially supported (both the CPython and PyPy implementations).

If you have other use cases or find issues with some Python versions, feel free to open a ticket!


python-xz's Issues

High memory usage

Hello,

While running some of my benchmarks for ratarmount, I noticed one of them failing with a 100 GB .tar.xz test file. The memory usage increases linearly while reading over the file and exceeded 33 GB, at which point the process was killed by the OOM killer. It works with other compression backends, so I'm pretty sure the memory usage stems from python-xz. Do you have any idea where it might originate? I would have thought that you only need to store block and stream offsets, i.e., lists of integers, which should be small.

Providing conda packages

Hello @Rogdham

I am starting work on providing conda packages for ratarmount and its dependencies.
Would you be interested in providing python-xz via conda?
If yes, I could offer to create the feedstock repository for you to provide packages via conda-forge. I would also maintain this feedstock repository and (of course) add you as a maintainer too.

For reference, here's the discussion in the downstream repo: mxmlnkn/ratarmount#99

Let me know what you think,
Cheers,
Andreas 😃

How to get version information?

It would be nice if the module had a __version__ member. I find it the easiest to use, and many modules still support it.

I know that there is importlib.metadata but I find it to have many hurdles:

  • It requires Python 3.8. There is a backport, but it requires trying two different imports and adds a new dependency for older Python versions; the dataclasses backport is easier to use because the name is interchangeable.
  • It's hard to use. It requires the package name as distributed. But here the package name is python-xz, as opposed to the module name, which is just xz. I could not find a reliable way to get that name from the module. I tried xz.__name__ and xz.__package__, which both return xz. Maybe there is some other way? (A sketch of the import fallback is below.)
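
For reference, a minimal sketch of the fallback imports described above; importlib_metadata is the backport package, and python-xz is the distribution name that version() needs:

try:
    from importlib.metadata import version  # Python >= 3.8
except ImportError:
    from importlib_metadata import version  # backport, adds a dependency

# the distribution name 'python-xz' is required, not the module name 'xz'
print(version('python-xz'))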

Parallelize decoding

XZ is pretty slow compared to other compression formats. It would be really cool if python-xz could be parallelized so that it prefetches the next blocks and decodes them in parallel. I think this would be a helpful feature and a unique selling point for python-xz. I don't think there is a parallelized XZ decoder for Python at all, or is there?

I'm doing something similar in indexed_bzip2. But I am aware that this adds complexity and problems:

  • It probably won't mesh well with files opened for writing or for reading and writing, resulting in an obnoxious special case for read-opened files. Then again, xz can compress in parallel, so maybe that could also be possible.
  • How to handle bounded memory? It can't decode blocks in parallel if the decompressed results don't fit in memory. But the decompressed block sizes are known and could therefore be used to limit the parallelism. One edge case would be one or more blocks that don't fit into memory even on their own. The easy workaround would be to fall back to a serial implementation, but a more sophisticated solution should be able to handle partial block reads inside the parallel decoder framework.
  • When using multiprocessing as opposed to multithreading, there might be problems with opening the file multiple times, e.g., on Windows. Also, file objects generally cannot be reopened, and pickling file objects to other processes isn't possible either and could introduce race conditions.

I implemented a very rudimentary sketch on top of python-xz using multiprocessing.pool.Pool. It has the same design as indexed_bzip2, which is:

  • A least-recently-used block cache containing futures to the decompressed block contents.
  • A prefetcher inside the read method, which tracks the last accessed blocks and adds the next n blocks to the block cache if it detects sequential access.
  • The read method checks the block cache and/or submits new blocks for decoding as necessary, and returns the concatenated block results.

With this, I was able to speed up the decompression of a 3.1 GiB xz file (4 GiB decompressed) consisting of 171 blocks by a factor of ~7 on an 8-core CPU (16 virtual cores):

  • serial: Reading 4294967296 B took: 187.482s
  • parallel: Reading 4294967296 B took: 26.890s

However, at this point I'm becoming uncertain whether this might be easier to implement inside python-xz itself or whether the wrapper is a sufficient ad-hoc solution. It only uses public methods and members of XZFile, so it should be stable across non-major version changes.

Rudimentary unfinished sketch / proof of concept:

decompress-xz-parallel.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import os
import resource
import sys
import time

import xz

from parallel_xz_decoder import ParallelXZReader


def benchmark_python_xz_serial(filename):
    print("== Benchmark serial xz file decompression ==")

    size = 0
    t0 = time.time()
    with xz.open(filename, 'rb') as file:
        t1 = time.time()

        while True:
            readSize = len(file.read(32 * 1024 * 1024))
            if readSize == 0:
                break
            size += readSize

            if time.time() - t1 > 5:
                t1 = time.time()
                print(f"{t1 - t0:.2f}s {resource.getrusage(resource.RUSAGE_SELF).ru_maxrss // 1024} MiB RSS")

        file.close()

    t1 = time.time()
    print(f"Reading {size} B took: {t1-t0:.3f}s")


def test_python_xz_parallel(filename):
    print("== Test parallel xz file decompression ==")

    size = 0
    t0 = time.time()
    with xz.open(filename, 'rb') as file, ParallelXZReader(filename, os.cpu_count()) as pfile:
        t1 = time.time()

        while True:
            readData = file.read(8 * 1024 * 1024)
            parallelReadData = pfile.read(len(readData))
            print("Read from:", file, pfile)
            if readData != parallelReadData:
                print("inequal", len(readData), len(parallelReadData))
            assert readData == parallelReadData
            readSize = len(readData)
            if readSize == 0:
                break
            size += readSize

            if time.time() - t1 > 5:
                t1 = time.time()
                print(f"{t1 - t0:.2f}s {resource.getrusage(resource.RUSAGE_SELF).ru_maxrss // 1024} MiB RSS")

        file.close()

    t1 = time.time()
    print(f"Reading {size} B took: {t1-t0:.3f}s")


def benchmark_python_xz_parallel(filename):
    print("== Benchmark parallel xz file decompression ==")

    size = 0
    t0 = time.time()
    with ParallelXZReader(filename, os.cpu_count()) as file:
        t1 = time.time()

        while True:
            readSize = len(file.read(8 * 1024 * 1024))
            if readSize == 0:
                break
            size += readSize

            if time.time() - t1 > 5:
                t1 = time.time()
                print(f"{t1 - t0:.2f}s {resource.getrusage(resource.RUSAGE_SELF).ru_maxrss // 1024} MiB RSS")

        file.close()

    t1 = time.time()
    print(f"Reading {size} B took: {t1-t0:.3f}s")


if __name__ == '__main__':
    print("xz version:", xz.__version__)
    filename = sys.argv[1]
    benchmark_python_xz_serial(filename)
    test_python_xz_parallel(filename)
    benchmark_python_xz_parallel(filename)
    # TODO test with multistream xz

parallel_xz_decoder.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import bisect
import io
import math
import multiprocessing.pool

from typing import Iterable, List  # List was missing but is needed by LruCache

import xz


# TODO Add tests for everything


def overrides(parentClass):
    """Simple decorator that checks that a method with the same name exists in the parent class"""

    def overrider(method):
        assert method.__name__ in dir(parentClass)
        assert callable(getattr(parentClass, method.__name__))
        return method

    return overrider


class LruCache(dict):
    def __init__(self, size: int = 10):
        self.size = size
        self.lastUsed: List[int] = []

    def _refresh(self, key):
        if key in self.lastUsed:
            self.lastUsed.remove(key)
        self.lastUsed.append(key)

    def __setitem__(self, key, value):
        super().__setitem__(key, value)

        self._refresh(key)
        while super().__len__() > self.size:
            super().__delitem__(self.lastUsed.pop(0))

    def __getitem__(self, key):
        value = super().__getitem__(key)
        self._refresh(key)
        return value


class Prefetcher:
    def __init__(self, memorySize):
        self.lastFetched = []
        self.memorySize = memorySize

    def fetch(self, value):
        if value in self.lastFetched:
            self.lastFetched.remove(value)
        self.lastFetched.append(value)
        while len(self.lastFetched) > self.memorySize:
            self.lastFetched.pop(0)

    def prefetch(self, maximumToPrefetch) -> Iterable:
        if not self.lastFetched or maximumToPrefetch <= 0:
            return []

        consecutiveCount = 0
        values = self.lastFetched[::-1]
        for i, j in zip(values[0:-1], values[1:]):
            if i == j + 1:
                consecutiveCount += 1
            else:
                break

        # I want an exponential progression like: logStep**consecutiveCount with the boundary conditions:
        # logStep**0 = 1 (mathematically true for any logStep because consecutiveCount was chosen to fit)
        # logStep**maxConsecutiveCount = maximumToPrefetch
        #   => logStep = exp(ln(maximumToPrefetch)/maxConsecutiveCount)
        #   => logStep**consecutiveCount = exp(ln(maximumToPrefetch) * consecutiveCount/maxConsecutiveCount)
        prefetchCount = int(round(math.exp(math.log(maximumToPrefetch) * consecutiveCount / (self.memorySize - 1))))
        return range(self.lastFetched[-1] + 1, self.lastFetched[-1] + 1 + prefetchCount)


class ParallelXZReader(io.BufferedIOBase):
    # TODO test if a simple thread pool would also parallelize equally well
    """Uses a process pool to prefetch and cache decoded xz blocks"""

    def __init__(self, filename, parallelization):
        print("Parallelize:", parallelization)
        self.parallelization = parallelization - 1  # keep one core for on-demand decompression
        self.pool = multiprocessing.pool.Pool(self.parallelization)
        self.offset = 0
        self.filename = filename
        self.fileobj = xz.open(filename, 'rb')
        self.blockCache = LruCache(2 * parallelization)
        self.prefetcher = Prefetcher(4)

        assert self.fileobj.seekable() and self.fileobj.readable()

        # total decompressed size: block_boundaries only holds block start
        # offsets, so compute the length by seeking to the end once
        self._length = self.fileobj.seek(0, io.SEEK_END)
        self.fileobj.seek(0)

        print(self.fileobj.stream_boundaries)
        print(self.fileobj.block_boundaries)  # contains uncompressed offsets and therefore sizes -> perfect!

    def _findBlock(self, offset: int):
        blockNumber = bisect.bisect_right(self.fileobj.block_boundaries, offset)
        print("Look for offset:", offset, "found:", blockNumber)
        if blockNumber <= 0:
            return blockNumber - 1, 0, 0
        if blockNumber >= len(self.fileobj.block_boundaries) or blockNumber <= 0:
            return blockNumber - 1, offset - self.fileobj.block_boundaries[blockNumber - 1], -1

        blockSize = self.fileobj.block_boundaries[blockNumber] - self.fileobj.block_boundaries[blockNumber - 1]
        offsetInBlock = offset - self.fileobj.block_boundaries[blockNumber - 1]
        assert offsetInBlock >= 0
        assert offsetInBlock < blockSize
        return blockNumber - 1, offsetInBlock, blockSize

    def _blockSize(self, blockNumber):
        blockNumber += 1
        if blockNumber >= len(self.fileobj.block_boundaries) or blockNumber <= 0:
            return -1
        return self.fileobj.block_boundaries[blockNumber] - self.fileobj.block_boundaries[blockNumber - 1]

    @staticmethod
    def _decodeBlock(filename, offset, size):
        with xz.open(filename, 'rb') as file:
            file.seek(offset)
            return file.read(size)

    def __enter__(self):
        return self

    def __exit__(self, exception_type, exception_value, exception_traceback):
        self.close()

    @overrides(io.BufferedIOBase)
    def close(self) -> None:
        self.fileobj.close()
        self.pool.close()

    @overrides(io.BufferedIOBase)
    def fileno(self) -> int:
        # This is a virtual Python level file object and therefore does not have a valid OS file descriptor!
        raise io.UnsupportedOperation()

    @overrides(io.BufferedIOBase)
    def seekable(self) -> bool:
        return True

    @overrides(io.BufferedIOBase)
    def readable(self) -> bool:
        return True

    @overrides(io.BufferedIOBase)
    def writable(self) -> bool:
        return False

    @overrides(io.BufferedIOBase)
    def read(self, size: int = -1) -> bytes:
        print("\nread", size, "from", self.offset)
        result = bytes()
        blocks = []
        blockNumber, firstBlockOffset, blockSize = self._findBlock(self.offset)
        print("Found block:", blockNumber, blockSize, firstBlockOffset)
        if blockNumber >= len(self.fileobj.block_boundaries) or blockNumber < 0:
            return result

        pendingBlocks = sum(not block.ready() for block in self.blockCache.values())

        availableSize = blockSize - firstBlockOffset
        while True:
            # Fetch Block
            self.prefetcher.fetch(blockNumber)
            if blockNumber in self.blockCache:
                fetchedBlock = self.blockCache[blockNumber]
            else:
                print("fetch block:", blockNumber, "sized", self._blockSize(blockNumber))
                fetchedBlock = self.pool.apply_async(
                    ParallelXZReader._decodeBlock,
                    (self.filename, self.fileobj.block_boundaries[blockNumber], self._blockSize(blockNumber)),
                )
                self.blockCache[blockNumber] = fetchedBlock
                pendingBlocks += 1

            blocks.append(fetchedBlock)
            if size <= availableSize or blockSize == -1:
                break
            size -= availableSize
            self.offset += availableSize

            # Get metadata for next block
            blockNumber += 1
            if blockNumber >= len(self.fileobj.block_boundaries):
                break
            blockSize = self._blockSize(blockNumber)
            offsetInBlock = self.offset - self.fileobj.block_boundaries[blockNumber - 1]

            availableSize = blockSize - offsetInBlock

        # TODO apply prefetch suggestion
        maxToPrefetch = self.parallelization - pendingBlocks
        toPrefetch = self.prefetcher.prefetch(self.parallelization)
        print("Prefetch suggestion:", toPrefetch)
        for blockNumber in toPrefetch:
            if blockNumber < len(self.fileobj.block_boundaries) and blockNumber not in self.blockCache:
                fetchedBlock = self.pool.apply_async(
                    ParallelXZReader._decodeBlock,
                    (self.filename, self.fileobj.block_boundaries[blockNumber], self._blockSize(blockNumber)),
                )
                self.blockCache[blockNumber] = fetchedBlock
                pendingBlocks += 1
        print("pending blocks:", pendingBlocks)

        print("Got blocks:", blocks)

        while blocks:
            block = blocks.pop(0)
            # Note that it is perfectly safe to call AsyncResult.get multiple times!
            toAppend = block.get()
            print(f"Append view ({firstBlockOffset},{ size}) of block of length {len(toAppend)}")
            if firstBlockOffset > 0:
                toAppend = toAppend[firstBlockOffset:]
            if not blocks:
                toAppend = toAppend[:size]
            firstBlockOffset = 0

            result += toAppend

        if blockNumber == 21:
            print("Result:", len(result))

        # TODO fall back to reading directly from fileobj if prefetch suggests nothing at all to improve latency!
        # self.fileobj.seek(self.offset)
        # result = self.fileobj.read(size)

        self.offset += len(result)
        return result

    @overrides(io.BufferedIOBase)
    def seek(self, offset: int, whence: int = io.SEEK_SET) -> int:
        if whence == io.SEEK_CUR:
            self.offset += offset
        elif whence == io.SEEK_END:
            # self._length is the total decompressed size computed in __init__
            self.offset = self._length + offset
        elif whence == io.SEEK_SET:
            self.offset = offset

        if self.offset < 0:
            raise ValueError("Trying to seek before the start of the file!")

        return self.offset

    @overrides(io.BufferedIOBase)
    def tell(self) -> int:
        return self.offset

Manual Shell Execution

base64 /dev/urandom | head -c $(( 4*1024*1024*1024  )) > large
xz -T 0 --keep large
python3 decompress-xz-parallel.py large.xz

Simple Extraction

Sorry, I'm building a portable Android pen-testing PowerShell script for Windows and I can't for the life of me get frida-server-15.2.2-android-x86.xz extracted... 7zip can extract it, but I'm not looking to download yet another binary blob just for this xz file... Extracted, it's a single ELF binary "frida-server-15.2.2-android-x86". Please help... I can't find anything but this Python module. My environment is an Android emulator with the host running Java and Python. Everything else I can figure out except how to extract this xz file without installing 7zip...

import xz

with xz.open('frida-server-15.2.2-android-x86.xz') as f:
    print(f)

$ python 1.py
<XZFile object at 0x1044ab7aec>

Here is the project so far:
https://github.com/freeload101/Java-Android-Magisk-Burp-Objection-Root-Emulator-Easy
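
For what it's worth, extracting with python-xz only needs the decompressed bytes copied to a plain file; a minimal sketch (shutil.copyfileobj streams the data in chunks):

import shutil
import xz

with xz.open('frida-server-15.2.2-android-x86.xz', 'rb') as fin:
    with open('frida-server-15.2.2-android-x86', 'wb') as fout:
        shutil.copyfileobj(fin, fout)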
