nukemberg / cloudzip (forked from ozkatz/cloudzip)

License: Apache License 2.0 · Go 100%

cz - Cloud Zip

List and get specific files from remote zip archives without downloading the whole thing.

Tip

New: Experimental support for mounting a remote zip file as a local directory. See mounting below.

Installation

Download cloudzip

See the releases page for the latest release; binaries can be downloaded directly from GitHub.

cz ships as a single binary, so there's nothing to install: simply place it somewhere in your $PATH.

Building from source

Clone and build the project:

git clone https://github.com/ozkatz/cloudzip.git
cd cloudzip
go build -o cz main.go

Then copy the cz binary into a location in your $PATH

cp cz /usr/local/bin/

Usage

Listing the contents of a zip file without downloading it:

cz ls s3://example-bucket/path/to/archive.zip

Printing a summary of the contents (number of files, total size compressed/uncompressed):

cz info s3://example-bucket/path/to/archive.zip

Downloading and extracting a specific object from within a zip file:

cz cat s3://example-bucket/path/to/archive.zip images/cat.png > cat.png

HTTP proxy mode (see below):

cz http s3://example-bucket/path

Mounting (See below):

cz mount s3://example-bucket/path/to/archive.zip some_dir/

Unmounting:

cz umount some_dir

Why does cz exist?

My use case was a pretty specific access pattern:

Upload lots of small (~1-100KB) files as quickly as possible, while still allowing random access to them

How does cz solve this?

Well, uploading many small files to object stores is hard to do efficiently.

The most efficient approach is to bundle them into one large object and use multipart uploads, parallelizing the transfer while keeping the chunks big.

While this is commonly done with tar, the tar format doesn't keep an index of the files included in it. Scanning the archive until we find the file we're looking for means we might end up downloading the whole thing.

Zip, on the other hand, has a central directory, which is an index! It stores paths in the archive and their offset in the file.

This index, together with byte range requests (supported by all major object stores), allows reading small files from large archives without having to fetch the entire thing!

We can even write a zip file directly to remote storage without saving it locally:

zip -r - -0 * | aws s3 cp - "s3://example-bucket/path/to/archive.zip"

but what about CPU usage? Won't compression slow down the upload?

Zip files don't have to be compressed! zip -0 will result in an uncompressed archive, so there's no additional overhead.

How Does it Work?

cz ls

Listing is done by issuing 2 HTTP range requests:

  1. Fetch the last 64KB of the zip file and scan backwards for the End Of Central Directory record (EOCD), and possibly an EOCD64.
  2. The EOCD contains the exact start offset and size of the central directory, which is then read by issuing a second HTTP range request.

Once the central directory is read, it is parsed and written to stdout, similar to the output of unzip -l.

cz cat

Reading a file from the remote zip involves another HTTP range request: once we have the central directory, we find the relevant entry for the file we wish to get, and figure out its offset and size. This is then used to issue a 3rd HTTP range request.

Because zip files store each file (whether compressed or not) independently, this is enough to uncompress and write the file to stdout.

⚠️ Experimental: cz http

CloudZip can run in proxy mode, allowing you to read archived files directly from an HTTP client (usually a browser).

cz http s3://example-bucket/path

This will open an HTTP server on a random port (use --listen to bind to another address). The server maps the requested path relative to the supplied S3 URL argument. A single query argument, filename, should be supplied, referencing the file within the zip archive. E.g. GET /a/b/c.zip?filename=foobar.png will serve foobar.png from within the s3://example-bucket/path/a/b/c.zip archive.

⚠️ Experimental: cz mount

Instead of listing and downloading individual files from the remote zip, you can now mount it to a local directory.

cz mount s3://example-bucket/path/to/archive.zip my_dir/

This would show up on your local filesystem as a directory with the contents of the zip archive inside it - as if you've downloaded and extracted it.

However... behind the scenes, it would fetch only the file listing from the remote zip (just like cz ls) and spin up a small NFS server, listening on localhost, and mount it to my_dir/.

When reading files from my_dir/, they will first be downloaded and decompressed on-the-fly, just like cz cat does.

These files are downloaded into a cache directory which, unless explicitly set, is purged on unmount. To pin it to a specific location (and retain it across mount/umount cycles), set the CLOUDZIP_CACHE_DIR environment variable:

export CLOUDZIP_CACHE_DIR="/nvme/fast/cache"
cz mount s3://example-bucket/path/to/archive.zip my_dir/

To unmount:

cz umount my_dir

which will unmount the NFS share from the directory, and terminate the local NFS server for you.

Mounting, illustrated:

Demo: mounting a 32GB dataset directly from Kaggle's storage (see Kaggle usage below) as a local directory, with DuckDB reading a single file with ~1 second load time.

Caution

This is still experimental, and only supported on Linux and macOS for now.

Logging

Set the $CLOUDZIP_LOGGING environment variable to DEBUG to log storage calls to stderr:

export CLOUDZIP_LOGGING="DEBUG"
cz ls s3://example-bucket/path/to/archive.zip  # will log S3 calls to stderr

Supported backends

AWS S3

Uses the default AWS credential resolution order.

Example:

cz ls s3://example-bucket/path/to/archive.zip

HTTP / HTTPS

Example:

cz ls https://example.com/path/to/archive.zip

Kaggle

Kaggle's Dataset Download API returns a URL for a zip file, so we can use it easily with cz! Before getting started, generate an API key and store the json file in ~/.kaggle/kaggle.json (see "Authentication" in the Kaggle API docs).

Alternatively, you can store the kaggle.json in a different location and set the KAGGLE_KEY_FILE environment variable with its path.

Example:

cz ls kaggle://{userSlug}/{datasetSlug}

For example, for the dataset at https://www.kaggle.com/datasets/datasnaek/youtube-new, the cz url should be kaggle://datasnaek/youtube-new.

lakeFS

lakeFS is fully supported. cz probes the lakeFS server for pre-signed URL support and transparently uses pre-signed URLs when available; otherwise, it fetches data through the API.

cz ls lakefs://repository/main/path/to/archive.zip

Local files

Prefix the path with file:// to read from the local filesystem. Both relative and absolute paths are accepted.

Example:

cz ls file://archive.zip  # relative to current directory (./archive.zip)
cz ls file:///home/user/archive.zip  # absolute path (/home/user/archive.zip)

Contributors

ozkatz, nukemberg
