
caboodle's Introduction

caboodle

Build Status PyPI version License: MIT codecov Documentation Status

Are you tired of futzing with your cloud provider's storage API? Keeping track of file formats, serializing/deserializing, etc.? You just want to upload and download files to and from a bucket?

Caboodle abstracts away the details of artifact management in Python. It handles file syncing, authentication, and serialization/deserialization, and it does so in a cloud-provider-agnostic manner.

Installation

pip install the-whole-caboodle    

Getting Started

See documentation and the concepts section below for more details.

The intended use case is some sort of cloud-based data science workload which has three steps:

  1. Download input data (artifacts) that you want to operate on. This may be a single file in bucket storage or an entire folder, and you want to operate on each file in that folder.
  2. Perform your computations.
  3. Upload your results (also artifacts) to the remote storage.

This three-step process could be part of a single step in a larger pipeline. Caboodle is meant to simplify steps (1) and (3).

1) Download Input Artifacts

To download all files from a folder in a remote bucket:

from caboodle import gcs
from caboodle.coffer import GCSCoffer

client = gcs.get_storage_client()
my_coffer = GCSCoffer("gs://mybucket/path/in/bucket", storage_client=client)
my_artifacts = my_coffer.download()
python_objects = [m.content for m in my_artifacts]

my_artifacts is a list of Artifact objects, each of which has a deserialize method for reading the raw data into Python objects. The Coffer will attempt to infer each file's type and perform the appropriate deserialization.
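For example, since each downloaded Artifact exposes its deserialized data through content (and, per the Concepts section below, also carries a key), you can inspect what came back; note that the key attribute name is an assumption here:

for artifact in my_artifacts:
    # content holds the deserialized Python object; key is the artifact's name
    print(artifact.key, type(artifact.content))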

2) Perform Computations on Data

This is your job. Godspeed.

3) Upload Output Artifacts

To upload all files in a folder to a remote bucket:

from caboodle import gcs
from caboodle.coffer import GCSCoffer

client = gcs.get_storage_client()
my_coffer = GCSCoffer("gs://mybucket/path/in/bucket", storage_client=client)
my_coffer.upload(list_of_artifacts)

list_of_artifacts should be a list of Artifacts. See below for more information.
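As a minimal sketch of building that list, assuming the PickleArtifact class described in the Concepts section below (the caboodle.artifacts module path is an assumption):

from caboodle import gcs
from caboodle.coffer import GCSCoffer
from caboodle.artifacts import PickleArtifact  # module path is an assumption

client = gcs.get_storage_client()
my_coffer = GCSCoffer("gs://mybucket/path/in/bucket", storage_client=client)

# Wrap each result in an Artifact; the key is used to name the file in the bucket
list_of_artifacts = [
    PickleArtifact("metrics", {"accuracy": 0.97}),   # hypothetical result values
    PickleArtifact("predictions", [0, 1, 1, 0]),
]
my_coffer.upload(list_of_artifacts)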

Concepts

Your data is wrapped in the Artifact class, which handles serialization, and Artifacts are stored in a Coffer, which syncs that data with a remote storage provider. You use these two objects to upload and download various types of files from whatever cloud storage you use.

Artifacts

An artifact represents a blob of data that could be the input or output of some workload. An Artifact object contains logic for storing, serializing, and deserializing its contents. This gives you a single interface for saving data regardless of its type.

from caboodle.artifacts import PickleArtifact  # module path may differ

my_data = [1, 2, 3]  # Some picklable Python object
my_artifact = PickleArtifact("mydata", my_data)
my_artifact.serialize("my_data.pickle")

In this example, we create an object and save it using Python's pickle module to the file "my_data.pickle". Additionally, Artifacts have a key, which Coffers (discussed below) use to automatically generate a filename when saving data.

You can analogously read in data from a file using the appropriate Artifact class:

my_data2 = my_artifact.deserialize("my_data.pickle")
print(my_data2)
>>> [1,2,3]

Currently, the following types of Artifacts have been implemented: pickle, Apache Avro, Fireworks, and binary.
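As a rough sketch of how these might be used (only PickleArtifact appears by name in this README; the other class name and the caboodle.artifacts module path are assumptions):

from caboodle.artifacts import PickleArtifact, BinaryArtifact  # names/path are assumptions

# Pickle artifact for arbitrary picklable Python objects
PickleArtifact("mydata", [1, 2, 3]).serialize("my_data.pickle")

# Binary artifact for raw bytes (the default when a Coffer cannot infer the format)
BinaryArtifact("blob", b"\x00\x01\x02").serialize("blob.bin")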

Coffers

Whereas Artifacts handle serialization logic, a Coffer handles the logic of uploading Artifacts to a remote storage location:

from caboodle import gcs
from caboodle.coffer import GCSCoffer

client = gcs.get_storage_client()
my_coffer = GCSCoffer("gs://mybucket/path/in/bucket", storage_client=client)
my_coffer.upload([my_artifact1, my_artifact2])

Analogously, you can download Artifacts:

downloaded_artifacts = my_coffer.download()
python_objects = [m.content for m in downloaded_artifacts]

The Coffer will attempt to infer the file type of each downloaded blob and construct the appropriate Artifact type (defaulting to binary). Thus, python_objects is a list containing the deserialized artifacts that you initially uploaded.

Currently, only the GCSCoffer has been implemented, but in the future we will have analogous Coffers for AWS, Azure, and other storage systems.

Cloud-specific Utilities

caboodle.gcs contains helper functions for operations like uploading and downloading folders to Google Cloud Storage. It uses the official google-cloud-storage API.
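For illustration, a folder upload along the lines of what such a helper would do can be written directly against the official google-cloud-storage client; this is a sketch, not the caboodle.gcs API itself:

import os
from google.cloud import storage

# Upload every file in a local folder under a prefix in a bucket
client = storage.Client()
bucket = client.bucket("mybucket")
for name in os.listdir("results"):
    blob = bucket.blob(f"path/in/bucket/{name}")
    blob.upload_from_filename(os.path.join("results", name))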

Right now, only support for Google Cloud has been implemented. AWS and Azure will be added in the future.

See documentation for more details.

Contributing

Pull requests, questions, comments, and issues are welcome. See the issues tab for current tasks that need to be done. You can also reach me directly at [email protected]


caboodle's Issues

Implement an Azure Coffer

Coffers help decouple the semantics of where data is stored from the actual data. In general, you should be able to add any Artifacts to a Coffer, and that particular Coffer can decide how it will upload/download data to sync those artifacts.
Currently, there is only an implementation for Google Cloud Storage (GCS), but we could make one for Microsoft Azure storage as well.

Implement an S3 Coffer

As with the Azure Coffer above, Coffers decouple the semantics of where data is stored from the actual data. Currently, there is only an implementation for Google Cloud Storage (GCS), but we could make one for AWS S3 buckets as well.
