jbenet / data
package manager for datasets
Ship platform-specific binaries.
Make sure to cover:
apt-get install data
brew install data
and, of course, if they have Go installed:
go install github.com/jbenet/data
Even though it wasn't immediately clear how to proceed, the publishing workflow wasn't too difficult for a first-time user.
Steps required to publish baby's first dataset:
data user add [desired_username]
cd directory_where_my_data_resides/
data publish
Originally, the goal was for data get to allow installation without the index, i.e. supporting:
data get <package tarball file>
data get <package tarball url>
data get <package directory>
In addition to the usual
data get # in a package dir, uses `Datafile.dependencies`
data get <owner>/<name> [--save]
data get <owner>/<name>@<version> [--save]
--save adds the handle to Datafile.dependencies.
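For reference, a rough sketch of what a Datafile carrying dependencies might look like. The field names and layout below are assumptions for illustration only; the actual Datafile format may differ:
handle: myuser/mydataset@1.0
title: My Dataset
dependencies:
- otheruser/census-data@2.1
- thirduser/weather-data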
The data-blob doc, describing this in more detail, is copied below.
Should there be a local blobstore separate from the working directory datasets?
Should it be global?
Implications:
data blob - Manage blobs in the blobstore.
Managing blobs means:
put <hash> Upload blob named by <hash> to blobstore.
get <hash> Download blob named by <hash> from blobstore.
check <hash> Verify blob contents named by <hash> match <hash>.
show <hash> Output blob contents named by <hash>.
What is a blob?
Datasets are made up of files, which are made up of blobs.
(For now, 1 file is 1 blob. Chunking to be implemented)
Blobs are basically blocks of data, which are checksummed
(for integrity, de-duplication, and addressing) using a
cryptographic hash function (sha1, for now). If git comes to
mind, that's exactly right.
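To make content-addressing concrete, a blob's name is just the checksum of its contents. A sketch using the standard shasum tool (the filename is a placeholder, and whether data hashes raw file bytes or adds a header, as git does, is an assumption here):
$ shasum train-images.csv
<40-hex-char sha1>  train-images.csv
# that sha1 becomes the blob's name: same contents, same name, anywhere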
Local Blobstores
data stores blobs in blobstores. Every local dataset has a
blobstore (local caching with links TBI). Like in git, the blobs
are stored safely in the blobstore (different directory) and can
be used to reconstruct any corrupted/deleted/modified dataset files.
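Purely as an illustration, a local dataset and its blobstore might sit side by side like this (the directory names below are assumptions, not data's actual layout):
mydataset/
  Datafile
  Manifest
  train.csv              # working copy; may get corrupted/deleted/modified
  .data/blobs/adbf52...  # pristine blob contents, named by hash
# if train.csv is damaged, its blob can be used to restore it, as in git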
Remote Blobstores
data uses remote blobstores to distribute datasets across users.
The datadex service includes a blobstore (currently an S3 bucket).
By default, blobs are uploaded to and retrieved from the global
datadex blobstore.
Since blobs are uniquely identified by their hash, maintaining one
global blobstore helps reduce data redundancy. However, users can
run their own datadex service. (The index and blobstore are tied
together to ensure consistency. Please do not publish datasets to
an index if their blobs aren't in that index's blobstore.)
data can use any remote blobstore you wish. (For now, you have to
recompile; in the future, you'll be able to change it at runtime.)
Just change the datadex configuration variable, or pass "-s <url>" per command.
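For example, to point a single command at a different datadex (the URL and handle are placeholders):
$ data get -s http://my-datadex.example.com someuser/somedataset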
(data-blob is part of the plumbing, the lower-level tools.
Use it directly if you know what you're doing.)
The .data directory will allow:
- hiding the Manifest file, which is not that useful to see lying around.
- a .data/config (to override the global config).
In software, we've come to use things like semver to ensure programs and their dependencies interoperate well. The same idea could apply to datasets, though data brings its own problems and paths forward. One possible adaptation:
Given a version number MAJOR.MINOR.PATCH, increment the:
- MAJOR version when you REMOVE data,
- MINOR version when you ADD data in a backwards-compatible manner, and
- PATCH version when you CLEAN or REFORMAT data, without ADDING or REMOVING values.
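A few hypothetical bumps to make the rules concrete:
1.2.0 -> 2.0.0  a column was dropped (data REMOVED)
1.2.0 -> 1.3.0  new rows were appended, existing values untouched (data ADDED)
1.2.0 -> 1.2.1  dates were normalized to one format (CLEANED/REFORMATTED, same values)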
Discussion welcome.
everett:sfmfp beholder$ data publish
==> Guided Data Package Publishing.
Welcome to Data Package Publishing. You should read these short
messages carefully, as they contain important information about
how data works, and how your data package will be published.
First, a 'data package' is a collection of files, containing:
- various files with your data, in any format.
- 'Datafile', a file with descriptive information about the package.
- 'Manifest', a file listing the other files in the package and their checksums.
This tool will automatically:
1. Create the package
- Generate a 'Datafile', with information you will provide.
- Generate a 'Manifest', with all the files in the current directory.
2. Upload the package contents
3. Publish the package to the index
(Note: to specify which files are part of the package, and other advanced
features, use the 'data pack' command directly. See 'data pack help'.)
You are not logged in. First, either:
- Run 'data user add' to create a new user account.
- Run 'data user auth' to log in to an existing user account.
Why does publishing require a registered user account (and email)? The index
service needs to distinguish users to perform many of its tasks. For example:
- Verify who can or cannot publish datasets, or modify already published ones.
(i.e. the creator + collaborators should be able to, others should not).
- Profiles credit people for the datasets they have published.
- Malicious users can be removed, and their email addresses blacklisted to
prevent further abuse.
Many package managers provide commands to generate blank manifests. This would be nice to have.
Examples:
# npm: interactively create a package.json file
$ npm init
# Ruby: generate a blank Gemfile
$ bundle init
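A hypothetical equivalent for data might look like the following (the subcommand name is invented for illustration and does not exist yet):
# data: interactively create a blank Datafile and Manifest
$ data pack init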
It would be good to show download progress for large files. Chunking with a small enough chunk size might obviate the need, but it's probably good to have.
Something like:
get blob adbf522 train-labels-idx1-ubyte (X/45MB)
get blob adbf522 train-labels-idx1-ubyte (X% of 45MB)
get blob adbf522 train-labels-idx1-ubyte 45MB (X%)
It should set fields on the Datafile before packing/uploading.
Where does publish go?
As a lower-level plumbing command, data blob should not use the Manifest for the { blob : path } mapping. The interface should really be:
blob Manage blobs in the blobstore
put <hash> [<path>] Upload blob named <hash> from <path> to blobstore.
get <hash> [<path>] Download blob named <hash> from blobstore to <path>.
check <hash> [<path>] Verify blob contents in <path> match <hash>.
url <hash> Output Url for blob named by <hash>.
show <hash> Output blob contents named by <hash>.
Where [<path>] is missing, stdin/stdout is used instead.
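A sketch of how the stdin/stdout fallback would be used (hashes and filenames are placeholders):
$ data blob put <hash> train.csv        # upload from a file
$ cat train.csv | data blob put <hash>  # upload from stdin
$ data blob get <hash> train.csv        # download to a file
$ data blob get <hash> > train.csv      # download to stdout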
data get with no args should install the dependencies listed in the Datafile, like npm install.
What are your thoughts on disambiguating (1) fetching a single dataset from (2) fetching a collection from a Datafile? Is this something you've already considered and made a decision to avoid?
Proposing the following API modification:
$ data install
# downloads datasets given Datafile
This would pave the way for users to explicitly specify a Datafile without excessively overloading the get
command.
$ data install [-arg to specify datafile] Datafile.staging
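Under this proposal the two commands would split roughly like this (the handle and the flag for naming a Datafile are placeholders, since the exact flag isn't specified above):
$ data get someuser/somedataset@1.0         # fetch a single dataset
$ data install                              # fetch everything listed in ./Datafile
$ data install --datafile Datafile.staging  # fetch from an explicitly named Datafile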