data's Issues

Use a `.data` directory

The .data directory will allow:

  • hiding the Manifest file, which is not that useful to see lying around.
  • storing other metadata files (e.g. a local config at .data/config to override the global one)
  • running data commands from dataset subdirectories (see the sketch below)
  • a safe (local) directory for temporary files
  • potentially embedding a git repo in the future (for detailed version history, etc.)
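
A minimal sketch of how running data commands from subdirectories could work, assuming the dataset root is marked by a .data directory; the findDatasetRoot helper is hypothetical, not part of data:

    package main

    import (
        "fmt"
        "os"
        "path/filepath"
    )

    // findDatasetRoot walks up from dir looking for a ".data" directory,
    // much as git walks up to find ".git", and returns the dataset root.
    func findDatasetRoot(dir string) (string, error) {
        for {
            if fi, err := os.Stat(filepath.Join(dir, ".data")); err == nil && fi.IsDir() {
                return dir, nil
            }
            parent := filepath.Dir(dir)
            if parent == dir { // reached the filesystem root
                return "", fmt.Errorf("no .data directory found above %s", dir)
            }
            dir = parent
        }
    }

    func main() {
        cwd, _ := os.Getwd()
        root, err := findDatasetRoot(cwd)
        if err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }
        fmt.Println("dataset root:", root)
    }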

Download Progress

It would be good to show download progress for large files. Chunking with a small enough chunk size might obviate the need, but it's probably good to have.

Something like:

get blob adbf522 train-labels-idx1-ubyte (X/45MB) 
get blob adbf522 train-labels-idx1-ubyte (X% of 45MB) 
get blob adbf522 train-labels-idx1-ubyte 45MB (X%)
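
A minimal sketch of how per-blob progress could be reported in one of these formats, assuming the blob is fetched over HTTP with a known Content-Length; the progressReader type and the URL are hypothetical:

    package main

    import (
        "fmt"
        "io"
        "net/http"
        "os"
    )

    // progressReader wraps an io.Reader and prints progress as bytes pass through.
    type progressReader struct {
        r     io.Reader
        total int64 // expected size in bytes (from Content-Length)
        read  int64 // bytes read so far
        name  string
    }

    func (p *progressReader) Read(b []byte) (int, error) {
        n, err := p.r.Read(b)
        p.read += int64(n)
        if p.total > 0 {
            fmt.Fprintf(os.Stderr, "\rget blob %s (%d%% of %dMB)", p.name, 100*p.read/p.total, p.total>>20)
        }
        return n, err
    }

    func main() {
        resp, err := http.Get("https://datadex.example.com/blob/adbf522") // hypothetical blobstore URL
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()

        out, err := os.Create("train-labels-idx1-ubyte")
        if err != nil {
            panic(err)
        }
        defer out.Close()

        pr := &progressReader{r: resp.Body, total: resp.ContentLength, name: "adbf522"}
        io.Copy(out, pr)
        fmt.Fprintln(os.Stderr) // finish the progress line
    }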

Versioning Scheme

In software, we've come to use things like semver to ensure programs and their dependencies interoperate well.

Problems with this:

  • Forcing data publishers/researchers to follow a scheme does not seem fun or fruitful.
  • semver focuses on API changes, and does not apply well to data.

Possible paths:

  • Don't enforce anything. See what happens.
    This is flexible ("whatever you want to do"), but liable to yield a proliferating mess of "version" schemes. It seems like the worst option.
  • Find an existing standard that makes sense and use it.
    Are there well-established (and sane) data versioning standards? I'm not too familiar with what's out there.
  • data semver (or, more researcher-friendly, Semantic Data Versioning): a semver fork tuned for data. Perhaps something like this:
Given a version number MAJOR.MINOR.PATCH, increment the:

MAJOR version when you REMOVE data,
MINOR version when you ADD data in a backwards-compatible manner, and
PATCH version when you CLEAN or REFORMAT data, without ADDING or REMOVING values.
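
A minimal sketch of these proposed bump rules; the Version and ChangeKind types are hypothetical illustrations, not part of data:

    package main

    import "fmt"

    // Version is a MAJOR.MINOR.PATCH triple.
    type Version struct{ Major, Minor, Patch int }

    // ChangeKind describes what happened to the dataset since the last release.
    type ChangeKind int

    const (
        Removed     ChangeKind = iota // data was removed                 -> bump MAJOR
        Added                         // data was added, compatibly       -> bump MINOR
        Reformatted                   // data was cleaned or reformatted  -> bump PATCH
    )

    // Bump returns the next version implied by the change.
    func Bump(v Version, c ChangeKind) Version {
        switch c {
        case Removed:
            return Version{v.Major + 1, 0, 0}
        case Added:
            return Version{v.Major, v.Minor + 1, 0}
        default: // Reformatted
            return Version{v.Major, v.Minor, v.Patch + 1}
        }
    }

    func main() {
        v := Version{1, 2, 3}
        fmt.Println(Bump(v, Added))       // {1 3 0}
        fmt.Println(Bump(v, Removed))     // {2 0 0}
        fmt.Println(Bump(v, Reformatted)) // {1 2 4}
    }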

Discussion welcome.

feedback regarding new user onboarding

While it wasn't clear how to proceed, the publishing workflow wasn't too difficult for a first-time user.

Steps required to publish baby's first data set:

data user add [desired_username]
cd directory_where_my_data_resides/
data publish

platform binaries

Ship platform-specific binaries.

make sure to cover:

  • debian/ubuntu - apt-get install data
  • osx - brew install data
  • windows... installer?

and, of course, if they have go installed:

go install github.com/data/data

data blob: interface should not use manifest.

As a lower-level plumbing command, data blob should not use the Manifest for the { blob : path } mapping. The interface should really be:

  blob        Manage blobs in the blobstore
    put <hash> [<path>]   Upload blob named <hash> from <path> to blobstore.
    get <hash> [<path>]   Download blob named <hash> from blobstore to <path>.
    check <hash> [<path>] Verify blob contents in <path> match <hash>.
    url <hash>            Output URL for blob named by <hash>.
    show <hash>           Output blob contents named by <hash>.

Where [<path>] is missing, stdin/stdout is used instead.
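
A minimal sketch of that stdin/stdout fallback; the openSource and openSink helpers are hypothetical, not the actual data implementation:

    package main

    import (
        "io"
        "os"
    )

    // openSource returns the file at path, or stdin when no path is given.
    func openSource(path string) (io.ReadCloser, error) {
        if path == "" {
            return io.NopCloser(os.Stdin), nil
        }
        return os.Open(path)
    }

    // openSink returns the file at path, or stdout when no path is given.
    func openSink(path string) (io.WriteCloser, error) {
        if path == "" {
            return os.Stdout, nil
        }
        return os.Create(path)
    }

    func main() {
        // e.g. `data blob get <hash> [<path>]`: write the downloaded blob to
        // <path>, or to stdout when <path> is omitted. Stdin stands in for the
        // blobstore download here.
        path := ""
        if len(os.Args) > 1 {
            path = os.Args[1]
        }
        out, err := openSink(path)
        if err != nil {
            panic(err)
        }
        defer out.Close()
        io.Copy(out, os.Stdin)
    }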

support `data get <URL>`

Originally, the goal was for data get to allow installation without the index, i.e. supporting:

data get <package tarball file>
data get <package tarball url>
data get <package directory>

In addition to the usual

data get # in a package dir, uses `Datafile.dependencies`
data get <owner>/<name> [--save]
data get <owner>/<name>@<version>  [--save]

--save adds the handle to Datafile.dependencies
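
A minimal sketch of how `data get` could classify its argument to support all of the above; the category names and the handle pattern are hypothetical illustrations:

    package main

    import (
        "fmt"
        "os"
        "regexp"
        "strings"
    )

    var handleRe = regexp.MustCompile(`^[\w.-]+/[\w.-]+(@[\w.-]+)?$`) // <owner>/<name>[@<version>]

    // classify decides how `data get <arg>` should be handled.
    func classify(arg string) string {
        switch {
        case arg == "":
            return "datafile-dependencies" // no argument: use Datafile.dependencies
        case strings.HasPrefix(arg, "http://") || strings.HasPrefix(arg, "https://"):
            return "package-tarball-url"
        case strings.HasSuffix(arg, ".tar.gz"):
            return "package-tarball-file"
        case isDir(arg):
            return "package-directory"
        case handleRe.MatchString(arg):
            return "index-handle"
        default:
            return "unknown"
        }
    }

    func isDir(p string) bool {
        fi, err := os.Stat(p)
        return err == nil && fi.IsDir()
    }

    func main() {
        for _, a := range []string{"", "jbenet/mnist@1.0", "./mnist.tar.gz", "https://example.com/mnist.tar.gz"} {
            fmt.Printf("%q -> %s\n", a, classify(a))
        }
    }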

Messages out of order

everett:sfmfp beholder$ data publish
==> Guided Data Package Publishing.

Welcome to Data Package Publishing. You should read these short
messages carefully, as they contain important information about
how data works, and how your data package will be published.

First, a 'data package' is a collection of files, containing:
- various files with your data, in any format.
- 'Datafile', a file with descriptive information about the package.
- 'Manifest', a file listing the other files in the package and their checksums.

This tool will automatically:
1. Create the package
  - Generate a 'Datafile', with information you will provide.
  - Generate a 'Manifest', with all the files in the current directory.
2. Upload the package contents
3. Publish the package to the index

(Note: to specify which files are part of the package, and other advanced
 features, use the 'data pack' command directly. See 'data pack help'.)

You are not logged in. First, either:

- Run 'data user add' to create a new user account.
- Run 'data user auth' to log in to an existing user account.


Why does publishing require a registered user account (and email)? The index
service needs to distinguish users to perform many of its tasks. For example:

- Verify who can or cannot publish datasets, or modify already published ones.
  (i.e. the creator + collaborators should be able to, others should not).
- Profiles credit people for the datasets they have published.
- Malicious users can be removed, and their email addresses blacklisted to
  prevent further abuse.

Disambiguate `data get [dataset]` and `data get` from Datafile

@jbenet,

What are your thoughts on disambiguating (1) fetching a single dataset from (2) fetching a collection from a Datafile? Is this something you've already considered and decided against?

Proposing the following API modification:

$ data install
# downloads datasets given Datafile

This would pave the way for users to explicitly specify a Datafile without excessively overloading the get command.

$ data install [-arg to specify datafile] Datafile.staging

local blobstore? global?

The data-blob doc, which describes this in more detail, is copied below.

Should there be a local blobstore separate from the working directory datasets?
Should it be global?

Implications:

  • no local blobstore (current):
    • pro: space saving? only one blob copy per file
    • pro: no surprises (no random extra data repositories lying about. wysiwyg.)
    • con: blobs are stored as the files they represent. can be deleted easily.
  • local blobstore:
    • pro: keeping working directory and repository separate confers git-like safety
    • con: duplicates all data on filesystem. bad as some will be massive.
  • local blobstore (global, 1 location per user, like go workspace):
    • pro: caching of blobs across all projects on the machine.
    • pro: saves space
    • pro: fast
    • con: random (heavy) files added to a global spot on the machine
    • con: settings around the global blobstore

data blob - Manage blobs in the blobstore.

    Managing blobs means:

      put <hash>    Upload blob named by <hash> to blobstore.
      get <hash>    Download blob named by <hash> from blobstore.
      check <hash>  Verify blob contents named by <hash> match <hash>.
      show <hash>   Output blob contents named by <hash>.


    What is a blob?

    Datasets are made up of files, which are made up of blobs.
    (For now, 1 file is 1 blob. Chunking to be implemented)
    Blobs are basically blocks of data, which are checksummed
    (for integrity, de-duplication, and addressing) using a
    cryptographic hash function (sha1, for now). If git comes
    to mind, that's exactly right.

    Local Blobstores

    data stores blobs in blobstores. Every local dataset has a
    blobstore (local caching with links TBI). Like in git, the blobs
    are stored safely in the blobstore (different directory) and can
    be used to reconstruct any corrupted/deleted/modified dataset files.

    Remote Blobstores

    data uses remote blobstores to distribute datasets across users.
    The datadex service includes a blobstore (currently an S3 bucket).
    By default, the global datadex blobstore is where things are
    uploaded to and retrieved from.

    Since blobs are uniquely identified by their hash, maintaining one
    global blobstore helps reduce data redundancy. However, users can
    run their own datadex service. (The index and blobstore are tied
    together to ensure consistency. Please do not publish datasets to
    an index if their blobs aren't in that index's blobstore.)

    data can use any remote blobstore you wish. (For now, you have to
    recompile, but in the future you will be able to just change the
    datadex configuration variable, or pass in "-s <url>" per command.)

    (data-blob is part of the plumbing, lower level tools.
    Use it directly if you know what you're doing.)
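
To make the blob addressing described above concrete, here is a minimal sketch of hashing a file into a blob name; whether data adds a git-style header before hashing is not specified here, so this simply hashes the raw contents with sha1:

    package main

    import (
        "crypto/sha1"
        "fmt"
        "io"
        "os"
    )

    // blobHash returns the sha1 hex digest that names the blob stored at path.
    func blobHash(path string) (string, error) {
        f, err := os.Open(path)
        if err != nil {
            return "", err
        }
        defer f.Close()

        h := sha1.New()
        if _, err := io.Copy(h, f); err != nil {
            return "", err
        }
        return fmt.Sprintf("%x", h.Sum(nil)), nil
    }

    func main() {
        hash, err := blobHash("train-labels-idx1-ubyte") // any dataset file
        if err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }
        fmt.Println(hash)
    }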

Generate a Datafile with command like `data init`

Many package managers provide commands to generate blank manifests. This would be nice to have.

Examples:

# npm: interactively create a package.json file
npm init

# Ruby: generate a blank Gemfile
bundle init
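
A minimal sketch of what a `data init` could look like, interactively prompting for fields and writing a Datafile; the prompts and field names here (dataset, title) are hypothetical, and the real Datafile format may differ:

    package main

    import (
        "bufio"
        "fmt"
        "os"
        "strings"
    )

    // prompt asks one question on stdout and reads one line of input.
    func prompt(r *bufio.Reader, question string) string {
        fmt.Print(question + ": ")
        line, _ := r.ReadString('\n')
        return strings.TrimSpace(line)
    }

    func main() {
        in := bufio.NewReader(os.Stdin)
        handle := prompt(in, "dataset handle (<owner>/<name>)")
        title := prompt(in, "title")

        f, err := os.Create("Datafile")
        if err != nil {
            panic(err)
        }
        defer f.Close()

        fmt.Fprintf(f, "dataset: %s\ntitle: %s\n", handle, title)
        fmt.Println("Wrote Datafile.")
    }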
