jbenet / data
package manager for datasets
Ship platform-specific binaries.
Make sure to cover:
apt-get install data
brew install data
and, of course, if they have Go installed:
go install github.com/jbenet/data
Even though it wasn't immediately clear how to proceed, the publishing workflow wasn't too difficult for a first-time user.
Steps required to publish baby's first dataset:
data user add [desired_username]
cd directory_where_my_data_resides/
data publish
Originally, the goal was for data get to allow installation without the index, i.e. supporting:
data get <package tarball file>
data get <package tarball url>
data get <package directory>
In addition to the usual
data get # in a package dir, uses `Datafile.dependencies`
data get <owner>/<name> [--save]
data get <owner>/<name>@<version> [--save]
--save adds the handle to Datafile.dependencies.
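For reference, a rough sketch of what a Datafile carrying dependencies might look like. The field names and layout below are assumptions for illustration only; the actual Datafile format may differ:
handle: myuser/mydataset@1.0
title: My Dataset
dependencies:
- otheruser/census-data@2.1
- thirduser/weather-data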
The data-blob doc, describing this in more detail, is copied below.
Should there be a local blobstore separate from the working directory datasets?
Should it be global?
Implications:
data blob - Manage blobs in the blobstore.
Managing blobs means:
put <hash> Upload blob named by <hash> to blobstore.
get <hash> Download blob named by <hash> from blobstore.
check <hash> Verify blob contents named by <hash> match <hash>.
show <hash> Output blob contents named by <hash>.
What is a blob?
Datasets are made up of files, which are made up of blobs.
(For now, 1 file is 1 blob. Chunking to be implemented)
Blobs are basically blocks of data, which are checksummed
(for integrity, de-duplication, and addressing) using a
cryptographic hash function (sha1, for now). If git comes to
mind, that's exactly right.
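To make content-addressing concrete, a blob's name is just the checksum of its contents. A sketch using the standard shasum tool (the filename is a placeholder, and whether data hashes raw file bytes or adds a header, as git does, is an assumption here):
$ shasum train-images.csv
<40-hex-char sha1>  train-images.csv
# that sha1 becomes the blob's name: same contents, same name, anywhere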
Local Blobstores
data stores blobs in blobstores. Every local dataset has a
blobstore (local caching with links TBI). Like in git, the blobs
are stored safely in the blobstore (different directory) and can
be used to reconstruct any corrupted/deleted/modified dataset files.
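Purely as an illustration, a local dataset and its blobstore might sit side by side like this (the directory names below are assumptions, not data's actual layout):
mydataset/
  Datafile
  Manifest
  train.csv              # working copy; may get corrupted/deleted/modified
  .data/blobs/adbf52...  # pristine blob contents, named by hash
# if train.csv is damaged, its blob can be used to restore it, as in git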
Remote Blobstores
data uses remote blobstores to distribute datasets across users.
The datadex service includes a blobstore (currently an S3 bucket).
By default, blobs are uploaded to and retrieved from the global
datadex blobstore.
Since blobs are uniquely identified by their hash, maintaining one
global blobstore helps reduce data redundancy. However, users can
run their own datadex service. (The index and blobstore are tied
together to ensure consistency. Please do not publish datasets to
an index if their blobs aren't in that index's blobstore.)
data can use any remote blobstore you wish. (For now, you have to
recompile; in the future, you'll be able to change it at runtime.)
Just change the datadex configuration variable, or pass "-s <url>" per command.
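For example, to point a single command at a different datadex (the URL and handle are placeholders):
$ data get -s http://my-datadex.example.com someuser/somedataset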
(data-blob is part of the plumbing, the lower-level tools.
Use it directly if you know what you're doing.)
The .data directory will allow:
- hiding the Manifest file, which is not that useful to see lying around.
- a .data/config (to override the global config).
In software, we've come to use things like semver to ensure programs and their dependencies interoperate well. The same idea could apply to datasets, though data brings its own problems and paths forward. One possible adaptation:
Given a version number MAJOR.MINOR.PATCH, increment the:
- MAJOR version when you REMOVE data,
- MINOR version when you ADD data in a backwards-compatible manner, and
- PATCH version when you CLEAN or REFORMAT data, without ADDING or REMOVING values.
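A few hypothetical bumps to make the rules concrete:
1.2.0 -> 2.0.0  a column was dropped (data REMOVED)
1.2.0 -> 1.3.0  new rows were appended, existing values untouched (data ADDED)
1.2.0 -> 1.2.1  dates were normalized to one format (CLEANED/REFORMATTED, same values)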
Discussion welcome.
everett:sfmfp beholder$ data publish
==> Guided Data Package Publishing.
Welcome to Data Package Publishing. You should read these short
messages carefully, as they contain important information about
how data works, and how your data package will be published.
First, a 'data package' is a collection of files, containing:
- various files with your data, in any format.
- 'Datafile', a file with descriptive information about the package.
- 'Manifest', a file listing the other files in the package and their checksums.
This tool will automatically:
1. Create the package
- Generate a 'Datafile', with information you will provide.
- Generate a 'Manifest', with all the files in the current directory.
2. Upload the package contents
3. Publish the package to the index
(Note: to specify which files are part of the package, and other advanced
features, use the 'data pack' command directly. See 'data pack help'.)
You are not logged in. First, either:
- Run 'data user add' to create a new user account.
- Run 'data user auth' to log in to an existing user account.
Why does publishing require a registered user account (and email)? The index
service needs to distinguish users to perform many of its tasks. For example:
- Verify who can or cannot publish datasets, or modify already published ones.
(i.e. the creator + collaborators should be able to, others should not).
- Profiles credit people for the datasets they have published.
- Malicious users can be removed, and their email addresses blacklisted to
prevent further abuse.
Many package managers provide commands to generate blank manifests. This would be nice to have.
Examples:
# npm: interactively create a package.json file
$ npm init
# Ruby: generate a blank Gemfile
$ bundle init
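A hypothetical equivalent for data might look like the following (the subcommand name is invented for illustration and does not exist yet):
# data: interactively create a blank Datafile and Manifest
$ data pack init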
It would be good to show download progress for large files. Chunking with a small enough chunk size might obviate the need, but it's probably good to have.
Something like:
get blob adbf522 train-labels-idx1-ubyte (X/45MB)
get blob adbf522 train-labels-idx1-ubyte (X% of 45MB)
get blob adbf522 train-labels-idx1-ubyte 45MB (X%)
It should set fields on the Datafile before packing/uploading.
Where does publish go?
As a lower-level plumbing command, data blob should not use the Manifest for the { blob : path } mapping. The interface should really be:
blob Manage blobs in the blobstore
put <hash> [<path>] Upload blob named <hash> from <path> to blobstore.
get <hash> [<path>] Download blob named <hash> from blobstore to <path>.
check <hash> [<path>] Verify blob contents in <path> match <hash>.
url <hash> Output Url for blob named by <hash>.
show <hash> Output blob contents named by <hash>.
Where [<path>] is missing, stdin/stdout is used instead.
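A sketch of how the stdin/stdout fallback would be used (hashes and filenames are placeholders):
$ data blob put <hash> train.csv        # upload from a file
$ cat train.csv | data blob put <hash>  # upload from stdin
$ data blob get <hash> train.csv        # download to a file
$ data blob get <hash> > train.csv      # download to stdout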
data get with no args should install the dependencies listed in the Datafile, like npm install.
What are your thoughts on disambiguating (1) fetching a single dataset from (2) fetching a collection from a Datafile? Is this something you've already considered and made a decision to avoid?
Proposing the following API modification:
$ data install
# downloads datasets given Datafile
This would pave the way for users to explicitly specify a Datafile without excessively overloading the get
command.
$ data install [-arg to specify datafile] Datafile.staging
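Under this proposal the two commands would split roughly like this (the handle and the flag for naming a Datafile are placeholders, since the exact flag isn't specified above):
$ data get someuser/somedataset@1.0         # fetch a single dataset
$ data install                              # fetch everything listed in ./Datafile
$ data install --datafile Datafile.staging  # fetch from an explicitly named Datafile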