qri-io / qri

you're invited to a data party!

Home Page: https://qri.io

License: GNU General Public License v3.0

Go 99.79% Makefile 0.07% Dockerfile 0.06% Shell 0.07%
Topics: golang, service, data-science, ipfs, p2p, web3, opendata, qri, trust, dataset


qri's Issues

Need better validation before adding to dataset

INITIAL ISSUE:
Any request to the API to fetch datasets comes back with a 500 status code
json response: { "meta": { "code": 500, "error": "invalid 'ipfs ref' path" } }

Happens when the qri electron app opens and the app calls to http://localhost:3000/datasets?page=1&pageSize=100

Happens when trying to search:
http://localhost:3000/search?p=movies&page=1&pageSize=100

This turned out to be caused by qri allowing something that wasn't a dataset to be added as one.
We need better validation before allowing something to be added as a dataset:

  • check for collisions
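
A rough sketch of the kind of guard this calls for; the Dataset, Structure, and Repo types below are illustrative stand-ins, not qri's actual API:

package repo

import "fmt"

// Structure and Dataset are minimal stand-ins for qri's dataset model.
type Structure struct{ Format string }

type Dataset struct {
	Structure *Structure // nil means "this doesn't describe a dataset"
}

// Repo is a stand-in for whatever holds the user's datasets.
type Repo interface {
	GetDataset(name string) (*Dataset, error)
}

// validateBeforeAdd rejects anything that doesn't look like a dataset
// and checks for name collisions before the add is allowed.
func validateBeforeAdd(r Repo, name string, ds *Dataset) error {
	if ds == nil || ds.Structure == nil {
		return fmt.Errorf("'%s' does not describe a valid dataset", name)
	}
	if _, err := r.GetDataset(name); err == nil {
		return fmt.Errorf("a dataset named '%s' already exists", name)
	}
	return nil
}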

sql aggregate function support

From dataset_sql source:

// Aggregates is a map of all aggregate functions.
var Aggregates = map[string]bool{
	"avg":          true,
	"bit_and":      true,
	"bit_or":       true,
	"bit_xor":      true,
	"count":        true,
	"group_concat": true,
	"max":          true,
	"min":          true,
	"std":          true,
	"stddev_pop":   true,
	"stddev_samp":  true,
	"stddev":       true,
	"sum":          true,
	"var_pop":      true,
	"var_samp":     true,
	"variance":     true,
}

It'd be great if we could land support for these, as many are commonly used functions that would make for solid demos.
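
For illustration, one way the map might gate validation during query parsing; isAggregate is a hypothetical helper, not existing dataset_sql API:

package dataset_sql

import "strings"

// isAggregate reports whether a parsed function name is an aggregate,
// consulting the Aggregates map above.
func isAggregate(fnName string) bool {
	return Aggregates[strings.ToLower(fnName)]
}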

Initial Resource Definition, Metadata, and Query registry

We need distributed lookup tables for datasets, metadata, and queries. This concept is currently under-developed and needs to exist ASAP, so let's start by building a "local only" registry. This'll help us think through the needs of the feature while providing a way to demonstrate the CLI for now.

Once this is in place, the query engine should check the hash of each query against the registry and avoid re-execution if a result is found, which will be a big win in and of itself.
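
A minimal local-only sketch, assuming a simple query-hash to result-hash table (all names here are hypothetical):

package registry

import "sync"

// Registry is a local-only lookup table mapping query hashes to result hashes.
type Registry struct {
	mu      sync.Mutex
	queries map[string]string // query hash -> result hash
}

func NewRegistry() *Registry {
	return &Registry{queries: map[string]string{}}
}

// Lookup lets the query engine skip execution when a result already exists.
func (r *Registry) Lookup(queryHash string) (resultHash string, ok bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	resultHash, ok = r.queries[queryHash]
	return
}

// Register records a completed query so future runs can be deduped.
func (r *Registry) Register(queryHash, resultHash string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.queries[queryHash] = resultHash
}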

Revised Query Result hashes

Hash comparison of query results was lost in the refactor; we need to get it back so we can dedupe queries.

Export & Download

We should be able to "export" from the CLI in both raw-data and package formats; we'll then get this working on the frontend as well.

Refactored Namespace/Dataset reference functionality

With the paper refactor comes the removal of any globally-accepted notion of "repositories", and the namespace convention that came with it. While this may be reintroduced in the future, for now we need to provide users a plausible way to identify & work with data.

As a first proposal we'll introduce a concept of "datasets" to the CLI, to be thought of as the user's personal collection of datasets. Users can name these datasets whatever they please (so long as names don't overlap), and can add and remove datasets as needed.

The CLI should support a few commands:

  • qri search [query] -> Should search the network for datasets based on keywords or phrases, should display human-readable dataset info. For now this should just be based on a local registry, but display linked metadata.
  • qri dataset add [name] [resource hash] -> add a dataset to the user's current namespace
  • qri dataset remove [name] -> remove a dataset from the user's current namespace

Initial qri data.gov ingest test

Initial task list; this should be broken out into issues:

  • Start with a download of data.gov linked data
  • Filter for only direct references to csv files
  • Spot check metadata for those references
  • de-duplicate?
  • work with @b5 to understand what qri init does
  • figure out how to add metadata on qri-init: --meta flag
  • map list of fields from data.gov entries to qri metadata entries
  • determine size of ingest (for sanity's sake), possibly filtering out massive datasets
  • Reduce that set to only epa.gov (for now)
  • build a script that downloads and runs qri init against all identified resources

Native integration with local IPFS node

qri should default to interacting with (or creating) a local IPFS node, which it uses to resolve hashes over the network and to add & pin content locally.

no-network flag

We need a way to disable all networking for local testing purposes.
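
Since the CLI is built on cobra (see the stack trace in the qri init issue below), the flag itself might be wired up like this sketch; the flag name and RootCmd wiring are assumptions:

package cmd

import "github.com/spf13/cobra"

// RootCmd stands in for qri's root cobra command.
var RootCmd = &cobra.Command{Use: "qri"}

var noNetwork bool

func init() {
	// --no-network disables all networking for local testing
	RootCmd.PersistentFlags().BoolVar(&noNetwork, "no-network", false,
		"disable all networking for local testing")
}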

meta issue to write an issue about frame.py

https://github.com/pandas-dev/pandas/blob/64c8a8d6fecacb796da8265ace870a4fcab98092/pandas/core/frame.py

Make an issue outlining a side project for Kasey (or any engineer learning the qri codebase): get an understanding of the Pandas DataFrame implementation (it represents 2D tabular data, functionally similar to SQL but with 'pythonic', numpy-inspired syntax, data structures, and conventions), then assess the interoperability of the qri engine's low-level functions and data structures, and the difficulty or feasibility of building a python-to-golang wrapper or adapter.

CRUD Dataset Metadata

We need local capacity to edit the metadata of an existing dataset.

  • new command: qri ds update [name/hash] -m [metadata file] -f [dataset file]
  • expose the same functionality via the API: PUT /datasets/ipfs/:hash/metadata & PUT /datasets/:hash/data
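
A rough sketch of what the metadata endpoint's handler might look like with net/http; routing and store wiring are omitted and all names are hypothetical:

package api

import (
	"encoding/json"
	"net/http"
)

// UpdateMetadataHandler handles PUT /datasets/:hash/metadata.
func UpdateMetadataHandler(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodPut {
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		return
	}
	defer r.Body.Close()

	meta := map[string]interface{}{}
	if err := json.NewDecoder(r.Body).Decode(&meta); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	// look up the dataset by the :hash URL segment, apply meta, and
	// write the updated dataset back to the store here
	w.WriteHeader(http.StatusOK)
}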

Need instructions for installing & building

Right now it's pretty difficult to download & build qri, we should:

  • map the steps needed to construct a build
  • simplify that list if at all possible with things like shell scripts
  • add installation instructions to the readme

Basic p2p & local Dataset Search

We need a basic search feature for qri, which means first building the infrastructure to do search. Later on we'll work out actually sending the search terms themselves across the network, but for now keeping a deduplicated list of dataset references seems like a good idea. Dataset histories are going to complicate that, but we'll cross that bridge later. A sketch of the search step follows the list below.

  • p2p dataset list exchange
  • local dataset caching
  • regex-based dataset search
  • CLI-based results display
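
For the regex-based search item, a minimal sketch over a locally cached reference list (DatasetRef and Search are illustrative, not qri's actual types):

package search

import "regexp"

// DatasetRef is a stand-in for a cached dataset reference.
type DatasetRef struct {
	Name string
	Hash string
}

// Search filters the locally cached reference list with a
// case-insensitive regular expression.
func Search(cache []DatasetRef, pattern string) ([]DatasetRef, error) {
	re, err := regexp.Compile("(?i)" + pattern)
	if err != nil {
		return nil, err
	}
	results := []DatasetRef{}
	for _, ref := range cache {
		if re.MatchString(ref.Name) {
			results = append(results, ref)
		}
	}
	return results, nil
}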

Basic working skeleton function

To start with, let's just get an "optionless" skeleton function working via qri run. It will only do one thing, but it'll prove the model & elucidate the path forward. It should:

  1. Construct an IPFS node with networking deactivated, connected to an fsrepo as a backing store.
  2. Add structured data to the IPFS repo, retrieving a hash
  3. Build a resource using the resulting hash that properly describes the structured data
  4. Construct a Query on that data
  5. Add the Query to the repo, returning the hash
  6. Execute the query
  7. Add the resulting resource & structured data to the repo
  8. Link the query hash to the resulting resource hash in a query lookup table
  9. Output the resulting data to the console.
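
For step 1, a sketch using the go-ipfs core packages; treat the import paths and wiring as assumptions about the go-ipfs of this era, not qri's actual code:

package main

import (
	"context"

	"github.com/ipfs/go-ipfs/core"
	"github.com/ipfs/go-ipfs/repo/fsrepo"
)

// newOfflineNode builds an IPFS node with networking deactivated,
// backed by an fsrepo at repoPath (step 1 above).
func newOfflineNode(ctx context.Context, repoPath string) (*core.IpfsNode, error) {
	r, err := fsrepo.Open(repoPath)
	if err != nil {
		return nil, err
	}
	return core.NewNode(ctx, &core.BuildCfg{
		Online: false, // networking deactivated
		Repo:   r,
	})
}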

Restore regular SQL syntax

A prior version of qri used a strange variation on SQL syntax for namespacing purposes; we need to restore normal SQL syntax to the dataset_sql package.

add default no-save option for query execution

Currently all queries are pinned to the IPFS repo. We should make not saving the default, and instead provide a --save flag in the CLI.

  • finish #29
  • adjust castore interface to accept a pin bool argument in the Put method
  • add a Pin method to castore interface
  • modify dataset_sql.Exec to not pin by default
  • any method that wraps dataset_sql.Exec should respect the "save" arg in the QueryExecOptions of a dataset and pin the dataset result hash
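
A sketch of the adjusted castore interface the checklist describes; the method shapes are guesses at what the issue implies, not the actual package:

package castore

// Datastore sketches a content-addressed store whose Put accepts a pin flag.
type Datastore interface {
	// Put stores data, pinning it only when pin is true
	Put(data []byte, pin bool) (key string, err error)
	// Pin lets wrappers pin a result later, e.g. when --save is passed
	Pin(key string) error
	Get(key string) ([]byte, error)
}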

Refactored qri init

qri init used to be the way to run schema detection & validation on a dataset and add it to the user's local namespace. We need to refactor this to work with the new "white paper refactored" code. qri init should still run validation & schema detection, but this time successful dataset initialization should add the dataset, resource def & metadata to the local IPFS node & broadcast its existence onto some sort of distributed dataset registry.

Structure datasets as pathable trees

Currently castore just writes & pins all components of a dataset tree to the top-level /ipfs/ path, which is silly. We should save datasets as a tree of everything except the data itself, which should be a plain ol' IPFS path. This'll depend on landing qri-io/cafs#1 & qri-io/cafs#2.

Qri ingest pipeline

This is a placeholder issue for thinking about building a robust data ingest pipeline for qri. Things we're interested in:

  • metadata type detection (e.g. "that's a project-open-data file" or "that's a dcat file")
  • if we do get a recognized metadata type, validate that schema, warn user if invalid
  • attempt to uncover data download link and auto-acquire data from there
  • Harvard FITS-based content detection
  • checking for presence of the data on the network
  • solid generic fallbacks
  • we want to be able to retry things
  • we want a solid set of "Checkpoints" for UX, where users can interrupt the ingest process where sensible
  • configurable settings for batching, with solid defaults and output to logs
  • resumable batch import based on logs from previous imports
  • configurable network-based metadata cross-referencing
  • metadata title / other field based search / matching
  • transaction support: we should be able to cancel a partially-completed process & have it revert to its prior state

Resources

  • Think about the library "accession" pipeline; this is well-covered territory in the libraries space.

Outstanding questions

  • where does this run?
  • how does this integrate with the baseline qri init CLI function, if at all?
  • do we host this as a central service, but publish the results to the d web?
  • ML based metadata inference?

[cli] test suite

Our CLI needs a test suite if we're going to avoid shipping regressions. We should set one up that operates out of os.TempDir().
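
One possible scaffold, assuming each test gets a throwaway repo under the system temp dir (helper names are hypothetical):

package cmd

import (
	"io/ioutil"
	"os"
	"path/filepath"
	"testing"
)

// TestCommands runs CLI commands against a throwaway repo so tests
// never touch a developer's real qri repo.
func TestCommands(t *testing.T) {
	base, err := ioutil.TempDir(os.TempDir(), "qri_cli_test")
	if err != nil {
		t.Fatal(err)
	}
	defer os.RemoveAll(base)

	repoPath := filepath.Join(base, "repo")
	_ = repoPath // point the CLI at repoPath here, then exercise commands
}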

[cli] pipe data directly into qri commands

It'd be nice to pipe data directly into qri init, and to establish this as a general pattern. These should work:

qri init < data.csv
qri run < query.sql
...
And so on
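
A common Go pattern for detecting piped input that could back this; readPipedInput is a hypothetical helper:

package main

import (
	"io/ioutil"
	"os"
)

// readPipedInput returns data piped into the command, or nil when
// stdin is an interactive terminal.
func readPipedInput() ([]byte, error) {
	fi, err := os.Stdin.Stat()
	if err != nil {
		return nil, err
	}
	// piped/redirected stdin is not a character device
	if fi.Mode()&os.ModeCharDevice != 0 {
		return nil, nil
	}
	return ioutil.ReadAll(os.Stdin)
}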

modifications to the dataset definition

add fields to dataset:

  • identifier
  • language (using ISO two-letter language codes; check whether that is currently in use)

modifications to dataset:

  • change license to string
  • add theme object
  • add accessUrl, downloadUrl, remove generic 'url'

modifications to json at processing time:

  • change keyword to list of strings using the name property of each object
  • format (check values, then to_lower)
  • remove any semicolons, whitespace
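
A sketch of those processing-time transforms (function names are illustrative):

package main

import "strings"

// keywordNames flattens keyword objects to a list of their name strings.
func keywordNames(keywords []map[string]interface{}) []string {
	names := []string{}
	for _, kw := range keywords {
		if name, ok := kw["name"].(string); ok {
			names = append(names, name)
		}
	}
	return names
}

// cleanFormat lowercases a format value and strips semicolons & whitespace.
func cleanFormat(format string) string {
	format = strings.ToLower(format)
	format = strings.Replace(format, ";", "", -1)
	return strings.Join(strings.Fields(format), "")
}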

dataset.Dataset.Save

Things would be greatly simplified if we had a single save function for a dataset in the dataset package that accepted a castore as its only argument. Let's do that.

  • refactor castore interface to actually reflect a content-addressed store
  • create an in-memory castore for testing purposes
  • refactor castore/ipfs to conform to the new interface
  • change dataset package to pointer references
  • add dataset.Dataset.Save method to dataset package, have it properly pull apart the dataset components into references
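
A sketch of the shape Save might take once the checklist above lands; the Store interface and field set are assumptions:

package dataset

import "encoding/json"

// Store stands in for the refactored castore interface.
type Store interface {
	Put(data []byte, pin bool) (key string, err error)
}

// Dataset is a minimal stand-in for the real dataset definition.
type Dataset struct {
	Title    string `json:"title"`
	Metadata string `json:"metadata,omitempty"` // component key after save
}

// Save pulls the dataset apart into components, writes each to the
// store, and returns the key of the root dataset document.
func (ds *Dataset) Save(store Store) (string, error) {
	// save sub-components (metadata, structure, ...) first, swapping
	// each inline value for the key it was stored under, then save the
	// root document itself
	data, err := json.Marshal(ds)
	if err != nil {
		return "", err
	}
	return store.Put(data, true)
}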

Dataset queries command

We need a way to show the queries that have been run on a given dataset. qri queries [dataset alias or hash] should list queries that have been asked of this dataset, showing the row & col count of their results.

[cli] qri init fails without an error message when no arg is provided

(output)

osterbit Desktop $ qri init
panic: runtime error: index out of range

goroutine 1 [running]:
github.com/qri-io/qri/cmd.glob..func8(0x250ab20, 0x255aec0, 0x0, 0x0)
	/Users/b5/go/src/github.com/qri-io/qri/cmd/init.go:45 +0xcb5
github.com/spf13/cobra.(*Command).execute(0x250ab20, 0x255aec0, 0x0, 0x0, 0x250ab20, 0x255aec0)
	/Users/b5/go/src/github.com/spf13/cobra/command.go:651 +0x23a
github.com/spf13/cobra.(*Command).ExecuteC(0x250b3a0, 0x1, 0x0, 0x0)
	/Users/b5/go/src/github.com/spf13/cobra/command.go:726 +0x339
github.com/spf13/cobra.(*Command).Execute(0x250b3a0, 0x0, 0x11)
	/Users/b5/go/src/github.com/spf13/cobra/command.go:685 +0x2b
github.com/qri-io/qri/cmd.Execute()
	/Users/b5/go/src/github.com/qri-io/qri/cmd/root.go:47 +0x2d
main.main()
	/Users/b5/go/src/github.com/qri-io/qri/main.go:20 +0x20

Query History Log

Once #30 lands, we should think about a qri queries command that shows a historical log of queries run. Let's construct the answer to #30 with this in mind.
