qri-io / qri
you're invited to a data party!
Home Page: https://qri.io
License: GNU General Public License v3.0
INITIAL ISSUE:
Any request to the API to fetch datasets comes back with a 500 status code
json response: { "meta": { "code": 500, "error": "invalid 'ipfs ref' path" } }
Happens when the qri electron app opens and the app calls to http://localhost:3000/datasets?page=1&pageSize=100
Happens when trying to search:
http://localhost:3000/search?p=movies&page=1&pageSize=100
This turned out to be caused by qri allowing something that wasn't a dataset to be added as a dataset.
We need better validation before allowing anything to be added as a dataset.
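A rough sketch of the kind of guard this needs: decode whatever the path resolves to, and reject it if it doesn't look like a dataset definition. The type and field names below are hypothetical stand-ins, not the real dataset package:

// Sketch only: reject anything that doesn't decode as a dataset
// definition before adding it. Dataset here is a hypothetical
// stand-in for the real dataset type.
package main

import (
	"encoding/json"
	"fmt"
)

type Dataset struct {
	Title     string `json:"title"`
	Structure string `json:"structure"`
}

// validateDataset returns an error when raw bytes fetched from IPFS
// don't decode as a dataset definition.
func validateDataset(raw []byte) error {
	ds := &Dataset{}
	if err := json.Unmarshal(raw, ds); err != nil {
		return fmt.Errorf("invalid dataset definition: %s", err.Error())
	}
	if ds.Structure == "" {
		return fmt.Errorf("dataset is missing a structure reference")
	}
	return nil
}

func main() {
	// a blob that isn't a dataset should be rejected, not added
	if err := validateDataset([]byte(`"just a string"`)); err != nil {
		fmt.Println("rejected:", err)
	}
}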
From dataset_sql source:
// Aggregates is a map of all aggregate functions.
var Aggregates = map[string]bool{
	"avg":          true,
	"bit_and":      true,
	"bit_or":       true,
	"bit_xor":      true,
	"count":        true,
	"group_concat": true,
	"max":          true,
	"min":          true,
	"std":          true,
	"stddev_pop":   true,
	"stddev_samp":  true,
	"stddev":       true,
	"sum":          true,
	"var_pop":      true,
	"var_samp":     true,
	"variance":     true,
}
It'd be great if we could land support for these, as many are commonly-used functions that would make for solid demos.
We need distributed lookup tables for datasets, metadata, and queries. This concept is currently under-developed and needs to exist ASAP, so let's start by building a "local only" registry. This'll help us think through the needs of the feature, while providing a way to demonstrate the CLI for now.
Once in place, the query engine should check the hash of a query against the registry and skip execution entirely if a result is already recorded, which will be a big win in and of itself.
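A minimal sketch of what a local-only registry keyed by query hash could look like; none of these names come from the codebase, it's just to think through the shape:

// Sketch of a local-only query registry: map a query's hash to the
// hash of its stored result, so re-running an identical query can
// skip execution. All names here are hypothetical.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sync"
)

type QueryRegistry struct {
	mu      sync.Mutex
	results map[string]string // query hash -> result hash
}

func NewQueryRegistry() *QueryRegistry {
	return &QueryRegistry{results: map[string]string{}}
}

func hashQuery(q string) string {
	sum := sha256.Sum256([]byte(q))
	return hex.EncodeToString(sum[:])
}

// Lookup returns a previously-recorded result hash for this query, if any.
func (r *QueryRegistry) Lookup(query string) (string, bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	h, ok := r.results[hashQuery(query)]
	return h, ok
}

// Record stores the result hash for a query after execution.
func (r *QueryRegistry) Record(query, resultHash string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.results[hashQuery(query)] = resultHash
}

func main() {
	reg := NewQueryRegistry()
	reg.Record("select * from movies", "/ipfs/<result hash>")
	if h, ok := reg.Lookup("select * from movies"); ok {
		fmt.Println("cached result:", h) // skip re-execution
	}
}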
Hash comparison of query results was lost in the refactor; we need to get it back so we can dedupe queries.
queryString is currently being stored in its abstract form; we need to improve on that.
We should be able to "export" from the CLI in raw data and package formats; we'll then get this working on the frontend as well.
With the paper refactor comes the removal of any globally-accepted notion of "repositories", and the namespace convention that came with it. While this may be reintroduced in the future, for now we need to provide users a plausible way to identify & work with data.
As a first proposal we'll introduce a concept of "datasets" to the CLI, which are to be thought of as the user's personal collection of datasets. Users can name these datasets whatever they please (so long as names don't overlap), and can add and remove a dataset as needed.
The CLI should have a few commands to support this:

qri search [query]
  -> Should search the network for datasets based on keywords or phrases, and display human-readable dataset info. For now this should just be based on a local registry, but display linked metadata.
qri dataset add [name] [resource hash]
  -> add a dataset to the user's current namespace
qri dataset remove [name]
  -> remove a dataset from the user's current namespace

Initial task list. This should be broken out into issues:
qri init --meta flag
qri init against all identified resources
qri should default to interacting with / creating a local IPFS node that it uses to resolve hashes over the network, and to add & pin content to the local node
need a way to disable all networking for local testing purposes
qri should ship with default datasets in its library for demonstration purposes
Get the codebase running against the recently-merged dataset and dataset_sql repos.
Need to check whether Electron gives us a method to call into Go code.
Need to be able to initialize a qri dataset with metadata, will be necessary to complete #13
Make an issue outlining a side project for Kasey (or another engineer learning the qri codebase): get an understanding of the Pandas DataFrame implementation (it represents 2D tabular data, functionally similar to SQL but with 'pythonic', numpy-inspired syntax, data structures & conventions), assess how interoperable the qri engine's low-level functions and data structures are with it, and gauge the difficulty or feasibility of writing a Python-to-Go wrapper or adapter.
(this happens a lot when you convert a pandas DataFrame to CSV and forget to set its index option to False, i.e. to_csv(..., index=False), which then exports the index as the first column without a header)
Need a placeholder SQL fmt command that "formats" a parsed SQL AST.
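dataset_sql is derived from vitess' SQL parser, so a placeholder fmt can probably just be parse-then-reprint. A sketch using the standalone xwb1989/sqlparser package to show the shape, assuming dataset_sql exposes equivalent Parse/String helpers (an assumption about its API):

// Placeholder "sql fmt" sketch: parse, then re-print the AST in
// canonical form. Swap in dataset_sql's own Parse/String equivalents.
package main

import (
	"fmt"

	"github.com/xwb1989/sqlparser"
)

func fmtSQL(query string) (string, error) {
	stmt, err := sqlparser.Parse(query)
	if err != nil {
		return "", err
	}
	// String re-serializes the AST, normalizing whitespace & casing,
	// which is all a placeholder formatter needs to do
	return sqlparser.String(stmt), nil
}

func main() {
	out, err := fmtSQL("SELECT   a,b   FROM t")
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // "select a, b from t"
}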
Need local capacity to edit the metadata of an existing dataset.
qri ds update [name/hash] -m [metadata file] -f [dataset file]
PUT /datasets/ipfs/:hash/metadata
& PUT /datasets/:hash/data
Need to implement the Pinner interface on castore/ipfs.
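A sketch of what that interface might look like; the exact method signatures here are guesses, not castore's real API:

// Sketch of a Pinner interface. Pinner is implemented by stores that
// can pin content so the underlying node won't garbage-collect it.
package main

import "fmt"

type Pinner interface {
	Pin(path string) error
	Unpin(path string) error
}

// memStore is a toy in-memory implementation to show the shape;
// castore/ipfs would call down into the IPFS pinning API instead.
type memStore struct {
	pinned map[string]bool
}

func (m *memStore) Pin(path string) error {
	m.pinned[path] = true
	return nil
}

func (m *memStore) Unpin(path string) error {
	delete(m.pinned, path)
	return nil
}

func main() {
	var p Pinner = &memStore{pinned: map[string]bool{}}
	if err := p.Pin("/ipfs/example"); err != nil {
		panic(err)
	}
	fmt.Println("pinned /ipfs/example")
}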
Right now it's pretty difficult to download & build qri. We should:
Need to bring webapp back up to speed post-refactor.
We need a basic search feature for qri, which means first building in the infrastructure to do search. Later on we'll actually work out sending the search terms themselves across the network or something, but for now keeping a deduplicated list of dataset references seems like a good idea. Dataset histories are going to mess with that a bunch, but we'll cross that bridge later.
To start with, let's just get an "optionless" skeleton function that works when you run qri run. It will only do one thing, but it'll prove the model & elucidate the path forward. It should:
The prior version of qri used a strange variation on SQL syntax for namespace purposes; we need to restore standard SQL syntax to the dataset_sql package.
Currently all queries are pinned to the IPFS repo; we should make the default not save, and instead provide a --save flag in the CLI (rough sketch after this list):
add a pin bool argument to the Put method
add a Pin method to the castore interface
change dataset_sql.Exec to not pin by default
dataset_sql.Exec should listen to the "save" arg in the QueryExecOptions of a dataset and pin the dataset result hash
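A sketch of the not-pinned-by-default flow; QueryExecOptions is named above, but its fields and the store interface here are hypothetical:

// Sketch of Exec honoring a save option instead of always pinning.
package main

import "fmt"

type QueryExecOptions struct {
	Save bool // when true, pin the query result
}

// Store stands in for castore with a pin-aware Put.
type Store interface {
	Put(data []byte, pin bool) (path string, err error)
}

// Exec runs a query and only pins the result when opts.Save is set.
func Exec(store Store, query string, opts QueryExecOptions) (string, error) {
	result := []byte("...query result...") // stand-in for real execution
	return store.Put(result, opts.Save)
}

// toyStore just reports whether Put was asked to pin.
type toyStore struct{}

func (toyStore) Put(data []byte, pin bool) (string, error) {
	return fmt.Sprintf("/ipfs/fake (pinned: %v)", pin), nil
}

func main() {
	path, _ := Exec(toyStore{}, "select * from t", QueryExecOptions{Save: false})
	fmt.Println(path) // not pinned by default
}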
qri init used to be the way to run schema detection & validation on a dataset, and to add datasets to the user's local namespace. We need to refactor this to work with the new "white paper refactored" code. qri init should still run validation & schema detection, but this time successful dataset initialization should add the dataset, resource def & metadata to the local IPFS node & broadcast its existence onto some sort of distributed dataset registry.
Forgot the second dash on --name and ran
qri init -f survey_cleaned.csv -m meta_survey_cleaned.json -name researchTools
which named the dataset 'ame' (the flag parser read '-name' as '-n ame') rather than telling me '-name' was an invalid flag.
This page contains a nice example comparison / mapping between POD spec & others. We should do the same for qri's metadata spec.
namespace names currently point to dataset.Structure IPFS paths; they should point to dataset.Dataset hashes.
currently castore just writes & pins all components of a datastore tree to the top-level /ipfs/, which is silly. We should save datasets as everything in the tree except the data itself, which should be a plain ol' IPFS path. This'll depend on landing qri-io/cafs#1 & qri-io/cafs#2.
This is a placeholder issue for thinking about building a robust data ingest pipeline for qri. Things we're interested in include how this relates to the qri init CLI function, if at all.

Our CLI needs a test suite if we're going to avoid shipping regressions. We should set one up that operates out of os.TempDir().
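A sketch of the shape such a test could take; the QRI_PATH env var is an assumption about how we'd point the CLI at a throwaway repo:

// Sketch of a CLI test that runs against a throwaway repo under
// os.TempDir(), so tests never touch a developer's real qri repo.
package cmd

import (
	"io/ioutil"
	"os"
	"testing"
)

func TestCLIInTempDir(t *testing.T) {
	dir, err := ioutil.TempDir(os.TempDir(), "qri_test")
	if err != nil {
		t.Fatal(err)
	}
	defer os.RemoveAll(dir)

	// point the CLI at the throwaway repo for the duration of the test
	os.Setenv("QRI_PATH", dir)
	defer os.Unsetenv("QRI_PATH")

	// ...invoke commands here and assert on their output...
}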
The repo package has fallen out of step with current thinking; we need a big revisit of its structure & interfaces.
It'd be nice to pipe data directly into the qri init function, and get this going as a general pattern. These should work:
qri init < data.csv
qri run < query.sql
...
And so on
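Detecting piped input is the standard os.Stdin.Stat() trick; a sketch:

// Sketch of detecting piped stdin so `qri init < data.csv` can read
// data from the pipe instead of requiring a -f flag.
package main

import (
	"fmt"
	"io/ioutil"
	"os"
)

func main() {
	stat, err := os.Stdin.Stat()
	if err != nil {
		panic(err)
	}
	// when stdin is a pipe (not a terminal), treat it as input data
	if stat.Mode()&os.ModeCharDevice == 0 {
		data, err := ioutil.ReadAll(os.Stdin)
		if err != nil {
			panic(err)
		}
		fmt.Printf("read %d bytes from stdin\n", len(data))
		return
	}
	fmt.Println("no piped input; fall back to flags")
}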
For example, this works:
qri init \
-f 8d582c89-2a55-4647-922e-a696fe89d908.csv \
-m 8d582c89-2a55-4647-922e-a696fe89d908.json \
-n 8d582c89-2a55-4647-922e-a696fe89d908
but then these fail:
qri run "select * from 8d582c89-2a55-4647-922e-a696fe89d908"
qri run "select * from '8d582c89-2a55-4647-922e-a696fe89d908'"
Right now it errors with "IPFS lockfile blah blah blah"; it would be much nicer if it just said "hey, are you already running an IPFS daemon?"
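A sketch of wrapping the raw error with a hint; matching on the "lock" substring is a heuristic for illustration, not the real fix:

// Sketch: wrap an opaque IPFS repo lockfile error with a hint.
package main

import (
	"errors"
	"fmt"
	"strings"
)

func friendlyRepoError(err error) error {
	if err == nil {
		return nil
	}
	if strings.Contains(err.Error(), "lock") {
		return fmt.Errorf("%v\nhint: are you already running an IPFS daemon?", err)
	}
	return err
}

func main() {
	raw := errors.New("cannot acquire IPFS repo lockfile")
	fmt.Println(friendlyRepoError(raw))
}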
add fields to dataset:
modifications to dataset:
modifications to json at processing time:
Things would be greatly simplified if we had a single save function for a dataset in the dataset package that accepted a castore as its only argument. Let's do that.
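A sketch of the shape; Store and Dataset here are toy stand-ins for castore and the real dataset type:

// Sketch of the proposed single save entry point: one function that
// serializes a dataset and writes it through the store.
package main

import (
	"encoding/json"
	"fmt"
)

// Store is a stand-in for the castore interface.
type Store interface {
	Put(data []byte) (path string, err error)
}

type Dataset struct {
	Title string `json:"title"`
}

// Save serializes the dataset and writes it through the store,
// returning the content-addressed path of the saved definition.
func Save(store Store, ds *Dataset) (string, error) {
	data, err := json.Marshal(ds)
	if err != nil {
		return "", err
	}
	return store.Put(data)
}

// mapStore is a toy Store for demonstration.
type mapStore map[string][]byte

func (m mapStore) Put(data []byte) (string, error) {
	path := fmt.Sprintf("/map/%d", len(m))
	m[path] = data
	return path, nil
}

func main() {
	path, _ := Save(mapStore{}, &Dataset{Title: "example"})
	fmt.Println("saved at", path)
}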
We need a way to show the queries that have been run on a given dataset. qri queries [dataset alias or hash] should list queries that have been asked of this dataset, showing the row & col count of their results.
(output)
osterbit Desktop $ qri init
panic: runtime error: index out of range
goroutine 1 [running]:
github.com/qri-io/qri/cmd.glob..func8(0x250ab20, 0x255aec0, 0x0, 0x0)
/Users/b5/go/src/github.com/qri-io/qri/cmd/init.go:45 +0xcb5
github.com/spf13/cobra.(*Command).execute(0x250ab20, 0x255aec0, 0x0, 0x0, 0x250ab20, 0x255aec0)
/Users/b5/go/src/github.com/spf13/cobra/command.go:651 +0x23a
github.com/spf13/cobra.(*Command).ExecuteC(0x250b3a0, 0x1, 0x0, 0x0)
/Users/b5/go/src/github.com/spf13/cobra/command.go:726 +0x339
github.com/spf13/cobra.(*Command).Execute(0x250b3a0, 0x0, 0x11)
/Users/b5/go/src/github.com/spf13/cobra/command.go:685 +0x2b
github.com/qri-io/qri/cmd.Execute()
/Users/b5/go/src/github.com/qri-io/qri/cmd/root.go:47 +0x2d
main.main()
/Users/b5/go/src/github.com/qri-io/qri/main.go:20 +0x20
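The trace points at cmd/init.go:45; an index-out-of-range panic on a bare qri init usually means args are indexed without a length check first. Whether that's what line 45 actually does is an assumption, but the guard would look like:

// Sketch of the likely fix: never index command args without
// checking length, so a bare `qri init` errors instead of panicking.
package main

import (
	"errors"
	"fmt"
)

func runInit(args []string) error {
	if len(args) < 1 {
		return errors.New("init requires a file argument, or use the -f flag")
	}
	fmt.Println("initializing from:", args[0])
	return nil
}

func main() {
	if err := runInit(nil); err != nil {
		fmt.Println("error:", err) // friendly error, no panic
	}
}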