
frictionlessdata / frictionlessdata.io


The main repository of the Frictionless Data project. Website, issues, and discussions

Home Page: http://frictionlessdata.io

License: MIT License

Vue 65.07% JavaScript 32.13% Stylus 1.92% CSS 0.89%

frictionlessdata.io's Introduction

Frictionless Data


This is a repo for managing the Frictionless Data project - https://frictionlessdata.io/. As such, it is more core-team focused. 😄

Want to cite this repo? Please use the DOI from the repository's DOI badge.

Where to open issues?

Summary:

How to contribute to the website

This is the new FrictionlessData.io website, to be released in 2020. It reflects the recent updates made to the Frictionless Data project setup and brand.

Development

$ npm install
$ npm start

Deployment

New commits to the master branch are automatically deployed to GitHub Pages by a workflow.

frictionlessdata.io's People

Contributors

danfowler, dependabot[bot], lwinfree, monikappv, roll, rufuspollock, sapetti9, serahkiburu, shashigharti, stephen-gates


frictionlessdata.io's Issues

Example Data Packages and SDF datasets

  • site integration
  • examples

Site Integration

Examples

Non-SDF

SDF

DataPackage.json Creator for gists

I'm exploring using gists as a way for novices to group small numbers of simple data files together. Creating the datapackage file is a pain, so I wondered whether the datapackage.json creator could automatically generate metadata from gists directly? For example, https://gist.github.com/psychemedia/5633865.json describes a gist with several files:
{
  // ...
  "files": [
    "BANGLADESH-RMG-EXPORT-TO-WORLD-BY-COUNTRY.csv",
    "BANGLADESH-RMG-EXPORTS-TO-WORLD-10-12-MONTHLY.csv",
    "BANGLADESH-RMG-EXPORTS-TO-WORLD-11-13-MONTHLY.csv",
    "BGEA-MEMBERSHIP-AND-EMPLOYMENT.csv",
    "BGMEA-COMPARATIVE-STATEMENT-ON-EXPORT-OF-RMG-AND-TOTAL-EXPORT-OF-BANGLADESH.csv",
    "MAIN-APPAREL-ITEMS-EXPORTED-FROM-BANGLADESH.csv",
    "VALUE-AND-QUANTITY-OF-TOTAL-APPAREL-EXPORT-FISCAL.csv",
    "VALUE-AND-QUANTITY-OF-TOTAL-APPAREL-EXPORT.csv",
    "datapackage.json"
  ]
  // ...
}

(The datapackage.json in that gist was an early attempt at hand-crafting part of the data package description.)

Would it make sense to build a series of plugins based around URL detection? For example:
i) paste in url='https://gist.github.com/psychemedia/5633865'
ii) identify 'https://gist.github.com/' from the url
iii) retrieve url+'.json'
iv) loop through files[], identify the .csv files and generate a combined datapackage.json
v) display the datapackage.json for the user to add to the gist, or
vi) provide the datapackage.json generation as a service (e.g. call http://data.okfn.org/tools/dp/api/create?url=https://gist.github.com/psychemedia/5633865 and get the datapackage.json back as JSON)?

Looking at the gist JSON, there is also a 'description' element that could be used as the datapackage description. The original URL could also be used to set a default name (e.g. gist-psychemedia-5633865).
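
A minimal Node.js sketch of steps (i)-(iv), assuming the gist '.json' endpoint returns the shape shown above; the function name and the name/description defaults are illustrative, not an existing tool:

// Sketch: derive a datapackage.json from a public gist URL.
// Assumes Node 18+ (global fetch) and the gist JSON shape shown above.
async function dataPackageFromGist(gistUrl) {
  if (!gistUrl.startsWith('https://gist.github.com/')) {
    throw new Error('not a gist URL');
  }
  // Step (iii): retrieve the gist metadata as JSON.
  const gist = await (await fetch(gistUrl + '.json')).json();
  // Step (iv): keep only the CSV files and turn each into a resource entry.
  const resources = gist.files
    .filter((name) => name.toLowerCase().endsWith('.csv'))
    .map((name) => ({
      name: name.replace(/\.csv$/i, '').toLowerCase(),
      path: name,
      format: 'csv'
    }));
  return {
    // Default name derived from the URL, e.g. gist-psychemedia-5633865.
    name: 'gist-' + gistUrl.split('/').slice(-2).join('-'),
    description: gist.description || '',
    resources
  };
}

// Step (v): show the generated descriptor so the user can add it to the gist.
dataPackageFromGist('https://gist.github.com/psychemedia/5633865')
  .then((dp) => console.log(JSON.stringify(dp, null, 2)));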

Switch from python back to pure JS

Why?

  • Keep it ultra-simple
  • Easier to deploy and modify
  • The API can be a separate app anyway ...

Plus we can always use Node.js for the SEO side of things and for having proper URLs ...

Pros / Cons

For Browser JS

  • Ease of deployment via e.g. gh-pages (but Heroku ain't that bad)
  • Can be used with any index.json (so could be used locally) - BUT YAGNI ...
  • Easier for people to hack on (relatively small benefit)
  • Native JS + JSON ...

Plus Node.js

  • Get all the benefits of Python

For Python

  • Integrated API - BUT the API will be separate?
  • SEO
  • Proper redirects, more mature ...
  • Already doing it in Python ...

DataPackage.json creator

Given a CSV file, help me create a datapackage.json (interactively).

  • key fields: name, title, license (defaulting to Open Data Commons ...)
  • schema for resources ...
  • JSON API
  • HTML page + interactive form

Locate at: /tools/dp/creator (or just /tools/creator)?

Bonus points for:

  • type guessing (for fields)
  • POST support
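
A minimal Node.js sketch of the type guessing and descriptor assembly; the guessType heuristic, the createDataPackage helper, and the default license id are illustrative assumptions, not an existing API:

// Sketch: guess a field type from sample values (number, then date, else string).
function guessType(values) {
  const sample = values.filter((v) => v !== '' && v != null);
  if (sample.length && sample.every((v) => !isNaN(Number(v)))) return 'number';
  if (sample.length && sample.every((v) => !isNaN(Date.parse(v)))) return 'date';
  return 'string';
}

// Build a minimal datapackage.json descriptor for one CSV file,
// given its header row and a few sample data rows.
function createDataPackage(name, headers, sampleRows) {
  return {
    name,
    title: name,
    licenses: [{ name: 'odc-pddl' }], // assumed Open Data Commons default id
    resources: [{
      name,
      path: name + '.csv',
      schema: {
        fields: headers.map((header, i) => ({
          name: header,
          type: guessType(sampleRows.map((row) => row[i]))
        }))
      }
    }]
  };
}

// Usage with a tiny in-memory sample:
const dp = createDataPackage('gold-prices',
  ['date', 'price'],
  [['1950-01-01', '34.73'], ['1951-01-01', '34.66']]);
console.log(JSON.stringify(dp, null, 2));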

Data version API

I would like to be able to use data.okfn.org as an intermediary between my software and the data packages it uses and be able to quickly check whether there's a new version available of the data (e.g. if I've cached the package on a local machine).

There are ways to do it with the current setup:

  1. Download the datapackage.json descriptor file, parse it, get the version there, and check it against my local version. Problems:
    • This solution relies on humans updating the version, and there might not be any consistency in it, since the data package standard only describes the version attribute as "a version string conforming to the Semantic Versioning requirement".
    • I have to fetch the whole datapackage.json (it's not big, I know, but why download extra data I might not even want?).
  2. Go around data.okfn.org and look directly at the GitHub repository. Problems:
    • I have to find out where the repo is, use git, and do a lot of extra work (I don't care how the data packages are stored, I just want a simple interface to fetch them).
    • What would be the point of data.okfn.org/data? In my mind it collects data packages and provides a consistent interface to get them irrespective of how they are stored.

I propose data.okfn.org provides an internal system to allow users to quickly check whether a new version has been released. This does not have to be an API. We could leverage HTTP's caching mechanism using an ETag header that contains some hash value. This hash value could, for example, be the sha value of the heads ref object served via the GitHub API:

https://api.github.com/repos/datasets/cpi/git/refs/heads/master

Software that works with data packages could then implement a caching strategy and just send a request with an If-None-Match header along with a GET request for datapackage.json to either get a new version of the descriptor (and look at the version in that file) or just serve the data from its cache.
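
A minimal client-side sketch of that caching strategy in Node.js, assuming the server exposes the proposed ETag; the in-memory cache and the example URL are illustrative:

// Sketch: conditional GET for a datapackage.json using ETags (Node 18+ fetch).
const cache = new Map(); // url -> { etag, body }

async function fetchDescriptor(url) {
  const cached = cache.get(url);
  const headers = cached ? { 'If-None-Match': cached.etag } : {};
  const response = await fetch(url, { headers });
  if (response.status === 304) {
    // Not modified: serve the descriptor from the local cache.
    return cached.body;
  }
  const body = await response.json();
  cache.set(url, { etag: response.headers.get('ETag'), body });
  return body;
}

// Usage (hypothetical URL following the /data/{dataset}/ pattern):
fetchDescriptor('http://data.okfn.org/data/cpi/datapackage.json')
  .then((dp) => console.log(dp.version));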

Normalize licenses and license names and display in dataset view

At the moment it is not clear exactly what is required for licenses: some of the time we just have ids, and other times names and URLs. We want to ensure that, given an id, we always have a name and URL - we could look this up from licenses.opendefinition.org ...

In terms of the interface, we also want to handle the unknown case (should that ever happen!!)

This would be part of the tools datapackage normalize code.
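
A minimal Node.js sketch of such a lookup; the exact licenses.opendefinition.org endpoint path and response shape (an object keyed by license id, with title and url fields) are assumptions:

// Sketch: given a license entry with at least an id, fill in name and url
// from the Open Definition licenses service (endpoint path assumed).
async function normalizeLicense(license) {
  if (license.name && license.url) return license; // already complete
  const registry = await (await fetch(
    'http://licenses.opendefinition.org/licenses/groups/all.json')).json();
  const entry = registry[license.id];
  if (!entry) {
    // Unknown id: keep what we have and flag it so the interface can show "unknown".
    return { ...license, name: license.name || 'Unknown license', unknown: true };
  }
  return { id: license.id, name: entry.title, url: entry.url };
}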

Tools page

  • Overview of tools
  • Instructions on how to contribute more

R import

Import data into R from a Data Package, especially a Simple Data Format data package, which has CSV data (see http://data.okfn.org/standards for more on Data Packages and SDF).

Steps:

  • R command that would import given a datapackage.json url (or file on local disk)
  • Bonus would be an R command that used data.okfn.org and the data package name e.g. "gold-prices" or "cpi"

Example - House prices regressed on long interest rate and GDP (and population)

  • R example
  • (Maybe we do in other languages too later ...)

What does it demonstrate: quickly getting data and using it together

Required data

Do we need this quarterly?

Why this analysis

  • People are interested in house prices - why do they go up and down?
  • The classic explanation is other economic variables, e.g. demand (GDP, population and interest rates (mortgages)) and supply (housing stock). Here we will just use the demand variables and see what we find.

Standards page

  • Data packages - general
  • Simple Data Format - specific

Conceptual ideas:

  • Parsimony
  • Progressive degradation (to CSV)
  • etc

Stats tool

Given a Tabular Data Package, compute stats for each file.

A stat looks like:

{
  "count": record-count,
  "fields": [
    {
      "name": field-id,
      "sum": ...,
      "avg": ...,
      ...
    }
  ]
}
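
A minimal Node.js sketch of computing such a stats object for one in-memory table; the row format (array of objects keyed by field name) and the choice of aggregates are assumptions:

// Sketch: compute per-field stats for one tabular resource.
// Non-numeric values are ignored when computing sum and avg.
function computeStats(rows) {
  const fieldNames = rows.length ? Object.keys(rows[0]) : [];
  return {
    count: rows.length,
    fields: fieldNames.map((name) => {
      const values = rows.map((row) => Number(row[name])).filter((v) => !isNaN(v));
      const sum = values.reduce((a, b) => a + b, 0);
      return { name, sum, avg: values.length ? sum / values.length : null };
    })
  };
}

// Usage with a tiny sample table:
console.log(JSON.stringify(computeStats([
  { year: '2012', value: '10' },
  { year: '2013', value: '14' }
]), null, 2));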

Example - Publish Data Package from Google Spreadsheet

More of a blog post than something on the site (publish on e.g. okfnlabs.org). Think this is pretty useful.

Walk through of turning a google spreadsheet into a data package

  • publish to the web
  • get public CSV url for relevant sheet
  • create datapackage.json (cf #28) and publish it somewhere
    • this is an important one - should we make this part of data.okfn.org itself (cf #52 - community catalog)? The simplest option would be a gist, or even pasting the JSON into a Google Doc and getting a raw text URL for that (seems quite clunky!)

cf #30 (google spreadsheet export tool)

Cost: 2-3h
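
A minimal example of what the descriptor from step 3 could look like, pointing straight at the published sheet's CSV URL; the name, the URL pattern, and the use of a url key on the resource are placeholders and may differ by spec version:

{
  "name": "my-sheet-data",
  "title": "My sheet data",
  "resources": [
    {
      "name": "my-sheet-data",
      "format": "csv",
      "url": "https://docs.google.com/spreadsheets/d/<sheet-id>/export?format=csv&gid=0"
    }
  ]
}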

Search datasets

As a User I want to search datasets so that I can explore quickly and find what I want

Two forms:

  • Full search - with filtered results on /data page
  • Quick search with autocomplete taking me to my dataset

UI

  • search box at top of search page
  • search box in navbar?

Datasets Data URLs and API generally

This issue is about the URL / API structure for accessing data (and metadata) from the data packages.

Current Situation

  • For stuff under /data/: /data/{dataset}/datapackage.json and /data/{dataset}.csv
  • For other stuff either at /tools/view/ or /community/ via: http://data.okfn.org/tools/dataproxy/?url={path-to-csv} (though this is not much different from datapipes.okfnlabs.org/csv/raw/?url=.... and leaves much to be desired)

Proposal

/data/ + /community/ data packages

For /data/ and /community/ data packages:

/.../{dataset}/datapackage.json     # the datapackage.json file

## data urls
/.../{dataset}/r/{resource-name-or-order}.{format}  

so e.g.

/.../gdp/r/annual.csv   # resource by name
/.../gdp/r/0.csv        # resource by index

Formats that we should support would be:

  • {format} = csv | json | html | raw (by default)
  • {resource-name} = name as in resources entry. (Also allow order e.g. 1 for first resource, 2 for second resource etc).

Data packages somewhere online

We follow something similar to the other case, but instead of putting the data package name in the URL we move the data package URL to the query string:

/api/datapackage.json?url={datapackage-url}
/api/data/{resource-name-or-index}.{format}?url={datapackage-url}

# e.g. this returns first resource as CSV
/api/data/0.csv?url=https://raw.github.com/datasets/browser-stats/master/datapackage.json
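
A minimal sketch of how the proposed proxy endpoint could be routed, assuming Express and Node 18+ fetch; this is an illustrative sketch, not the existing site code, and only the csv branch is filled in:

// Sketch of the proposed /api/data/{resource-name-or-index}.{format}?url=... endpoint.
const express = require('express');
const app = express();

app.get('/api/data/:resource.:format', async (req, res) => {
  const { resource, format } = req.params;
  const descriptorUrl = req.query.url;
  const descriptor = await (await fetch(descriptorUrl)).json();
  // Resolve the resource by name, or by index if a number was given.
  const entry = /^\d+$/.test(resource)
    ? descriptor.resources[Number(resource)]
    : descriptor.resources.find((r) => r.name === resource);
  if (!entry) return res.status(404).send('resource not found');
  // Resolve the resource location relative to the descriptor URL and proxy it.
  const dataUrl = new URL(entry.path || entry.url, descriptorUrl).href;
  const csv = await (await fetch(dataUrl)).text();
  if (format === 'csv') return res.type('text/csv').send(csv);
  res.status(501).send('format not implemented in this sketch');
});

app.listen(3000);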

Discussion

  • data.json is the serialization in the most obvious way - i.e. convert to a hash
    • alternative provide this in a results style format (and include the schema)
  • Should we use download attribute to set filename ...?
    • Not needed in above
  • (Now supported) How do we handle multiple data resources / files?
    • worry about that in the future - so only support first resource for the moment (this is good as it privileges single resource data packages ...)

Appendix

Alternatives

Alternatively could be:

{dataset}/{filename}.csv
{dataset}/{filename}.json (CORS enabled ...)

Or

{dataset}/data.csv

Think the former is better ...

Include Issues in dataset page

The simplest option is a link to the GitHub issues.

Improvements:

  • Show number of issues
  • Link for a new issue (which correctly states which file we are on, for datasets with multiple files ...)
  • JavaScript popup of the issues ...

Registry/Catalog of Community Data Packages

We want to allow people to register the data packages they've created in a "community" catalog portion of the site.

URL structure like: /community/{username}/{dp-name}

Options for Implementation

Github proxy option (see below)

Note: this option is what is currently (partially) implemented (see comments below for more info).

  • data package at github.com/{username}/{repo} shows up at data.okfn.org/community/{username}/{repo}
  • data.okfn.org/community has some info! (no longer relevant as moved to /data)
  • data.okfn.org/community/{username} gives a nice listing now #111

Full option

  • login approach decided (github/twitter/...)
  • where do we store info (s3, db, ...)?
  • What do you register (path to a datapackage.json ...)?

Other related things

Creation of data packages (for you)

For example:

  • Provide a link to a published Google Doc page and a data package is created for you (where is it stored? GitHub, S3, a database?) and a web page is created

For Launch

Site refactor

  • New front page - #22
  • Move data catalog - #21
  • standards page - #36
  • tools page - #37

Examples & tooling

Integration

  • Load in sqlite - #25 (sqlite)
  • Load into postgresql - #26
  • Google spreadsheet load #24
  • R load - #23

Wherever possible, put these in gists or similar so they are easy to embed and update.

Example Data

  • #35 - all example data

User Story Examples

  • Deflation
  • House prices on GDP, long term interest rates, population - #32
