
frictionlessdata / frictionlessdata.io


The main repository of the Frictionless Data project. Website, issues, and discussions

Home Page: http://frictionlessdata.io

License: MIT License

Vue 65.07% JavaScript 32.13% Stylus 1.92% CSS 0.89%

frictionlessdata.io's Introduction

Frictionless Data


This is a repo for managing the Frictionless Data project - https://frictionlessdata.io/. As such, it is more core-team focused. 😄

Want to cite this repo? Please use the DOI from the repository's DOI badge.

Where to open issues?

Summary:

How to contribute to the website

This is the new FrictionlessData.io website, to be released in 2020. It reflects the recent updates made to the Frictionless Data project setup and brand.

Development

$ npm install
$ npm start

Deployment

New commits to the master branch are automatically deployed to GitHub Pages by a workflow.

frictionlessdata.io's People

Contributors

danfowler, dependabot[bot], lwinfree, monikappv, roll, rufuspollock, sapetti9, serahkiburu, shashigharti, stephen-gates


frictionlessdata.io's Issues

Example Data Packages and SDF datasets

  • site integration
  • examples

Site Integration

Examples

Non-SDF

SDF

DataPackage.json Creator for gists

I'm exploring using gists as a way for novices to group small numbers of simple data files together. Creating the datapackage file is a pain, so I wondered whether the datapackage.json creator could automatically generate metadata from gists directly? For example, https://gist.github.com/psychemedia/5633865.json describes a gist with several files:
{
  // ...
  "files": [
    "BANGLADESH-RMG-EXPORT-TO-WORLD-BY-COUNTRY.csv",
    "BANGLADESH-RMG-EXPORTS-TO-WORLD-10-12-MONTHLY.csv",
    "BANGLADESH-RMG-EXPORTS-TO-WORLD-11-13-MONTHLY.csv",
    "BGEA-MEMBERSHIP-AND-EMPLOYMENT.csv",
    "BGMEA-COMPARATIVE-STATEMENT-ON-EXPORT-OF-RMG-AND-TOTAL-EXPORT-OF-BANGLADESH.csv",
    "MAIN-APPAREL-ITEMS-EXPORTED-FROM-BANGLADESH.csv",
    "VALUE-AND-QUANTITY-OF-TOTAL-APPAREL-EXPORT-FISCAL.csv",
    "VALUE-AND-QUANTITY-OF-TOTAL-APPAREL-EXPORT.csv",
    "datapackage.json"
  ]
  // ...
}

(The datapackage.json in that gist was an early attempt at hand-crafting part of the data package description.)

Would it make sense to build a series of plugins based around URL detection? For example:
i) paste in url='https://gist.github.com/psychemedia/5633865'
ii) identify 'https://gist.github.com/' from the url
iii) retrieve url+'.json'
iv) loop through files[], identify the .csv files and generate a combined datapackage.json
v) display the datapackage.json for the user to add to the gist, or
vi) provide the datapackage.json generation as a service (e.g. call http://data.okfn.org/tools/dp/api/create?url=https://gist.github.com/psychemedia/5633865 and get the datapackage.json back as JSON)?

Looking at the gist JSON, there is also a 'description' element that could be used as the datapackage description. The original URL could also be used to set a default name (e.g. gist-psychemedia-5633865).
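
A minimal Node.js sketch of steps (i)-(iv), assuming the gist '.json' endpoint returns the shape shown above; the function name and the name/description defaults are illustrative, not an existing tool:

// Sketch: derive a datapackage.json from a public gist URL.
// Assumes Node 18+ (global fetch) and the gist JSON shape shown above.
async function dataPackageFromGist(gistUrl) {
  if (!gistUrl.startsWith('https://gist.github.com/')) {
    throw new Error('not a gist URL');
  }
  // Step (iii): retrieve the gist metadata as JSON.
  const gist = await (await fetch(gistUrl + '.json')).json();
  // Step (iv): keep only the CSV files and turn each into a resource entry.
  const resources = gist.files
    .filter((name) => name.toLowerCase().endsWith('.csv'))
    .map((name) => ({
      name: name.replace(/\.csv$/i, '').toLowerCase(),
      path: name,
      format: 'csv'
    }));
  return {
    // Default name derived from the URL, e.g. gist-psychemedia-5633865.
    name: 'gist-' + gistUrl.split('/').slice(-2).join('-'),
    description: gist.description || '',
    resources
  };
}

// Step (v): show the generated descriptor so the user can add it to the gist.
dataPackageFromGist('https://gist.github.com/psychemedia/5633865')
  .then((dp) => console.log(JSON.stringify(dp, null, 2)));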

Switch from python back to pure JS

Why?

  • Keep it ultra-simple
  • Easier to deploy and modify
  • The API can be a separate app anyway ...

Plus we can always use Node.js for the SEO side of things and for having proper URLs ...

Pros / Cons

For Browser JS

  • Ease of deployment via e.g. gh-pages (but Heroku ain't that bad)
  • Can be used with any index.json (so could be used locally) - BUT YAGNI ...
  • Easier for people to hack on (relatively small benefit)
  • Native JS + JSON ...

Plus Node.js

  • Get all the benefits of Python

For Python

  • Integrated API - BUT the API will be separate?
  • SEO
  • Proper redirects, more mature ...
  • Already doing it in Python ...

DataPackage.json creator

Given a CSV file, help me create a datapackage.json (interactively).

  • key fields: name, title, license (defaulting to Open Data Commons ...)
  • schema for resources ...
  • JSON API
  • HTML page + interactive form

Locate at: /tools/dp/creator (or just /tools/creator)?

Bonus points for:

  • type guessing (for fields)
  • POST support
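
A minimal Node.js sketch of the type guessing and descriptor assembly; the guessType heuristic, the createDataPackage helper, and the default license id are illustrative assumptions, not an existing API:

// Sketch: guess a field type from sample values (number, then date, else string).
function guessType(values) {
  const sample = values.filter((v) => v !== '' && v != null);
  if (sample.length && sample.every((v) => !isNaN(Number(v)))) return 'number';
  if (sample.length && sample.every((v) => !isNaN(Date.parse(v)))) return 'date';
  return 'string';
}

// Build a minimal datapackage.json descriptor for one CSV file,
// given its header row and a few sample data rows.
function createDataPackage(name, headers, sampleRows) {
  return {
    name,
    title: name,
    licenses: [{ name: 'odc-pddl' }], // assumed Open Data Commons default id
    resources: [{
      name,
      path: name + '.csv',
      schema: {
        fields: headers.map((header, i) => ({
          name: header,
          type: guessType(sampleRows.map((row) => row[i]))
        }))
      }
    }]
  };
}

// Usage with a tiny in-memory sample:
const dp = createDataPackage('gold-prices',
  ['date', 'price'],
  [['1950-01-01', '34.73'], ['1951-01-01', '34.66']]);
console.log(JSON.stringify(dp, null, 2));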

Data version API

I would like to be able to use data.okfn.org as an intermediary between my software and the data packages it uses and be able to quickly check whether there's a new version available of the data (e.g. if I've cached the package on a local machine).

There are ways to do it with the current setup:

  1. Download the datapackage.json descriptor file, parse it, get the version there, and check it against my local version. Problems:
    • This solution relies on humans updating the version, and there might not be any consistency in it, since the data package standard only describes the version attribute as "a version string conforming to the Semantic Versioning requirement".
    • I have to fetch the whole datapackage.json (it's not big, I know, but why download extra data I might not even want?).
  2. Go around data.okfn.org and look directly at the GitHub repository. Problems:
    • I have to find out where the repo is, use git, and do a lot of extra work (I don't care how the data packages are stored, I just want a simple interface to fetch them).
    • What would be the point of data.okfn.org/data? In my mind it collects data packages and provides a consistent interface to get them irrespective of how they are stored.

I propose data.okfn.org provides an internal system to allow users to quickly check whether a new version has been released. This does not have to be an API. We could leverage HTTP's caching mechanism using an ETag header that contains some hash value. This hash value could, for example, be the sha value of the heads ref object served via the GitHub API:

https://api.github.com/repos/datasets/cpi/git/refs/heads/master

Software that works with data packages could then implement a caching strategy and just send a request with an If-None-Match header along with a GET request for datapackage.json to either get a new version of the descriptor (and look at the version in that file) or just serve the data from its cache.
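
A minimal client-side sketch of that caching strategy in Node.js, assuming the server exposes the proposed ETag; the in-memory cache and the example URL are illustrative:

// Sketch: conditional GET for a datapackage.json using ETags (Node 18+ fetch).
const cache = new Map(); // url -> { etag, body }

async function fetchDescriptor(url) {
  const cached = cache.get(url);
  const headers = cached ? { 'If-None-Match': cached.etag } : {};
  const response = await fetch(url, { headers });
  if (response.status === 304) {
    // Not modified: serve the descriptor from the local cache.
    return cached.body;
  }
  const body = await response.json();
  cache.set(url, { etag: response.headers.get('ETag'), body });
  return body;
}

// Usage (hypothetical URL following the /data/{dataset}/ pattern):
fetchDescriptor('http://data.okfn.org/data/cpi/datapackage.json')
  .then((dp) => console.log(dp.version));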

Normalize licenses and license names and display in dataset view

At the moment it is not clear exactly what is required for licenses: some of the time we just have ids, and other times names and URLs. We want to ensure that, given an id, we always have a name and URL - we could look this up from licenses.opendefinition.org ...

In terms of the interface, we also want to handle the unknown case (should that ever happen!!)

This would be part of the tools datapackage normalize code.
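
A minimal Node.js sketch of such a lookup; the exact licenses.opendefinition.org endpoint path and response shape (an object keyed by license id, with title and url fields) are assumptions:

// Sketch: given a license entry with at least an id, fill in name and url
// from the Open Definition licenses service (endpoint path assumed).
async function normalizeLicense(license) {
  if (license.name && license.url) return license; // already complete
  const registry = await (await fetch(
    'http://licenses.opendefinition.org/licenses/groups/all.json')).json();
  const entry = registry[license.id];
  if (!entry) {
    // Unknown id: keep what we have and flag it so the interface can show "unknown".
    return { ...license, name: license.name || 'Unknown license', unknown: true };
  }
  return { id: license.id, name: entry.title, url: entry.url };
}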

Tools page

  • Overview of tools
  • Instructions on how to contribute more

R import

Import data into R from a Data Package, especially a Simple Data Format data package, which has CSV data (see http://data.okfn.org/standards for more on Data Packages and SDF).

Steps:

  • R command that would import given a datapackage.json url (or file on local disk)
  • Bonus would be an R command that used data.okfn.org and the data package name e.g. "gold-prices" or "cpi"

Example - House prices regressed on long interest rate and GDP (and population)

  • R example
  • (Maybe we do in other languages too later ...)

What does it demonstrate: quickly getting data and using it together

Required data

Do we need this quarterly?

Why this analysis

  • People are interested in house prices - why do they go up and down?
  • The classic explanation is other economic variables, e.g. demand (GDP, population and interest rates (mortgages)) and supply (housing stock). Here we will just use the demand variables and see what we find.

Standards page

  • Data packages - general
  • Simple Data Format - specific

Conceptual ideas:

  • Parsimony
  • Progressive degradation (to CSV)
  • etc

Stats tool

Given a Tabular Data Package, compute stats for each file.

A stat looks like:

{
  "count": record-count,
  "fields": [
    {
      "name": field-id,
      "sum": ...,
      "avg": ...,
      ...
    }
  ]
}
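
A minimal Node.js sketch of computing such a stats object for one in-memory table; the row format (array of objects keyed by field name) and the choice of aggregates are assumptions:

// Sketch: compute per-field stats for one tabular resource.
// Non-numeric values are ignored when computing sum and avg.
function computeStats(rows) {
  const fieldNames = rows.length ? Object.keys(rows[0]) : [];
  return {
    count: rows.length,
    fields: fieldNames.map((name) => {
      const values = rows.map((row) => Number(row[name])).filter((v) => !isNaN(v));
      const sum = values.reduce((a, b) => a + b, 0);
      return { name, sum, avg: values.length ? sum / values.length : null };
    })
  };
}

// Usage with a tiny sample table:
console.log(JSON.stringify(computeStats([
  { year: '2012', value: '10' },
  { year: '2013', value: '14' }
]), null, 2));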

Example - Publish Data Package from Google Spreadsheet

More of a blog post than something on the site (publish on e.g. okfnlabs.org). Think this is pretty useful.

Walk through of turning a google spreadsheet into a data package

  • publish to the web
  • get public CSV url for relevant sheet
  • create datapackage.json (cf #28) and publish it somewhere
    • this is an important one - should we make this part of data.okfn.org itself (cf #52 - community catalog)? The simplest option would be a gist, or even pasting the JSON into a Google Doc and getting a raw text URL for that (seems quite clunky!)

cf #30 (google spreadsheet export tool)

Cost: 2-3h
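
A minimal example of what the descriptor from step 3 could look like, pointing straight at the published sheet's CSV URL; the name, the URL pattern, and the use of a url key on the resource are placeholders and may differ by spec version:

{
  "name": "my-sheet-data",
  "title": "My sheet data",
  "resources": [
    {
      "name": "my-sheet-data",
      "format": "csv",
      "url": "https://docs.google.com/spreadsheets/d/<sheet-id>/export?format=csv&gid=0"
    }
  ]
}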

Search datasets

As a User I want to search datasets so that I can explore quickly and find what I want

Two forms:

  • Full search - with filtered results on /data page
  • Quick search with autocomplete taking me to my dataset

UI

  • search box at top of search page
  • search box in navbar?

Datasets Data URLs and API generally

This issue is about the URL / API structure for accessing data (and metadata) from the data packages.

Current Situation

  • For stuff under /data/: /data/{dataset}/datapackage.json and /data/{dataset}.csv
  • For other stuff either at /tools/view/ or /community/ via: http://data.okfn.org/tools/dataproxy/?url={path-to-csv} (though this is not much different from datapipes.okfnlabs.org/csv/raw/?url=.... and leaves much to be desired)

Proposal

/data/ + /community/ data packages

For /data/ and /community/ data packages:

/.../{dataset}/datapackage.json     # the datapackage.json file

## data urls
/.../{dataset}/r/{resource-name-or-order}.{format}  

so e.g.

/.../gdp/r/annual.csv   # resource by name
/.../gdp/r/0.csv        # resource by index

Formats that we should support would be:

  • {format} = csv | json | html | raw (by default)
  • {resource-name} = name as in resources entry. (Also allow order e.g. 1 for first resource, 2 for second resource etc).

Data packages somewhere online

We follow something similar to the other case, but instead of putting the data package name in the URL we move the data package URL to the query string:

/api/datapackage.json?url={datapackage-url}
/api/data/{resource-name-or-index}.{format}?url={datapackage-url}

# e.g. this returns first resource as CSV
/api/data/0.csv?url=https://raw.github.com/datasets/browser-stats/master/datapackage.json
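
A minimal sketch of how the proposed proxy endpoint could be routed, assuming Express and Node 18+ fetch; this is an illustrative sketch, not the existing site code, and only the csv branch is filled in:

// Sketch of the proposed /api/data/{resource-name-or-index}.{format}?url=... endpoint.
const express = require('express');
const app = express();

app.get('/api/data/:resource.:format', async (req, res) => {
  const { resource, format } = req.params;
  const descriptorUrl = req.query.url;
  const descriptor = await (await fetch(descriptorUrl)).json();
  // Resolve the resource by name, or by index if a number was given.
  const entry = /^\d+$/.test(resource)
    ? descriptor.resources[Number(resource)]
    : descriptor.resources.find((r) => r.name === resource);
  if (!entry) return res.status(404).send('resource not found');
  // Resolve the resource location relative to the descriptor URL and proxy it.
  const dataUrl = new URL(entry.path || entry.url, descriptorUrl).href;
  const csv = await (await fetch(dataUrl)).text();
  if (format === 'csv') return res.type('text/csv').send(csv);
  res.status(501).send('format not implemented in this sketch');
});

app.listen(3000);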

Discussion

  • data.json is the serialization in the most obvious way - i.e. convert to a hash
    • alternative provide this in a results style format (and include the schema)
  • Should we use download attribute to set filename ...?
    • Not needed in above
  • (Now supported) How do we handle multiple data resources / files?
    • worry about that in the future - so only support first resource for the moment (this is good as it privileges single resource data packages ...)

Appendix

Alternatives

Alternatively could be:

{dataset}/{filename}.csv
{dataset}/{filename}.json (CORS enabled ...)

Or

{dataset}/data.csv

Think the former is better ...

Include Issues in dataset page

The simplest option is a link to the GitHub issues.

Improvements:

  • Show number of issues
  • Link for a new issue (which correctly states which file we are on, for datasets with multiple files ...)
  • JavaScript popup of the issues ...

Registry/Catalog of Community Data Packages

We want to allow people to register the data packages they've created in a "community" catalog portion of the site.

URL structure like: /community/{username}/{dp-name}

Options for Implementation

Github proxy option (see below)

Note: this option is what is currently (partially) implemented (see comments below for more info).

  • data package at github.com/{username}/{repo} shows up at data.okfn.org/community/{username}/{repo}
  • data.okfn.org/community has some info! (no longer relevant as moved to /data)
  • data.okfn.org/community/{username} gives a nice listing now #111

Full option

  • login approach decided (github/twitter/...)
  • where do we store info (s3, db, ...)?
  • What do you register (path to a datapackage.json ...)?

Other related things

Creation of data packages (for you)

For example:

  • Provide a link to a published Google Doc page and a data package is created for you (where is it stored? GitHub, S3, a database?) and a web page is created

For Launch

Site refactor

  • New front page - #22
  • Move data catalog - #21
  • standards page - #36
  • tools page - #37

Examples & tooling

Integration

  • Load in sqlite - #25 (sqlite)
  • Load into postgresql - #26
  • Google spreadsheet load #24
  • R load - #23

Wherever possible, put these in gists or similar so they are easy to embed and update.

Example Data

  • #35 - all example data

User Story Examples

  • Deflation
  • House prices on GDP, long term interest rates, population - #32
