okfn / dataexplorer

Home Page: http://explorer.okfnlabs.org

dataexplorer's Introduction

View, visualize and transform data in the browser.

Data Explorer is a browser-based (pure HTML + JS) open-source application for exploring and transforming data.

It works well with any source of tabular data. Load and save from multiple sources including Google Spreadsheets, CSV files and GitHub. Graph and map data, and write JavaScript to clean and transform it.

Built on Recline JS.

Use it

Visit http://explorer.okfnlabs.org/

Want to use it locally? Just do "save as" and save the HTML (with all associated files) to your hard disk. Note that for GitHub login to work you will need to open the app at a non-file:/// URL, e.g. http://localhost/dataexplorer.

Developers

Install:

git clone --recursive https://github.com/okfn/dataexplorer

Then just open index.html in your browser!

Note: if you just open index.html most of the app will function but login will not work. For login to work on your local machine you must deploy the app at this specific URL:

http://localhost/src/dataexplorer/

The reason for this is that when the OAuth "app" is registered with GitHub, it must be given a (unique) "callback" URL, the location of the Data Explorer instance to which the OAuth login sends users back, and that URL is the one listed above.

If you are running nginx or apache on your local machine, setting up an alias like this to your local src directory should be easy. Alternatively, if you have Python installed, you can run a simple HTTP server from src's parent directory:

python -m SimpleHTTPServer 80

(On Python 3 the equivalent is: python3 -m http.server 80)

Github Login

Login is via Github using their OAuth method.

Data Explorer is a pure HTML/JS app (no standard backend), and with pure HTML/JS you can't do GitHub OAuth login directly; you need an OAuth proxy in the form of gatekeeper.

Thus, if you want to deploy your own instance of Data Explorer you'll need to set up a new instance of gatekeeper and then change the gatekeeper_url value in src/boot.js.
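As a minimal sketch of that change (the config object shape and the tokenUrl helper are illustrative assumptions; only the gatekeeper_url setting name and the src/boot.js path come from this README):

```javascript
// Sketch only: the surrounding object shape is an assumption; only the
// gatekeeper_url setting name comes from the app.
var config = {
  // Point this at your own deployed gatekeeper instance.
  gatekeeper_url: 'https://my-gatekeeper.example.com'
};

// gatekeeper exchanges the GitHub OAuth "code" for a token at
// /authenticate/:code (hypothetical helper wrapping that endpoint).
function tokenUrl(config, code) {
  return config.gatekeeper_url + '/authenticate/' + code;
}
```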

Understanding the Architecture

To learn more about the code see doc/developers.md

Deploying

For github login you will need to set up your own gatekeeper instance as per above.

License and Credits

The first version of this app was built by Michael Aufreiter and Rufus Pollock. It reused several portions of Prose, including GitHub login and portions of the styling.

Licensed under the MIT license.

All credits as per Recline, plus all the great vendor libraries.

dataexplorer's People

Contributors

aliounedia, andylolz, coderaiser, dieterbe, djw, fyears, michael, mk270, roll, rufuspollock


dataexplorer's Issues

[super] Project objects

Project objects encapsulating a given activity around a dataset

  • Dataset (or source thereof)
  • Scripts
  • Views (graphs, maps etc with config)
  • Export destinations

Script Execution

Execute scripts (part of #35). Strong connection with script editor #45

  • run in sandbox
    • show output
  • Run fully (what do you get to change?)

Context for script editor:

  • _ / lodash
  • $ (not available in web workers)
  • dataset
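One way to sketch the non-sandboxed path (the runScript helper is hypothetical; `new Function` is only a weak sandbox, so a real implementation would use an iframe or web worker as noted above):

```javascript
// Run a user-supplied transform script against a dataset. Compiling with
// an explicit argument list means the script only sees the names we pass
// in; this is NOT a real sandbox, just an illustration of the wiring.
function runScript(source, dataset) {
  var fn = new Function('dataset', source + '\n;return dataset;');
  return fn(dataset);
}

var records = [{ x: 1 }, { x: 2 }];
var out = runScript(
  'dataset.forEach(function (r) { r.y = r.x * 2; });',
  records
);
// out now carries the derived y column
```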

Improve id generation and saving

  • Prepend dataexplorer when saving to localStorage (and don't include it in id itself)
  • Use meaningful names and just append '-{integer}' to avoid duplication ...
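A minimal sketch of this scheme (both function names are hypothetical):

```javascript
// Build a meaningful id from the title and append '-<integer>' only when
// needed to avoid duplicates.
function makeId(title, existingIds) {
  var base = title.toLowerCase().replace(/[^a-z0-9]+/g, '-').replace(/^-|-$/g, '');
  var id = base, n = 1;
  while (existingIds.indexOf(id) !== -1) {
    id = base + '-' + (++n);
  }
  return id;
}

// The 'dataexplorer.' prefix lives only at the localStorage layer, not
// inside the id itself.
function storageKey(id) {
  return 'dataexplorer.' + id;
}
```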

Braindump

Create a Project

As a User I want to create a Project and associate files to that project (online or local) so that I can reload that project later automatically

A project has:

  • ID, title, description, keywords
  • Resources
    • Data Files
    • Data APIs
  • Scripts
  • Apps/Views

Notes:

  • Do we really need multiple files?

Write scripts

As a User I want to write scripts for a project so that I can re-run those scripts later and thereby recreate the results (e.g. a specific visualization)

Specify type metadata

As a User I want to create type information about an object

Export data

As a User I want to export my data to an online service such as the DataHub

Notes:

  • I want my API key to be easily retrieved in a secure way
  • I want to have my login details remembered for next time so I don't have to re-add them ...
  • I want export to happen reasonably quickly and progress to be shown (bulk export)
  • I want upserts to happen when needed when object with that ID already exists
  • I want the connection of this data file with a given online store to be remembered so I can easily repeat this upload later

Share with Others

As a User I want to Share my project with others so that they can see what I have done

[super] Scripts & Scripting including Editor, Storage and Execution

Currently have "transformations". Let's turn this into full-on scripting in the form of full JS.

Implementation

  • Model stuff: scripts on projects etc - #44
  • script editor - #45
  • Script execution - #46
    • sandboxed (in an iframe or webworker)
    • live ...
      • security considerations ...
  • Integrate into UI - (cf #43)

"Transformations" tab does not show until reloaded

Workflow: (google chrome)

Create a project and go to Transform. Then click "My Projects", create a new project, and click "Transform" -> the Transform tab does not show; the list view stays.

Workaround: reload and then select the project from My Projects.

[super] Data Cleaning Examples

General Thoughts

  • Many useful examples require ability to load multiple datasets => we must be able to load remote data as part of the scripting.
    • More strongly: does a focus on a single dataset in a project make sense? Refine does that but ...
  • Geocoding also requires external access

Scripting Library

For scripting to be really useful we need some standard functions

  • plot(dataset, config, name) - #69
  • loadData(urlOrConfig, ...).done(function(dataObject) {}) - could we just use bits of recline atm? - #74
  • geocode - #68
  • saveDataset
  • direct xhr ... - #66

External access

=> We need an ajax library - see #66

Use Cases

Cleaning

Merging / Transforming

Miscellaneous

  • Doing sums ... (how useful ...)
  • Binning (pivot tables ...)

Configurable save

  • Save by default to source from which we loaded
    • Requires that backend is writable - not so for CSV from disk and online CSV (??)
  • Could just allow this to be configurable - so you can choose from github or ...

Auto-Save script to local storage

Auto-save the script (after every keystroke, every 30s, after every run?) to local storage so that if the browser crashes or you close the window you can restore it later.

Save and load scripts

We should be able to save and load clean-up scripts from GitHub

  • Save of scripts should be to a gist by default (later we can add choice of save location)
    • If we loaded the script we should remember that (localStorage or a cookie) and then save back to there
  • Load scripts - specify location similar to specification of data location

Better UX

  • Run on all records notifies of success
  • Save shows spinner while waiting to complete

Persisting per-user Data Explorer config (incl e.g. list of projects)

For time being will just be the list of projects.

{
  projects: [
    {
      id: ...
      gist_id: ...    # maybe the same
      state: active | deleted
    }
    ...
  ]
}

Persistence to special gist

Name: DataExplorerConfig.json

Boot sequence:

  • if not logged in: END
  • (if logged in) get all gists: http://developer.github.com/v3/gists/#list-gists
  • search for DataExplorerConfig.json
    • if it does not exist, create a local DataExplorerConfig model with an empty list of projects
    • if it does exist, load the data and initialize DataExplorerConfig with it

Persistence is automatic on each change ...
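The search step in the boot sequence can be sketched as follows (the files map on a gist comes from the GitHub gists API linked above, though the list response may require a second fetch for file content; the loadConfig helper itself is hypothetical and assumes content is already present):

```javascript
// Given the array returned by listing the user's gists, find the special
// config gist and parse it; fall back to an empty project list.
function loadConfig(gists) {
  var configGist = gists.filter(function (g) {
    return g.files && g.files['DataExplorerConfig.json'];
  })[0];
  if (!configGist) {
    // No config gist yet: start with an empty local model.
    return { projects: [] };
  }
  return JSON.parse(configGist.files['DataExplorerConfig.json'].content);
}
```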

Support gdocs as backend

This would be awesome with gdocs as a backend!

  • Read support is almost trivial (get this straight from recline)
  • Write support - now this is interesting and I (@rgrp) have thought about this a lot - see below for summary

Write Support to GDocs in JS

Google now uses OAuth. This is normally a PITA to support (witness the hassle of getting GitHub login working via OAuth), but Google specifically supports client-side apps:

The Google OAuth 2.0 Authorization Server supports JavaScript applications (JavaScript running in a browser). Like the other scenarios, this one begins by redirecting a browser (popup, or full-page if needed) to a Google URL with a set of query string parameters that indicate the type of Google API access the application requires. Google handles the user authentication, session selection, and user consent. The result is an access token. The client should then validate the token. After validation, the client includes the access token in a Google API request.

To find out more, see the Google docs on OAuth for client-side apps.

Links

Import/Export and Data workflow

This is an overview of how a user typically works with data (see attached diagram). Many formats and data services exist, so a modular architecture is needed to achieve maximum flexibility, which in turn yields the most useful user experience.

Data can generally either be serialized into a file and stored somewhere, or accessed via APIs.

data-workflow-diagram.png

System components

  • Importers/Exporters
    • Backends - transfer format is dictated by API, needs credential management
    • Remote File - probably some proxy needed for cross origin, needs optional credential management
    • Local file - File API, Drag and Drop
    • Clipboard - using textareas or clipboard libraries
  • Service detector
    • For remote services, most comfortable is just to specify URL and system should try to guess service by URL, e.g. if it is a GDocs, CKAN dataset, etc. Then prompt for more details only if necessary.
  • Format detector for deserialization
    • Similar to service detector but for formats
  • Deserializers
    • for each format, e.g. csv, json, xml
    • provide auto-detection with reasonable defaults, ask user only if necessary
  • Serializers
    • text based, e.g. csv, json, xml, (xls?)
    • image based - canvas, svg, export to bitmap, pdf
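By way of illustration, the format-detector component above might look like this (the heuristics and function name are purely illustrative, not from the source):

```javascript
// Guess a serialization format from the filename extension, falling back
// to sniffing the first non-whitespace character of a content sample.
function detectFormat(filename, sample) {
  if (/\.csv$/i.test(filename)) return 'csv';
  if (/\.xml$/i.test(filename)) return 'xml';
  if (/\.json$/i.test(filename)) return 'json';
  var c = (sample || '').trim().charAt(0);
  if (c === '{' || c === '[') return 'json';
  if (c === '<') return 'xml';
  return 'csv'; // reasonable default, per the auto-detection note below
}
```

Per the "ask user only if necessary" principle, the UI would only prompt when both the extension and the sniff are inconclusive.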

Formats

  • text based - csv, json, xml, HTML tables
  • binary - xls, ods
  • maps, graphs - images - png, jpg, svg, pdf

Backends (as in Recline.js)

  • ckan
  • couchdb
  • csv
  • dataproxy
  • elasticsearch
  • gdocs
  • memory
  • solr

CSV uses the memory backend; it is not logically a backend, just a format. Having a file/document backend parameterized by format would therefore be more flexible.

Auto-detection

The user should be asked for additional input as little as possible; reasonable defaults and auto-detection should be used. For exporting, a live preview of part of the data should be available.

Backends need data from the user to specify credentials or format options. That data can be viewed simply as a JSON object, so a general form-building library like Alpaca (based on JSON Schema) or Backbone-Forms could be used. This approach has the advantage that adding a new format or backend does not require writing UI-related code.

Operation stack

Instead of just exporting and saving static data, it would be convenient to share application state (the stack of applied operations, queries, transformations, visualization options, etc.) via URL. This encourages easy sharing, and if the data is corrected at the original source, all derived data appears corrected as well.
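A minimal sketch of such URL-borne state (both helper names are hypothetical):

```javascript
// Serialize the operation stack into a URL fragment so the whole
// application state can be shared as a link.
function encodeState(state) {
  return '#state=' + encodeURIComponent(JSON.stringify(state));
}

// Recover the state from a location hash; null if none is present.
function decodeState(hash) {
  var m = /#state=(.*)/.exec(hash);
  return m ? JSON.parse(decodeURIComponent(m[1])) : null;
}
```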

Related

recline issues:

dataexplorer issues:

Next steps

  • discussion
  • sketches of screens for DataExplorer using above architecture
  • propose class structure for additional Recline.js functionality

Functional tests

Getting to the point where development will become unsustainable without tests ...

Scripts in Model

Part of #35 (scripts & scripting)

Implementation

Should look like a gist pretty much :-)

{
  # aka name (but unique)
  id: ...
  # the content of the script 
  content: 
  language: javascript
}

Possible for the future

  # e.g. transform, standard ...
  type: ...
  # for remote scripts (i.e. ones you import and reuse)
  url: 

Undo support (?)

Doubt this is needed but worth recording anyway.

Not needed because you could just reload the source data and re-run the script ...
