okfn / dataexplorer

Home Page: http://explorer.okfnlabs.org

dataexplorer's Introduction

View, visualize and transform data in the browser.

Data Explorer is a browser-based (pure HTML + JS) open-source application for exploring and transforming data.

It works well with any source of tabular data. Load and save from multiple sources including Google Spreadsheets, CSV files and GitHub. Graph and map data, and write JavaScript to clean and transform it.

Built on Recline JS.

Use it

Visit http://explorer.okfnlabs.org/

Want to use it locally? Just do "save as" and save the HTML (with all associated files) to your hard disk. Note that for GitHub login to work you will need to open the app at a non-file:/// URL, e.g. http://localhost/dataexplorer.

Developers

Install:

git clone --recursive https://github.com/okfn/dataexplorer

Then just open index.html in your browser!

Note: if you just open index.html most of the app will function but login will not work. For login to work on your local machine you must deploy the app at this specific URL:

http://localhost/src/dataexplorer/

The reason for this is that when the OAuth "app" is registered with GitHub, it must be given a (unique) "callback" URL, the location of the Data Explorer instance to which the OAuth login sends users back, and that URL is the one listed above.

If you are running nginx or apache on your local machine, setting up an alias like this to your local src directory should be easy. Alternatively, if you have Python installed, you can run a simple HTTP server from src's parent directory:

python -m SimpleHTTPServer 80

(On Python 3 the equivalent is: python3 -m http.server 80)

Github Login

Login is via Github using their OAuth method.

Data Explorer is a pure HTML/JS app (no standard backend), and with pure HTML/JS you can't do GitHub OAuth login directly; you need an OAuth proxy in the form of gatekeeper.

Thus, if you want to deploy your own instance of Data Explorer you'll need to set up a new instance of gatekeeper and then change the gatekeeper_url value in src/boot.js.
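As a minimal sketch of that change (the config object shape and the tokenUrl helper are illustrative assumptions; only the gatekeeper_url setting name and the src/boot.js path come from this README):

```javascript
// Sketch only: the surrounding object shape is an assumption; only the
// gatekeeper_url setting name comes from the app.
var config = {
  // Point this at your own deployed gatekeeper instance.
  gatekeeper_url: 'https://my-gatekeeper.example.com'
};

// gatekeeper exchanges the GitHub OAuth "code" for a token at
// /authenticate/:code (hypothetical helper wrapping that endpoint).
function tokenUrl(config, code) {
  return config.gatekeeper_url + '/authenticate/' + code;
}
```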

Understanding the Architecture

To learn more about the code see doc/developers.md

Deploying

For github login you will need to set up your own gatekeeper instance as per above.

License and Credits

The first version of this app was built by Michael Aufreiter and Rufus Pollock. It reused several portions of Prose, including GitHub login and portions of the styling.

Licensed under the MIT license.

All credits as per Recline, plus all the great vendor libraries.

dataexplorer's People

Contributors

aliounedia, andylolz, coderaiser, dieterbe, djw, fyears, michael, mk270, roll, rufuspollock


dataexplorer's Issues

[super] Project objects

Project objects encapsulating a given activity around a dataset

  • Dataset (or source thereof)
  • Scripts
  • Views (graphs, maps etc with config)
  • Export destinations

Script Execution

Execute scripts (part of #35). Strong connection with script editor #45

  • run in sandbox
    • show output
  • Run fully (what do you get to change?)

Context for script editor:

  • _ / lodash
  • $ (not available in web workers)
  • dataset
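One way to sketch the non-sandboxed path (the runScript helper is hypothetical; `new Function` is only a weak sandbox, so a real implementation would use an iframe or web worker as noted above):

```javascript
// Run a user-supplied transform script against a dataset. Compiling with
// an explicit argument list means the script only sees the names we pass
// in; this is NOT a real sandbox, just an illustration of the wiring.
function runScript(source, dataset) {
  var fn = new Function('dataset', source + '\n;return dataset;');
  return fn(dataset);
}

var records = [{ x: 1 }, { x: 2 }];
var out = runScript(
  'dataset.forEach(function (r) { r.y = r.x * 2; });',
  records
);
// out now carries the derived y column
```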

Improve id generation and saving

  • Prepend dataexplorer when saving to localStorage (and don't include it in id itself)
  • Use meaningful names and just append '-{integer}' to avoid duplication ...
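A minimal sketch of this scheme (both function names are hypothetical):

```javascript
// Build a meaningful id from the title and append '-<integer>' only when
// needed to avoid duplicates.
function makeId(title, existingIds) {
  var base = title.toLowerCase().replace(/[^a-z0-9]+/g, '-').replace(/^-|-$/g, '');
  var id = base, n = 1;
  while (existingIds.indexOf(id) !== -1) {
    id = base + '-' + (++n);
  }
  return id;
}

// The 'dataexplorer.' prefix lives only at the localStorage layer, not
// inside the id itself.
function storageKey(id) {
  return 'dataexplorer.' + id;
}
```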

Braindump

Create a Project

As a User I want to create a Project and associate files to that project (online or local) so that I can reload that project later automatically

A project has:

  • ID, title, description, keywords
  • Resources
    • Data Files
    • Data APIs
  • Scripts
  • Apps/Views

Notes:

  • Do we really need multiple files?

Write scripts

As a User I want to write scripts for a project so that I can re-run those scripts later and thereby recreate the results (e.g. a specific visualization)

Specify type metadata

As a User I want to create type information about an object

Export data

As a User I want to export my data to an online service such as the DataHub

Notes:

  • I want my API key to be easily retrieved in a secure way
  • I want to have my login details remembered for next time so I don't have to re-add them ...
  • I want export to happen reasonably quickly and progress to be shown (bulk export)
  • I want upserts to happen when needed when object with that ID already exists
  • I want the connection of this data file with a given online store to be remembered so I can easily repeat this upload later

Share with Others

As a User I want to Share my project with others so that they can see what I have done

[super] Scripts & Scripting including Editor, Storage and Execution

Currently have "transformations". Let's turn this into full-on scripting in the form of full JS.

Implementation

  • Model stuff: scripts on projects etc - #44
  • script editor - #45
  • Script execution - #46
    • sandboxed (in an iframe or webworker)
    • live ...
      • security considerations ...
  • Integrate into UI - (cf #43)

"Transformations" tab does not show until reloaded

Workflow: (google chrome)

Create a project and go to Transform. Then click "My Projects", create a new project, and click "Transform" -> the Transform tab does not show; the list view stays.

Workaround: reload and then select the project from My Projects.

[super] Data Cleaning Examples

General Thoughts

  • Many useful examples require ability to load multiple datasets => we must be able to load remote data as part of the scripting.
    • More strongly: does a focus on a single dataset in a project make sense? Refine does that but ...
  • Geocoding also requires external access

Scripting Library

For scripting to be really useful we need some standard functions

  • plot(dataset, config, name) - #69
  • loadData(urlOrConfig, ...).done(function(dataObject) {}) - could we just use bits of recline atm? - #74
  • geocode - #68
  • saveDataset
  • direct xhr ... - #66

External access

=> We need an ajax library - see #66

Use Cases

Cleaning

Merging / Transforming

Miscellaneous

  • Doing sums ... (how useful ...)
  • Binning (pivot tables ...)

Configurable save

  • Save by default to source from which we loaded
    • Requires that backend is writable - not so for CSV from disk and online CSV (??)
  • Could just allow this to be configurable - so you can choose from github or ...

Auto-Save script to local storage

Auto-save the script (after every keystroke, every 30s, after every run?) to local storage so that if the browser crashes or you close the window you can restore it later.

Save and load scripts

We should be able to save and load clean-up scripts from GitHub

  • Save of scripts should be to a gist by default (later we can add choice of save location)
    • If we loaded the script we should remember that (localStorage or a cookie) and then save back to there
  • Load scripts - specify location similar to specification of data location

Better UX

  • Run on all records notifies of success
  • Save shows spinner while waiting to complete

Persisting per-user Data Explorer config (incl e.g. list of projects)

For time being will just be the list of projects.

{
  projects: [
    {
      id: ...
      gist_id: ...    # maybe the same
      state: active | deleted
    }
    ...
  ]
}

Persistence to special gist

Name: DataExplorerConfig.json

Boot sequence:

  • if not logged in: END
  • (if logged in) get all gists: http://developer.github.com/v3/gists/#list-gists
  • search for DataExplorerConfig.json
    • if it does not exist, create a local DataExplorerConfig model with an empty list of projects
    • if it does exist, load the data and initialize DataExplorerConfig with it

Persistence is automatic on each change ...
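The search step in the boot sequence can be sketched as follows (the files map on a gist comes from the GitHub gists API linked above, though the list response may require a second fetch for file content; the loadConfig helper itself is hypothetical and assumes content is already present):

```javascript
// Given the array returned by listing the user's gists, find the special
// config gist and parse it; fall back to an empty project list.
function loadConfig(gists) {
  var configGist = gists.filter(function (g) {
    return g.files && g.files['DataExplorerConfig.json'];
  })[0];
  if (!configGist) {
    // No config gist yet: start with an empty local model.
    return { projects: [] };
  }
  return JSON.parse(configGist.files['DataExplorerConfig.json'].content);
}
```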

Support gdocs as backend

This would be awesome with gdocs as a backend!

  • Read support is almost trivial (get this straight from recline)
  • Write support - now this is interesting and I (@rgrp) have thought about this a lot - see below for summary

Write Support to GDocs in JS

Google now uses OAuth. This is normally a PITA to support (witness the hassle of getting GitHub login working via OAuth), but Google specifically supports client-side apps:

The Google OAuth 2.0 Authorization Server supports JavaScript applications (JavaScript running in a browser). Like the other scenarios, this one begins by redirecting a browser (popup, or full-page if needed) to a Google URL with a set of query string parameters that indicate the type of Google API access the application requires. Google handles the user authentication, session selection, and user consent. The result is an access token. The client should then validate the token. After validation, the client includes the access token in a Google API request.

To find out more, see the Google docs on OAuth for client-side apps.

Links

Import/Export and Data workflow

This is an overview of how a user typically works with data (see attached diagram). Many formats and data services exist, so a modular architecture is needed to achieve maximum flexibility, which in turn yields the most useful user experience.

Data can generally either be serialized into a file and stored somewhere, or accessed via APIs.

data-workflow-diagram.png

System components

  • Importers/Exporters
    • Backends - transfer format is dictated by API, needs credential management
    • Remote File - probably some proxy needed for cross origin, needs optional credential management
    • Local file - File API, Drag and Drop
    • Clipboard - using textareas or clipboard libraries
  • Service detector
    • For remote services, most comfortable is just to specify URL and system should try to guess service by URL, e.g. if it is a GDocs, CKAN dataset, etc. Then prompt for more details only if necessary.
  • Format detector for deserialization
    • Similar to service detector but for formats
  • Deserializers
    • for each format, e.g. csv, json, xml
    • provide auto-detection with reasonable defaults, ask user only if necessary
  • Serializers
    • text based, e.g. csv, json, xml, (xls?)
    • image based - canvas, svg, export to bitmap, pdf
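By way of illustration, the format-detector component above might look like this (the heuristics and function name are purely illustrative, not from the source):

```javascript
// Guess a serialization format from the filename extension, falling back
// to sniffing the first non-whitespace character of a content sample.
function detectFormat(filename, sample) {
  if (/\.csv$/i.test(filename)) return 'csv';
  if (/\.xml$/i.test(filename)) return 'xml';
  if (/\.json$/i.test(filename)) return 'json';
  var c = (sample || '').trim().charAt(0);
  if (c === '{' || c === '[') return 'json';
  if (c === '<') return 'xml';
  return 'csv'; // reasonable default, per the auto-detection note below
}
```

Per the "ask user only if necessary" principle, the UI would only prompt when both the extension and the sniff are inconclusive.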

Formats

  • text based - csv, json, xml, HTML tables
  • binary - xls, ods
  • maps, graphs - images - png, jpg, svg, pdf

Backends (as in Recline.js)

  • ckan
  • couchdb
  • csv
  • dataproxy
  • elasticsearch
  • gdocs
  • memory
  • solr

CSV uses the memory backend; it is not logically a backend, just a format. Having a file/document backend parameterized by format would therefore be more flexible.

Auto-detection

The user should be asked for additional input as little as possible; reasonable defaults and auto-detection should be used. For exporting, a live preview of part of the data should be available.

Backends need data from the user to specify credentials or format options. That data can be viewed simply as a JSON object, so a general form-building library like Alpaca (based on JSON Schema) or Backbone-Forms could be used. This approach has the advantage that adding a new format or backend does not require writing UI-related code.

Operation stack

Instead of just exporting and saving static data, it would be convenient to share application state (the stack of applied operations, queries, transformations, visualization options, etc.) via URL. This encourages easy sharing, and if the data is corrected at the original source, all derived data appears corrected as well.
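A minimal sketch of such URL-borne state (both helper names are hypothetical):

```javascript
// Serialize the operation stack into a URL fragment so the whole
// application state can be shared as a link.
function encodeState(state) {
  return '#state=' + encodeURIComponent(JSON.stringify(state));
}

// Recover the state from a location hash; null if none is present.
function decodeState(hash) {
  var m = /#state=(.*)/.exec(hash);
  return m ? JSON.parse(decodeURIComponent(m[1])) : null;
}
```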

Related

recline issues:

dataexplorer issues:

Next steps

  • discussion
  • sketches of screens for DataExplorer using above architecture
  • propose class structure for additional Recline.js functionality

Functional tests

Getting to the point where development will become unsustainable without tests ...

Scripts in Model

Part of #35 (scripts & scripting)

Implementation

Should look like a gist pretty much :-)

{
  # aka name (but unique)
  id: ...
  # the content of the script 
  content: 
  language: javascript
}

Possible for the future

  # e.g. transform, standard ...
  type: ...
  # for remote scripts (i.e. ones you import and reuse)
  url: 

Undo support (?)

Doubt this is needed but worth recording anyway.

Not needed because you could just reload the source data and re-run the script ...
