rufuspollock-okfn / dataconverters

Python library and command line tool for converting data from one format to another

Home Page: http://okfnlabs.org/dataconverters/

Languages: Python 86.65%, HTML 13.35%
Topics: convert-data, python, python-library

dataconverters's People

Contributors

domoritz, gr33ndata, nigelbabu, nmashton, rufuspollock

dataconverters's Issues

Provide guessed types

In order to further process some data, it is necessary that the types of the columns are in the results.
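
As an illustration of what returning guessed types could involve, here is a minimal sketch. It is not the dataconverters implementation: the function names `guess_type` and `guess_column_types` and the type labels "integer", "float" and "string" are assumptions made for this example.

```python
import csv
import io


def guess_type(values):
    """Return 'integer', 'float' or 'string' for a list of raw string cells."""
    non_empty = [v for v in values if v.strip() != ""]
    if not non_empty:
        return "string"
    # Try the strictest type first; fall through to the next on failure
    for name, cast in (("integer", int), ("float", float)):
        try:
            for v in non_empty:
                cast(v)
            return name
        except ValueError:
            continue
    return "string"


def guess_column_types(text):
    """Guess one type per column of CSV text whose first row is a header."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    return [
        {"id": name, "type": guess_type([row[i] for row in body])}
        for i, name in enumerate(header)
    ]


sample = "place,temperature,ratio\nBerkeley,5,0.5\nOakland,7,1\n"
guessed = guess_column_types(sample)
```

The result is a list of field descriptors, so the guessed types can travel alongside the converted data.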

Collect precise user stories

User stories

UK Police Population per capita

Wanted to get population per UK police force. Found:

http://www.homeoffice.gov.uk/publications/science-research-statistics/research-statistics/crime-research/population-estimates/pfa-la-pop-house-nos-xls

Leads to http://www.homeoffice.gov.uk/publications/science-research-statistics/research-statistics/crime-research/population-estimates/pfa-la-pop-house-nos-xls?view=Binary

I would like to get this as CSV or JSON (in fact just the second worksheet, though you don't know that until you open it ...). I would like to either upload this to the service or point at the URL and have it converted (BTW: it is not clear from the extension whether this is xls or xlsx ...)

Next user story ...

Not yet user stories ...

CKAN

  • CSV -> JSON (dataproxy)
  • XLS -> JSON (dataproxy)
  • Shapefile -> GeoJSON
  • KML -> GeoJSON (ask @amercader?)

Fusion Tables

  • Shapefile

CSV -> JSON

But what JSON schema exactly :-)

Output Formats

The question of output format is essentially a question of representation of tabular data in JSON and is therefore much more general than this specific converter. We enumerate various options. Issues to consider:

  • Size Efficiency
  • Schema information e.g. field order, field type
  • Support for multiple header rows
  • Ease of use in JS or elsewhere (e.g. shipping a hash versus an array reduces effort for the consumer but increases size)

We are adopting Format 3

Format 1 - Array of Arrays

[
  [ xxx, yyy, zzz, ...],
  [...]
]

Format 2 - Array of Hashes

[
   { fieldname: value, ... },
   {  ... }
]

Format 3 - Data + Schema

This corresponds to adding some kind of explicit field / schema information to previous 2 formats:

{
  fields: [
   { field-descriptor },
   { field-descriptor }
  ],
  data: [
   as per either format 1 or format 2
  ]
}

A field descriptor is:

{
  id: field name, or column number if the field name is blank
  type: type of field
  label: (optional) human readable label if different from id (usually won't be here so should be omitted!)
  format: (optional) additional format info about field e.g. data format if known
}
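
To make Format 3 concrete, the sketch below builds it from CSV text. It is illustrative only: `csv_to_format3` is a hypothetical name, every field is given the placeholder type "string", and blank header cells fall back to a zero-based column number (the numbering base is an assumption, since the description above does not specify one).

```python
import csv
import io
import json


def csv_to_format3(text, as_hashes=True):
    """Build a {fields, data} structure from CSV text with a header row."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    # Fall back to the (zero-based) column number when a header cell is blank
    ids = [name if name.strip() else str(i) for i, name in enumerate(header)]
    # No type guessing here: every field is labelled "string" as a placeholder
    fields = [{"id": fid, "type": "string"} for fid in ids]
    if as_hashes:
        data = [dict(zip(ids, row)) for row in body]  # Format 2 rows
    else:
        data = body  # Format 1 rows
    return {"fields": fields, "data": data}


doc = csv_to_format3("date,,temperature\n2011-01-03,Berkeley,5\n")
as_json = json.dumps(doc)
```

With `as_hashes=False` the `data` member holds plain arrays, which is smaller on the wire because the field names appear only once, in `fields`.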

Implementation Notes

Reach out to potential users

eg. to Crisis Response team -- can we be put in touch? Somebody who has a need for data conversion as a service, and potential opinions about the API presentation.

Establish test data and framework setup

Data

Suggest a dedicated S3 or Google Storage bucket

  • Better than Google Drive due to size (?)
  • Folder structure
  • See #21 - Shapefile
  • See #20 - CSV

Test framework

Separate from core converters - and ensure runs over http (?)

Defining a Web-Service API

What should the API of the conversion web-service look like?

  • Core arguments (input file, input format, output format)
    • Do you always have to post the file content or can you provide a URL (if already online)?
  • Callback versus URL versus simple (streaming) response
  • Extra arguments (specific to a converter) - e.g. projections for geo conversions
  • Response formats: In theory this is the name of the converter. However we may want to limit output formats to a very limited set e.g. CSV and JSON in which case this could be an argument (?)

Spec

Incoming data

May either be:

  • POSTed in the body of the request (?)
  • Provided via a ?url query parameter that points to the file online
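
The two input paths could be unified behind one helper that always hands the converter a file-like object. This is a sketch under assumptions: `open_input` is a hypothetical name, and rejecting requests that supply both a body and a URL is a design choice made here, not something the spec decides.

```python
import io
from urllib.request import urlopen


def open_input(body=None, url=None):
    """Return a binary file-like object for the data to convert."""
    if body is not None and url is not None:
        raise ValueError("provide either a request body or a url, not both")
    if body is not None:
        # POSTed bytes: wrap them so converters can call .read()
        return io.BytesIO(body)
    if url is not None:
        # Remote file: performs a network request; the caller should close it
        return urlopen(url)
    raise ValueError("no input: POST the data or pass a url")


stream = open_input(body=b"a,b\n1,2\n")
content = stream.read()
```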

Request arguments

  • Defined per converter

Response formats

  • Defined by the converter (?)

Support for CORS and JSONP

  • Support CORS (for python code see here)
  • Support JSONP via an optional ?callback=name parameter (for python code see here)
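
A framework-independent sketch of the CORS and JSONP handling follows. The name `make_response`, the fully open `Access-Control-Allow-Origin: *` policy and the callback-name validation rule are all assumptions for illustration; a real service would wire this into its web framework's response object.

```python
import json
import re

# Restrict callback names to identifier-like strings to avoid script injection
_CALLBACK_RE = re.compile(r"[A-Za-z_$][A-Za-z0-9_$.]*")


def make_response(payload, callback=None):
    """Return (body, headers) for a JSON payload, wrapped as JSONP if asked."""
    body = json.dumps(payload)
    headers = {"Access-Control-Allow-Origin": "*"}  # assumed open CORS policy
    if callback is None:
        headers["Content-Type"] = "application/json"
        return body, headers
    if not _CALLBACK_RE.fullmatch(callback):
        raise ValueError("invalid callback name")
    headers["Content-Type"] = "application/javascript"
    return "%s(%s);" % (callback, body), headers


plain_body, plain_headers = make_response({"a": 1})
jsonp_body, jsonp_headers = make_response({"a": 1}, callback="cb")
```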

Webhooks / Callbacks

Do via headers as per Gut

Errors

...


Existing work

Gut

https://github.com/maxogden/gut

A gut web service (essentially a WebHook) has the following attributes:

  • accepts POST data containing a file or other data (such as CSV)
  • turns the incoming data into some other data format (e.g. an array of JSON objects)
  • sends the converted data back to the URL specified by the incoming request's HTTP "X-callback" header

dataproxy

https://github.com/okfn/dataproxy

  • Pass in data: via query parameters (you don't POST data; rather, it is loaded from the web, though this could easily be changed)
  • Returns the converted data in the response

Correct JSON output format

This corresponds to adding some kind of explicit field / schema information to previous 2 formats:

{
  fields: [
   { field-descriptor },
   { field-descriptor }
  ],
  data: [
   as per either format 1 or format 2
  ]
}

A field descriptor is:

{
  id: field name, or column number if the field name is blank
  type: type of field
  label: (optional) human readable label if different from id (usually won't be here so should be omitted!)
  format: (optional) additional format info about field e.g. data format if known
}

XML -> CSV

XML is very generic. We'd need a schema mapping (XSLT)?

Service front page

Simple elegant front page.

Question: do we provide a user interface for providing files? (RP: I'd say yes)

  • Drag and drop
  • File upload
  • URL provision

This does get somewhat complicated once the upload forms have to support all the relevant options though ...

Convert data if types have been guessed (csv parse)

So instead of

{u'date': u'2011-01-03', u'place': u'Berkeley', u'temperature': u'5'}

We would get

{u'date': '2011-01-03', u'place': u'Berkeley', u'temperature': 5}

Or some unambiguous string representation if converted to json.

RP: note that @domoritz originally suggested having date as datetime object. I think isoformat date(time) is better.
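
A minimal sketch of such a conversion, assuming the guessed types arrive as a mapping from column name to one of "integer", "float" or "date" (these labels and the name `cast_row` are hypothetical). Following the note above, dates are validated but kept as ISO format strings rather than turned into datetime objects.

```python
import datetime


def cast_row(row, types):
    """Cast the string cells of one row according to a {column: type} mapping."""
    casters = {
        "integer": int,
        "float": float,
        # Validate the date but keep it as an ISO 8601 string
        "date": lambda v: datetime.date.fromisoformat(v).isoformat(),
    }
    out = {}
    for key, value in row.items():
        # Columns without a known type are passed through unchanged
        cast = casters.get(types.get(key, "string"))
        out[key] = cast(value) if cast else value
    return out


raw = {"date": "2011-01-03", "place": "Berkeley", "temperature": "5"}
converted = cast_row(raw, {"date": "date", "temperature": "integer"})
```

Keeping dates as strings means the converted row can be passed to a JSON encoder without a custom serializer.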

Shapefile -> Records (GeoJSON)

Issue for discussing a shapefile to GeoJSON.

Notes

A shapefile is actually comprised of at least 3 (and usually 4) files (.shp, .shx, .dbf, and .prj). All of these are needed.

As such if a stream is provided it should be to a zipfile containing all of the relevant files. Otherwise we expect a path to the .shp file and assume all the other files are in that directory.
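
Before handing a zipped shapefile to any reader, the stream can be checked for the expected parts. The sketch below uses only the standard library; `find_shapefile_parts` is a hypothetical name, and treating .prj as optional while requiring the other three files is an assumption made for this example.

```python
import io
import os
import zipfile

REQUIRED = {".shp", ".shx", ".dbf"}  # .prj is treated as optional here


def find_shapefile_parts(stream):
    """Map extension -> member name for shapefile parts found in a zip stream."""
    parts = {}
    with zipfile.ZipFile(stream) as archive:
        for name in archive.namelist():
            ext = os.path.splitext(name)[1].lower()
            if ext in REQUIRED or ext == ".prj":
                parts[ext] = name
    missing = REQUIRED - set(parts)
    if missing:
        raise ValueError("zip is missing: " + ", ".join(sorted(missing)))
    return parts


# Usage: build a small in-memory zip with empty members and inspect it
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as archive:
    for name in ("kc.shp", "kc.shx", "kc.dbf", "kc.prj"):
        archive.writestr(name, b"")
buf.seek(0)
parts = find_shapefile_parts(buf)
```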

TODO: what about URLs? Should we support cases where shp files are in an online directory?

Implementation Options

Via GDAL

ogr2ogr -f GeoJSON kc.json kc.shp

Could GDAL be used on a server? (Probably not on app-engine, Heroku etc)

Even better is to use the Fiona wrapper for GDAL: http://toblerity.github.com/fiona/

pyshp (Python)

Pure python code for reading and writing shape files.

http://code.google.com/p/pyshp/

Looks pretty good, see, for examples: http://code.google.com/p/pyshp/wiki/PyShpDocs

Note: not clear what its projection support is (is this a problem for reading stuff ...?)

Some work would need to be done to convert the source data into nice geojson style python dicts etc.

Take a look at https://github.com/okfn/dataproxy - quite a bit to reuse

Define a converter API

This issue is about defining a library API for converters.

Python Proposal

def convert(stream, metadata):
    """
    :param stream: file-like stream (only needs a read method)
    :param metadata: metadata about the stream that could be relevant,
        e.g. the types of the various fields (in case of e.g. CSV -> JSON)
    :return: (data, metadata, errors); data is a file-like stream in most
        cases but could also be e.g. JSON
    """

  • What about allowing URLs rather than just a stream?
  • Is synchronous (not asynchronous)
  • How are errors handled (major errors throw exceptions but what about e.g. validation errors?)
    • Major errors that prevent continuing operation generate an Exception
    • Minor errors, e.g. fields that don't validate (if that is relevant), go into the errors return value; in general at this point errors should be exceptions

class Converter:
    def convert(self, stream, metadata):
        return (data, metadata, errors)
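
To show how the proposed return triple might behave in practice, here is a sketch of one concrete converter. `CsvToJsonConverter`, the "encoding" metadata key and the wording of the error messages are all hypothetical. It collects minor errors in the returned list to demonstrate that option, although the notes above lean towards raising exceptions for now.

```python
import csv
import io
import json


class CsvToJsonConverter:
    """Hypothetical converter following the (data, metadata, errors) proposal."""

    def convert(self, stream, metadata=None):
        metadata = dict(metadata or {})
        text = stream.read()
        if isinstance(text, bytes):
            text = text.decode(metadata.get("encoding", "utf-8"))
        rows = list(csv.reader(io.StringIO(text)))
        if not rows:
            # Major error: nothing to convert, so stop with an exception
            raise ValueError("empty input")
        header, errors, records = rows[0], [], []
        for row_no, row in enumerate(rows[1:], start=2):
            if len(row) != len(header):
                # Minor error: record it and carry on with the next row
                errors.append("row %d: expected %d cells, got %d"
                              % (row_no, len(header), len(row)))
                continue
            records.append(dict(zip(header, row)))
        metadata["fields"] = [{"id": name} for name in header]
        # Return the output as a file-like stream, mirroring the input side
        return io.StringIO(json.dumps(records)), metadata, errors


data, meta, errors = CsvToJsonConverter().convert(io.StringIO("a,b\n1,2\n3\n"))
records = json.loads(data.read())
```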

Reference material

Tika

See esp http://tika.apache.org/1.2/parser.html which has this as interface for the libraries

void parse(
    InputStream stream, ContentHandler handler, Metadata metadata,
    ParseContext context) throws IOException, SAXException, TikaException;

Google Conversion API

https://developers.google.com/appengine/docs/python/conversion/overview

Heads Up Post

Brief intro plus the slide-deck?

Plus email to relevant lists
