rufuspollock-okfn / dataconverters

Python library and command line tool for converting data from one format to another

Home Page: http://okfnlabs.org/dataconverters/

convert-data python python-library

dataconverters's Issues

Provide guessed types

In order to further process some data, it is necessary that the types of the columns are in the results.

User stories

UK Police Population per capita

Wanted to get population per UK police force. Found

http://www.homeoffice.gov.uk/publications/science-research-statistics/research-statistics/crime-research/population-estimates/pfa-la-pop-house-nos-xls

Leads to http://www.homeoffice.gov.uk/publications/science-research-statistics/research-statistics/crime-research/population-estimates/pfa-la-pop-house-nos-xls?view=Binary

I would like to get this as CSV or JSON (in fact just the second worksheet - though you don't know that until you open ...). I would like to either upload this to the service or point the service at the URL and have it converted (BTW: it is not clear from the extension whether this is xls or xlsx ...)

Next user story ...

Not yet user stories ...

CKAN

CSV -> JSON (dataproxy)
XLS -> JSON (dataproxy)
Shapefile -> GeoJSON
KML -> GeoJSON (ask @amercader?)

Fusion Tables

Shapefile

Guess file type for files POSTed to converters

The file types for files POSTed to the converters API can be guessed from the mime type.
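
A minimal sketch of how the guess could work, assuming the service has access to the request's Content-Type header and optionally a filename. The function name guess_format and the MIME_TO_FORMAT table are hypothetical and not part of dataconverters.

```python
import mimetypes

# Hypothetical lookup table: MIME type -> input format name used by a converter.
MIME_TO_FORMAT = {
    "text/csv": "csv",
    "application/vnd.ms-excel": "xls",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": "xlsx",
    "application/json": "json",
}


def guess_format(content_type=None, filename=None):
    """Return a format name, or None when no confident guess is possible."""
    if content_type:
        # Drop parameters such as "; charset=utf-8" before the lookup.
        mime = content_type.split(";")[0].strip().lower()
        if mime in MIME_TO_FORMAT:
            return MIME_TO_FORMAT[mime]
    if filename:
        # Fall back to the extension-based guess from the standard library.
        mime, _encoding = mimetypes.guess_type(filename)
        if mime in MIME_TO_FORMAT:
            return MIME_TO_FORMAT[mime]
    return None
```

Returning None rather than raising leaves it to the caller to decide whether to reject the upload or ask for an explicit format argument.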

POST file to the API

The API should also accept file content POSTed to it.

CSV -> RDF

Info wanted!

Obtain file-format frequency table from Sree's research

Useful for prioritisation and triage of other issues.

Headers list does not correspond to keys used when there are blank header columns.
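
One possible fix, sketched under the assumption that blank headers should be replaced by a name derived from the column position, and that the same normalised list is then used both as the headers list and as the record keys. The names normalise_headers, rows_to_dicts and the column_N pattern are illustrative only.

```python
def normalise_headers(raw_headers):
    """Give every column a non-empty, unique id."""
    headers = []
    seen = set()
    for index, name in enumerate(raw_headers):
        name = (name or "").strip()
        if not name:
            # Assumed naming scheme for blank header cells (1-based position).
            name = "column_%d" % (index + 1)
        candidate = name
        suffix = 2
        while candidate in seen:
            # Disambiguate repeated header names.
            candidate = "%s_%d" % (name, suffix)
            suffix += 1
        seen.add(candidate)
        headers.append(candidate)
    return headers


def rows_to_dicts(raw_headers, rows):
    """Return (headers, records) where record keys always match headers."""
    headers = normalise_headers(raw_headers)
    records = [dict(zip(headers, row)) for row in rows]
    return headers, records
```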

Allow declaration of target format in dataconverters API

The dataconverters library doesn't allow defining target format yet.

CSV -> JSON

But what JSON schema exactly :-)

Output Formats

The question of output format is essentially a question of representation of tabular data in JSON and is therefore much more general than this specific converter. We enumerate various options. Issues to consider:

Size Efficiency
Schema information e.g. field order, field type
Support for multiple header rows
Ease of use in JS or elsewhere (e.g. shipping a hash versus an array reduces effort for consumer but increases size)

We are adopting Format 3

Format 1 - Array of Arrays

[
  [ xxx, yyy, zzz, ...]
  [...]
]

Format 2 - Array of Hashes

[
   { fieldname: value, ... }
   {  ... }
]

Format 3: Data + Schema

This corresponds to adding some kind of explicit field / schema information to the previous two formats:

{
  fields: [
   { field-descriptor },
   { field-descriptor }
  ],
  data: [
   as per either format 1 or format 2
  ]
}

A field descriptor is:

{
  id: field-name, or column_no if the field name is blank
  type: type of field
  label: (optional) human readable label if different from id (usually won't be here so should be omitted!)
  format: (optional) additional format info about field e.g. data format if known
}

type should be as per http://reclinejs.com/docs/models.html#types
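
A rough sketch of building a Format 3 document from a header row and data rows. The helper name to_format3, the column_N fallback for blank headers and the default type "string" are assumptions made for illustration; the real type names should come from the Recline list linked above.

```python
import json


def to_format3(headers, rows, types=None, as_hashes=True):
    """Wrap tabular data in a {fields: [...], data: [...]} structure."""
    types = types or {}
    fields = []
    for index, name in enumerate(headers):
        # Fall back to the column number when the header cell is blank.
        field_id = name if name else "column_%d" % (index + 1)
        # "string" as the default type is an assumption, not part of the spec.
        fields.append({"id": field_id, "type": types.get(field_id, "string")})
    ids = [field["id"] for field in fields]
    if as_hashes:
        data = [dict(zip(ids, row)) for row in rows]  # Format 2 rows
    else:
        data = [list(row) for row in rows]  # Format 1 rows
    return {"fields": fields, "data": data}


if __name__ == "__main__":
    doc = to_format3(["place", ""], [["Berkeley", 5]], types={"column_2": "integer"})
    print(json.dumps(doc, indent=2))
```

Optional label and format keys are left out here, in line with the note that label should usually be omitted.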

Implementation Notes

For input parsing we probably want https://github.com/okfn/messytables

Reach out to potential users

eg. to Crisis Response team -- can we be put in touch? Somebody who has a need for data conversion as a service, and potential opinions about the API presentation.

Sketch out solution architecture

Had started this drawing (but this is for complex case):

Review Apache Tika and learn from it

See esp http://tika.apache.org/1.2/parser.html which has this as interface for the libraries

void parse(
    InputStream stream, ContentHandler handler, Metadata metadata,
    ParseContext context) throws IOException, SAXException, TikaException;

PDF -> ??

existing tools

https://github.com/euske/pdfminer

Establish test data and framework setup

Data

Suggest a dedicated s3 or google storage bucket

Better than google drive due to size (?)
Folder structure
See #21 - Shapefile
See #20 - CSV

Test framework

Separate from core converters - and ensure runs over http (?)

Defining a Web-Service API

What should the API of the conversion web-service look like?

Core arguments (input file, input format, output format)
- Do you always have to post the file content or can you provide a URL (If already online)
Callback versus URL versus simple (streaming) response
Extra arguments (specific to a converter) - e.g. projections for geo conversions
Response formats: In theory this is the name of the converter. However we may want to limit output formats to a very limited set e.g. CSV and JSON in which case this could be an argument (?)

Spec

Incoming data

May either be:

POSTed in the body of the request (?)
Provided via ?url attribute that points to the file online

Request arguments

Defined per converter

Response formats

Defined by the converter (?)

Support for CORS and JSONP

Support CORS (for python code see here)
Support JSONP via an optional ?callback=name parameter (for python code see here)
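
As a framework-independent sketch (not the code behind the links above), the response could be rendered along these lines. The function name render_response, the wildcard CORS origin and the callback-name pattern are assumptions.

```python
import json
import re

# Conservative, assumed pattern for acceptable JSONP callback names.
_CALLBACK_PATTERN = re.compile(r"[A-Za-z_$][A-Za-z0-9_$.]*")


def render_response(payload, callback=None):
    """Return (body, headers) for a JSON or JSONP response."""
    # Allowing every origin is an assumption; a deployment may want to restrict it.
    headers = {"Access-Control-Allow-Origin": "*"}
    body = json.dumps(payload)
    if callback is None:
        headers["Content-Type"] = "application/json"
        return body, headers
    if not _CALLBACK_PATTERN.fullmatch(callback):
        # Reject anything that could inject script into the response.
        raise ValueError("invalid JSONP callback name: %r" % callback)
    headers["Content-Type"] = "application/javascript"
    return "%s(%s);" % (callback, body), headers
```

Validating the callback name matters because the value is echoed back verbatim into executable JavaScript.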

Webhooks / Callbacks

Do via headers as per Gut

Errors

...

Existing work

Gut

https://github.com/maxogden/gut

A gut web service (essentially a WebHook) has the following attributes:

accepts POST data containing a file or other data (such as CSV)
turns the incoming data into some other data format (e.g. an array of JSON objects)
sends the converted data back to the URL specified by the incoming request's HTTP "X-callback" header

dataproxy

https://github.com/okfn/dataproxy

Pass in data: via Query parameters (you don't POST data - rather it loads off the web -- this could be easily changed)
Returns the converted data in the response body

XLS -> JSON

Just one sheet or all sheets? (dataproxy has ?worksheet= parameter)
What's the output format?
xlsx vs xls

Known Implementations

https://github.com/stephenjudkins/poisauce - a gut implementation
https://github.com/okfn/dataproxy

Libraries

xlrd (python)
POI (Java)
messytables (builds on xlrd)

Correct JSON output format

This corresponds to adding some kind of explicit field / schema information to previous 2 formats:

{
  fields: [
   { field-descriptor },
   { field-descriptor }
  ],
  data: [
   as per either format 1 or format 2
  ]
}

A field descriptor is:

{
  id: field-name, or column_no if the field name is blank
  type: type of field
  label: (optional) human readable label if different from id (usually won't be here so should be omitted!)
  format: (optional) additional format info about field e.g. data format if known
}

type should be as per http://reclinejs.com/docs/models.html#types

XML -> CSV

XML is very generic. We'd need a schema mapping (XSLT)?

Service front page

Simple elegant front page.

Question: do we provide a user interface for providing files (RP: I'd say yes)

Drag and drop
File upload
URL provision

This does get somewhat complicated once the upload forms have to support all the relevant options though ...

DB2 -> CSV (SDF)

Convert data if types have been guessed (csv parse)

So instead of

{u'date': u'2011-01-03', u'place': u'Berkeley', u'temperature': u'5'}

We would get

{u'date': '2011-01-03', u'place': u'Berkeley', u'temperature': 5}

Or some unambiguous string representation if converted to json.

RP: note that @domoritz originally suggested having date as datetime object. I think isoformat date(time) is better.
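
A small sketch of the casting step, assuming guessed types arrive as a mapping from field name to a type name. The type names "integer", "float" and "date", and the helpers cast_value and cast_record, are hypothetical. Dates are validated and kept as ISO format strings, following the preference stated above. The sketch is written for Python 3, so the u'' prefixes disappear.

```python
import datetime


def cast_value(value, type_name):
    """Cast one parsed CSV cell according to its guessed type."""
    if value is None or value == "":
        return None
    if type_name == "integer":
        return int(value)
    if type_name == "float":
        return float(value)
    if type_name == "date":
        # Assumes the input is already YYYY-MM-DD; raises ValueError otherwise.
        parsed = datetime.datetime.strptime(value, "%Y-%m-%d").date()
        return parsed.isoformat()
    # Unknown or string types are passed through unchanged.
    return value


def cast_record(record, types):
    """Apply cast_value to every field of a record."""
    return {key: cast_value(val, types.get(key, "string")) for key, val in record.items()}
```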

Add JSONP support

Are G***** Corporation happy for us to mention their involvement?

Required to handle #27.

Basic stats on data type frequency

Count by file type using search filters
Look at data.gov.XX (see http://dashboard.opengovernmentdata.org/)

Project intro / summary slidedeck

See the In progress slide deck

Shapefile -> Records (GeoJSON)

Issue for discussing a shapefile to GeoJSON.

Notes

A shapefile actually consists of at least 3 (and usually 4) files (.shp, .shx, .dbf, and .prj). All of these are needed.

As such if a stream is provided it should be to a zipfile containing all of the relevant files. Otherwise we expect a path to the .shp file and assume all the other files are in that directory.
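
A sketch of the zipfile check, assuming the stream is seekable (the standard library zipfile module needs that) and treating .shp, .shx and .dbf as mandatory with .prj reported when present. That split, and the function name find_shapefile_members, are assumptions.

```python
import io
import os
import zipfile

REQUIRED_EXTENSIONS = {".shp", ".shx", ".dbf"}
OPTIONAL_EXTENSIONS = {".prj"}  # assumed optional: "usually" present


def find_shapefile_members(stream):
    """Return {extension: member name} for shapefile parts in a zip stream."""
    wanted = REQUIRED_EXTENSIONS | OPTIONAL_EXTENSIONS
    members = {}
    with zipfile.ZipFile(stream) as archive:
        for name in archive.namelist():
            extension = os.path.splitext(name)[1].lower()
            if extension in wanted:
                # Keep the first match per extension.
                members.setdefault(extension, name)
    missing = REQUIRED_EXTENSIONS - set(members)
    if missing:
        raise ValueError("zip is missing: " + ", ".join(sorted(missing)))
    return members


if __name__ == "__main__":
    # Build a tiny in-memory archive with empty placeholder files.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in ("kc.shp", "kc.shx", "kc.dbf", "kc.prj"):
            archive.writestr(name, b"")
    buffer.seek(0)
    print(find_shapefile_members(buffer))
```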

TODO: what about URLs? Should we support cases where shp files are in an online directory?

Implementation Options

Via GDAL

ogr2ogr -f geoJSON kc.json kc.shp

Could GDAL be used on a server? (Probably not on app-engine, Heroku etc)

Even better is to use the Fiona wrapper for GDAL: http://toblerity.github.com/fiona/

pyshp (Python)

Pure python code for reading and writing shape files.

http://code.google.com/p/pyshp/

Looks pretty good, see, for examples: http://code.google.com/p/pyshp/wiki/PyShpDocs

Note: not clear what its projection support is (is this a problem for reading stuff ...?)

Some work would need to be done to convert the source data into nice geojson style python dicts etc.

Take a look at https://github.com/okfn/dataproxy - quite a bit to reuse

Documentation there could be reused (and it's quite nice - maybe
Core app is written using a slightly odd framework but you can ignore that and take a look at some of the logic e.g.
- checking size of file
- how transformations/converters are named and handed off to)
- csv transform: can at least reuse the parameters https://github.com/okfn/dataproxy/blob/master/dataproxy/transform/csv_transform.py
- Note that we should probably just switch to messytables for everything related to xls or csv!!

EU Nuts shapefiles https://github.com/datasets/reference-staging/tree/master/eu-nuts/nuts2-shapefile/data
Ask @amercader

Python Proposal

def convert(stream, metadata):
    """Convert a stream from one format to another.

    :param stream: file-like stream (only needs a read method)
    :param metadata: metadata about the stream that could be relevant,
        e.g. the types of the various fields (in the case of e.g. CSV -> JSON)
    :return (data, metadata, errors): data is a file-like stream in most
        cases but could also be e.g. JSON
    """

What about allowing URLs rather than just a stream?
Is synchronous (not asynchronous)
How are errors handled (major errors throw exceptions but what about e.g. validation errors?)
- Major errors that prevent continuing operation generate an Exception
- Minor errors, e.g. fields that don't validate (if that is relevant), go into the errors return value. In general, at this point, errors should be exceptions

class Converter:
    def convert(self, stream, metadata):
         return (data, metadata, errors)
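
To make the proposal concrete, here is a hedged sketch of one converter following that interface. CsvToJsonConverter is a hypothetical name. It assumes a text stream, reads the whole input into memory, and treats rows with the wrong number of cells as minor errors. It only illustrates the shape of the (data, metadata, errors) return value.

```python
import csv
import io
import json


class Converter(object):
    def convert(self, stream, metadata):
        raise NotImplementedError


class CsvToJsonConverter(Converter):
    """Illustrative CSV -> JSON converter (array of hashes)."""

    def convert(self, stream, metadata):
        errors = []
        # Only stream.read() is used, as the proposal requires.
        reader = csv.reader(io.StringIO(stream.read()))
        try:
            headers = next(reader)
        except StopIteration:
            # Major error: nothing to convert, so raise.
            raise ValueError("empty input")
        records = []
        for line_no, row in enumerate(reader, start=2):
            if len(row) != len(headers):
                # Minor error: record it and carry on.
                errors.append({"line": line_no,
                               "message": "expected %d cells, got %d"
                                          % (len(headers), len(row))})
                continue
            records.append(dict(zip(headers, row)))
        out_metadata = dict(metadata or {})
        out_metadata["fields"] = [{"id": name} for name in headers]
        data = io.StringIO(json.dumps(records))
        return data, out_metadata, errors
```

The line numbers in errors count rows yielded by the csv reader, so they match physical lines only when no quoted field spans several lines.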

Reference material

Tika