rufuspollock-okfn / dataconverters

Python library and command line tool for converting data from one format to another.

Home Page: http://okfnlabs.org/dataconverters/
In order to process the data further, the column types need to be included in the results.
Wanted to get population per UK police force. Found
I would like to get this as CSV or JSON (in fact just the second worksheet, though you don't know that until you open it). I would like to either upload it to a service or point the service at the URL and have it converted. (BTW: it's not clear from the extension whether this is xls or xlsx ...)
The file types for files POSTed to the converters API can be guessed from the mime type.
The API should also accept file content POSTed to it.
Info wanted!
Useful for prioritisation and triage of other issues.
The dataconverters library doesn't yet allow specifying the target format.
But what JSON schema exactly :-)
The question of output format is essentially a question of representation of tabular data in JSON and is therefore much more general than this specific converter. We enumerate various options. Issues to consider:
We are adopting Format 3
Format 1 (array of arrays): `[ [ xxx, yyy, zzz, ... ], [ ... ] ]`
Format 2 (array of objects): `[ { fieldname: value, ... }, { ... } ]`
Format 3 corresponds to adding some kind of explicit field / schema information to the previous 2 formats:

```
{ "fields": [ { field-descriptor }, { field-descriptor }, ... ],
  "data": [ as per either format 1 or format 2 ] }
```

A field descriptor is:

```
{ "id": field name (or column number if the field name is blank),
  "type": type of the field,
  "label": (optional) human-readable label if different from id
           (usually won't be, so should be omitted!),
  "format": (optional) additional format info about the field,
            e.g. data format if known }
```
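As a short sketch, producing Format 3 output in code could look like this (the helper name and the sample values are purely illustrative):

```python
import json

def to_format3(headers, types, rows):
    """Wrap tabular data in the fields + data envelope (Format 3).

    Rows are kept as arrays (Format 1 style) inside the "data" key.
    """
    fields = [{"id": name, "type": typ} for name, typ in zip(headers, types)]
    return {"fields": fields, "data": rows}

result = to_format3(
    ["date", "place", "temperature"],
    ["date", "string", "integer"],
    [["2011-01-03", "Berkeley", 5]],
)
print(json.dumps(result))
```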
E.g. the Crisis Response team -- can we be put in touch? Someone who has a need for data conversion as a service, and potential opinions about the API presentation.
Had started this drawing (but this is for the complex case):
See esp http://tika.apache.org/1.2/parser.html which has this as interface for the libraries
```java
void parse(
    InputStream stream,
    ContentHandler handler,
    Metadata metadata,
    ParseContext context
) throws IOException, SAXException, TikaException;
```
What should the API of the conversion web-service look like?
May either be:
Do via headers as per Gut
...
https://github.com/maxogden/gut
A gut web service (essentially a WebHook) has the following attributes:
https://github.com/okfn/dataproxy
XML is very generic. We'd need a schema mapping (XSLT)?
Simple elegant front page.
Question: do we provide a user interface for providing files (RP: I'd say yes)
This does get somewhat complicated once the upload forms have to support all the relevant options though ...
So instead of
{u'date': u'2011-01-03', u'place': u'Berkeley', u'temperature': u'5'}
We would get
{u'date': '2011-01-03', u'place': u'Berkeley', u'temperature': 5}
Or some unambiguous string representation if converted to json.
RP: note that @domoritz originally suggested having date as datetime object. I think isoformat date(time) is better.
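A minimal sketch of how a caster could apply declared field types to raw CSV strings (the type names and the cast table are illustrative assumptions, not the library's actual type system):

```python
import datetime

# Illustrative cast table; the real set of supported types is TBD.
CASTS = {
    "integer": int,
    "float": float,
    # Validate and keep the ISO-format string rather than a datetime object.
    "date": lambda v: datetime.date.fromisoformat(v).isoformat(),
    "string": str,
}

def cast_row(row, field_types):
    """Cast each raw string value according to its declared field type."""
    return {key: CASTS[field_types[key]](value) for key, value in row.items()}

raw = {"date": "2011-01-03", "place": "Berkeley", "temperature": "5"}
types = {"date": "date", "place": "string", "temperature": "integer"}
print(cast_row(raw, types))
# {'date': '2011-01-03', 'place': 'Berkeley', 'temperature': 5}
```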
Required to handle #27.
See the In progress slide deck
Issue for discussing shapefile to GeoJSON conversion.
A shapefile actually comprises at least 3 (and usually 4) files (.shp, .shx, .dbf, and .prj). All of these are needed.
As such if a stream is provided it should be to a zipfile containing all of the relevant files. Otherwise we expect a path to the .shp file and assume all the other files are in that directory.
TODO: what about URLs? Should we support cases where shp files are in an online directory?
ogr2ogr -f geoJSON kc.json kc.shp
Could GDAL be used on a server? (Probably not on app-engine, Heroku etc)
Even better is to use the Fiona wrapper for GDAL: http://toblerity.github.com/fiona/
Pure python code for reading and writing shape files.
http://code.google.com/p/pyshp/
Looks pretty good, see, for examples: http://code.google.com/p/pyshp/wiki/PyShpDocs
Note: not clear what its projection support is (is this a problem for reading stuff ...?)
Some work would need to be done to convert the source data into nice geojson style python dicts etc.
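That conversion step might look like the following sketch, assuming pyshp-style shape objects that expose `__geo_interface__` (which pyshp provides) together with a parallel list of record values:

```python
def shape_to_feature(shape, field_names, record):
    """Build a GeoJSON-style Feature dict from a pyshp-style shape + record."""
    return {
        "type": "Feature",
        "geometry": shape.__geo_interface__,
        "properties": dict(zip(field_names, record)),
    }

class FakePoint:
    # Stand-in for a pyshp shape; real code would iterate Reader.shapeRecords().
    __geo_interface__ = {"type": "Point", "coordinates": (-122.27, 37.87)}

feature = shape_to_feature(FakePoint(), ["name"], ["Berkeley"])
```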
Tab-separated value files are as common as CSV files. This is a requirement for the CKAN importer service.
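Supporting TSV may be as simple as parameterising the delimiter in the existing CSV path; a sketch using the stdlib `csv` module:

```python
import csv
import io

def parse_delimited(stream, delimiter=","):
    """Yield rows from a delimited stream; pass delimiter="\\t" for TSV."""
    yield from csv.reader(stream, delimiter=delimiter)

tsv = io.StringIO("date\tplace\ttemperature\n2011-01-03\tBerkeley\t5\n")
rows = list(parse_delimited(tsv, delimiter="\t"))
# [['date', 'place', 'temperature'], ['2011-01-03', 'Berkeley', '5']]
```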
Create a pluggable system for new converters and their tests.
Is this for ckanext-datastorer?
This issue is about defining a library API for converters.
```python
def convert(stream, metadata):
    """
    :param stream: file-like stream (only needs a read method)
    :param metadata: metadata about the stream that could be relevant,
        e.g. the types of the various fields (in the case of e.g. CSV -> JSON)
    :return: (data, metadata, errors) -- data is a file-like stream in most
        cases but could also be e.g. JSON
    """

class Converter:
    def convert(self, stream, metadata):
        return (data, metadata, errors)
```
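A minimal concrete implementation of that interface, as a sketch (a CSV to list-of-dicts converter; the class name and the `fields` metadata key are illustrative assumptions):

```python
import csv
import io

class CSVConverter:
    """Sketch of a Converter: CSV stream -> list of dicts (Format 2)."""

    def convert(self, stream, metadata):
        errors = []
        reader = csv.DictReader(stream)
        data = [dict(row) for row in reader]
        # Pass the discovered header through in the output metadata.
        out_metadata = dict(metadata, fields=reader.fieldnames)
        return data, out_metadata, errors

stream = io.StringIO("place,temperature\nBerkeley,5\n")
data, meta, errors = CSVConverter().convert(stream, {})
```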
See esp http://tika.apache.org/1.2/parser.html which has this as interface for the libraries
```java
void parse(
    InputStream stream,
    ContentHandler handler,
    Metadata metadata,
    ParseContext context
) throws IOException, SAXException, TikaException;
```
https://developers.google.com/appengine/docs/python/conversion/overview
Suggest we just have a kml_parse method that yields GeoJSON-style Python dicts.
Various options here.
Note this is not supported by the Fiona bindings for GDAL (at the moment).
Other libraries:
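One zero-dependency option is the stdlib XML parser; a sketch of what `kml_parse` might look like for the simplest case (Point Placemarks only; the function name comes from the suggestion above, everything else is an assumption):

```python
import io
import xml.etree.ElementTree as ET

KML_NS = "{http://www.opengis.net/kml/2.2}"

def kml_parse(stream):
    """Yield GeoJSON-style Feature dicts for Point Placemarks in a KML stream.

    Sketch only: real KML also has LineString, Polygon, styles, etc.
    """
    tree = ET.parse(stream)
    for placemark in tree.iter(KML_NS + "Placemark"):
        name = placemark.findtext(KML_NS + "name")
        coords = placemark.findtext(KML_NS + "Point/" + KML_NS + "coordinates")
        if coords is None:
            continue
        lon, lat = [float(c) for c in coords.strip().split(",")[:2]]
        yield {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": {"name": name},
        }

kml = io.StringIO(
    '<kml xmlns="http://www.opengis.net/kml/2.2">'
    "<Placemark><name>Berkeley</name>"
    "<Point><coordinates>-122.27,37.87,0</coordinates></Point>"
    "</Placemark></kml>"
)
features = list(kml_parse(kml))
```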
Following discussions with CKAN team, change the data-converters project to a library.
Create data converters service which consumes data-converters library
node implementation: https://github.com/maxogden/node-mdb
used on http://filebakery.com
How do we handle large files and should we support streaming API of some form?
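One option is to keep the pipeline generator-based so large files are never fully loaded into memory; a sketch (JSON Lines as the streaming output format is an assumption):

```python
import csv
import io
import json

def stream_csv_to_jsonl(stream):
    """Lazily convert CSV rows to JSON Lines, one row at a time."""
    reader = csv.DictReader(stream)
    for row in reader:
        yield json.dumps(dict(row)) + "\n"

src = io.StringIO("place,temperature\nBerkeley,5\nLondon,3\n")
lines = list(stream_csv_to_jsonl(src))
```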
Brief intro plus the slide-deck?
Plus email to relevant lists
Document all existing features on read the docs.