rufuspollock-okfn / dataconverters
Python library and command line tool for converting data from one format to another
Home Page: http://okfnlabs.org/dataconverters/
Info wanted!
Wanted to get population per UK police force. Found
I would like to get this as CSV or JSON (in fact just the second worksheet, though you don't know that until you open ...). I would like to either upload this to a service or point at the URL and have it converted (BTW: it is not clear from the extension whether this is xls or xlsx ...)
See esp http://tika.apache.org/1.2/parser.html which has this as interface for the libraries
void parse( InputStream stream, ContentHandler handler, Metadata metadata, ParseContext context) throws IOException, SAXException, TikaException;
Create a pluggable system for new converters and their tests.
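One possible shape for such a pluggable system is a registry keyed by format name. This is only a sketch: `CONVERTERS`, `register` and `get_converter` are hypothetical names, not part of the existing library.

```python
import csv

# Hypothetical registry mapping a source format name to a converter callable.
CONVERTERS = {}


def register(fmt):
    """Decorator that registers a converter function under a format name."""
    def decorator(func):
        CONVERTERS[fmt.lower()] = func
        return func
    return decorator


@register("csv")
def parse_csv(stream):
    """Example plugin: read CSV text into a list of dicts."""
    return list(csv.DictReader(stream))


def get_converter(fmt):
    """Look up a converter, failing loudly for unknown formats."""
    try:
        return CONVERTERS[fmt.lower()]
    except KeyError:
        raise ValueError("no converter registered for %r" % fmt)
```

A new converter then only needs its own module with a decorated function, and its tests can call `get_converter` with the same format name.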
What should the API of the conversion web-service look like?
May either be:
Do via headers as per Gut
...
https://github.com/maxogden/gut
A gut web service (essentially a WebHook) has the following attributes:
https://github.com/okfn/dataproxy
So instead of
{u'date': u'2011-01-03', u'place': u'Berkeley', u'temperature': u'5'}
We would get
{u'date': '2011-01-03', u'place': u'Berkeley', u'temperature': 5}
Or some unambiguous string representation if converted to json.
RP: note that @domoritz originally suggested having date as datetime object. I think isoformat date(time) is better.
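A minimal sketch of this kind of type guessing, assuming dates arrive as `YYYY-MM-DD` and are kept as ISO-format strings as suggested above. The names `cast_value` and `cast_row` are hypothetical, and the rules are deliberately naive (for example, `"nan"` would be cast to a float).

```python
from datetime import datetime


def cast_value(value):
    """Guess a type for one string cell: int, then float, then ISO date."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    try:
        # Dates stay strings, normalised to ISO format (assumed YYYY-MM-DD input).
        return datetime.strptime(value, "%Y-%m-%d").date().isoformat()
    except ValueError:
        return value  # leave anything unrecognised as the original string


def cast_row(row):
    """Apply cast_value to every cell of a dict-style row."""
    return {key: cast_value(val) for key, val in row.items()}
```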
Issue for discussing a shapefile to GeoJSON.
A shapefile is actually comprised of at least 3 (and usually 4) files (.shp, .shx, .dbf, and .prj). All of these are needed.
As such if a stream is provided it should be to a zipfile containing all of the relevant files. Otherwise we expect a path to the .shp file and assume all the other files are in that directory.
TODO: what about URLs? Should we support cases where shp files are in an online directory?
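The path-or-zipfile rule could be sketched with the standard library alone. `locate_shapefile` is a hypothetical helper; it assumes the files sit at the top level of the zip with lowercase extensions, and it treats .prj as optional.

```python
import os
import tempfile
import zipfile

REQUIRED = (".shp", ".shx", ".dbf")  # .prj is usual but treated as optional here


def locate_shapefile(source):
    """Return the path to a .shp file, given a path or a zipfile stream."""
    if isinstance(source, str):
        shp_path = source
    else:
        # A stream is assumed to be a zipfile holding all the parts.
        target = tempfile.mkdtemp()
        with zipfile.ZipFile(source) as archive:
            archive.extractall(target)
        candidates = [n for n in os.listdir(target) if n.lower().endswith(".shp")]
        if not candidates:
            raise ValueError("zipfile contains no .shp file")
        shp_path = os.path.join(target, candidates[0])
    # Check that the sibling files live next to the .shp file.
    base = os.path.splitext(shp_path)[0]
    missing = [ext for ext in REQUIRED if not os.path.exists(base + ext)]
    if missing:
        raise ValueError("missing shapefile parts: %s" % ", ".join(missing))
    return shp_path
```

The returned path can then be handed to whichever reader is chosen (ogr2ogr, Fiona or pyshp).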
ogr2ogr -f geoJSON kc.json kc.shp
Could GDAL be used on a server? (Probably not on app-engine, Heroku etc)
Even better is to use the Fiona wrapper for GDAL: http://toblerity.github.com/fiona/
Pure python code for reading and writing shape files.
http://code.google.com/p/pyshp/
Looks pretty good, see, for examples: http://code.google.com/p/pyshp/wiki/PyShpDocs
Note: not clear what its projection support is (is this a problem for reading stuff ...?)
Some work would need to be done to convert the source data into nice geojson style python dicts etc.
The dataconverters library doesn't allow defining target format yet.
Create data converters service which consumes data-converters library
eg. to Crisis Response team -- can we be put in touch? Somebody who has a need for data conversion as a service, and potential opinions about the API presentation.
Tab separated value files are as common as csv files. This is a requirement for the ckan importer service.
Is this for ckanext-datastorer?
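Reading TSV needs little beyond the standard `csv` module with a tab delimiter. A minimal sketch, where `parse_tsv` is a hypothetical name:

```python
import csv


def parse_tsv(stream):
    """Yield each row of a tab-separated text stream as a dict."""
    reader = csv.DictReader(stream, delimiter="\t")
    for row in reader:
        yield dict(row)
```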
How do we handle large files and should we support streaming API of some form?
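One way to keep memory flat for large files is to convert row by row instead of loading everything. The sketch below assumes CSV input and writes one JSON object per line; that output layout is an illustration only, not a decision recorded here, and `stream_csv_to_jsonlines` is a hypothetical name.

```python
import csv
import json


def stream_csv_to_jsonlines(source, sink):
    """Copy rows from a CSV text stream to a sink, one JSON object per line.

    Only one row is held in memory at a time. Returns the number of rows written.
    """
    count = 0
    for row in csv.DictReader(source):
        sink.write(json.dumps(row) + "\n")
        count += 1
    return count
```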
Had started this drawing (but this is for complex case):
node implementation: https://github.com/maxogden/node-mdb
used on http://filebakery.com
XML is very generic. We'd need a schema mapping (XSLT)?
This issue is about defining a library API for converters.
def convert(stream, metadata)
    :param stream: file-like stream (only needs a read event)
    :param metadata: metadata about the stream that could be relevant e.g. the types of the various fields (in case of e.g. CSV -> JSON)
    :return (data, metadata, errors): data is a file-like stream in most cases but could also be e.g. JSON

class Converter:
    def convert(self, stream, metadata):
        return (data, metadata, errors)
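To make the proposed signature concrete, here is a sketch of one converter following the `(data, metadata, errors)` contract. `CsvToJson`, the `delimiter` metadata key and the error message format are assumptions for illustration only.

```python
import csv
import io
import json


class Converter:
    """Base class following the proposed interface."""

    def convert(self, stream, metadata):
        raise NotImplementedError


class CsvToJson(Converter):
    """Hypothetical converter: CSV text stream in, JSON text stream out."""

    def convert(self, stream, metadata):
        errors = []
        rows = []
        reader = csv.DictReader(stream, delimiter=metadata.get("delimiter", ","))
        for row in reader:
            if None in row:
                # DictReader files surplus cells under the key None.
                errors.append("line %d: too many fields" % reader.line_num)
                continue
            rows.append(row)
        out_metadata = {"fields": reader.fieldnames or []}
        return io.StringIO(json.dumps(rows)), out_metadata, errors
```

Returning errors alongside the data, instead of raising, lets a caller decide whether a few bad rows should abort the whole conversion.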
https://developers.google.com/appengine/docs/python/conversion/overview
But what JSON schema exactly :-)
The question of output format is essentially a question of representation of tabular data in JSON and is therefore much more general than this specific converter. We enumerate various options. Issues to consider:
We are adopting Format 3
Format 1:
[ [ xxx, yyy, zzz, ...] [...] ]
Format 2:
[ { fieldname: value, ... } { ... } ]
Format 3:
This corresponds to adding some kind of explicit field / schema information to the previous 2 formats:
{
  fields: [ { field-descriptor }, { field-descriptor } ],
  data: [ as per either format 1 or format 2 ]
}
A field descriptor is:
{
  id: field name, or column number if the field name is blank
  type: type of field
  label: (optional) human readable label if different from id (usually won't be here so should be omitted!)
  format: (optional) additional format info about field e.g. data format if known
}
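As an illustration of the fields-plus-data layout, the sketch below wraps a list of dict rows. The helper name `to_format3` and the default type `"string"` are assumptions; the optional `label` and `format` keys are simply omitted.

```python
def to_format3(rows, types=None):
    """Wrap dict-style rows in a {fields, data} structure.

    `types` optionally maps field ids to type names; fields without an
    entry fall back to "string" (an assumed default).
    """
    types = types or {}
    field_ids = list(rows[0].keys()) if rows else []
    fields = [{"id": name, "type": types.get(name, "string")} for name in field_ids]
    return {"fields": fields, "data": rows}
```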
Following discussions with CKAN team, change the data-converters project to a library.
Simple elegant front page.
Question: do we provide a user interface for providing files (RP: I'd say yes)
This does get somewhat complicated once the upload forms have to support all the relevant options though ...
Required to handle #27.
Brief intro plus the slide-deck?
Plus email to relevant lists
The file types for files POSTed to the converters API can be guessed from the mime type.
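A sketch of such guessing, using the request's Content-Type header first and the file name as a fallback via the standard `mimetypes` module. The `MIME_TO_FORMAT` table and the function name `guess_format` are hypothetical, and file-name results depend on the platform's MIME database.

```python
import mimetypes

# Assumed mapping from MIME type to a converter format name.
MIME_TO_FORMAT = {
    "text/csv": "csv",
    "text/tab-separated-values": "tsv",
    "application/json": "json",
    "application/vnd.ms-excel": "xls",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet": "xlsx",
}


def guess_format(content_type=None, filename=None):
    """Return a format name for a POSTed file, or None if unknown."""
    if content_type:
        # Drop parameters such as "; charset=utf-8".
        mime = content_type.split(";")[0].strip().lower()
    elif filename:
        mime, _encoding = mimetypes.guess_type(filename)
    else:
        mime = None
    return MIME_TO_FORMAT.get(mime)
```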
Useful for prioritisation and triage of other issues.
Suggest we just have a kml_parse method that yields iterator over geojson style python dicts.
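A rough sketch of what `kml_parse` might look like with only the standard library. It assumes the KML 2.2 namespace and handles Point placemarks only; other geometry types are skipped.

```python
import xml.etree.ElementTree as ET

KML_NS = "{http://www.opengis.net/kml/2.2}"  # assumes KML 2.2 documents


def kml_parse(stream):
    """Yield GeoJSON-style feature dicts for each Point placemark."""
    tree = ET.parse(stream)
    for placemark in tree.iter(KML_NS + "Placemark"):
        coords = placemark.findtext("%sPoint/%scoordinates" % (KML_NS, KML_NS))
        if coords is None:
            continue  # not a Point; lines and polygons are not handled here
        # KML stores "lon,lat[,alt]"; GeoJSON wants [lon, lat].
        lon, lat = [float(part) for part in coords.strip().split(",")[:2]]
        yield {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": {"name": placemark.findtext(KML_NS + "name")},
        }
```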
Various options here.
Note this is not supported by Fiona bindings for gdal (atm)
Other libraries:
In order to further process some data, it is necessary that the types of the columns are in the results.
See the In progress slide deck
The API should also accept file content POSTed to it.
Document all existing features on Read the Docs.