Ian,
I've been experimenting over the past week with your excellent ckanapi. I'm hitting obstacles when handling large files, though: I run out of memory on a 4GB Ubuntu VM, and I was hoping you might be able to point me in the right direction. To debug, I've been using an IPython notebook with pandas and ckanapi, plus a few other modules. My use case is pushing 500k records (22 fields) to a datastore table. This equates to about 140MB in CSV form and >300MB in what I think is the equivalent of your jsonl format.
I abandoned trying to feed a file upload, since CKAN doesn't even attempt to ingest this file size. I also tried pointing at a URL to the file on S3, but again, datapusher doesn't even try to tackle this.
So I'm trying to use the datastore action API commands. At present, to stop the notebook kernel from constantly crashing when it runs out of memory, I'm splitting a dataframe containing the circa 500k records into pieces: I add a single row together with a datastore_create call and then push the rest in 100k chunks using datastore_upsert. This kind of works, but it still comes back with a 504 error (see bottom) and tries to emit all the added records back into the 'out' cell of the notebook (I'm not sure if there's a recommended way to suppress this).
# slices are half-open, so each piece starts where the previous one stopped
dfprev1=dfprev[:1]
dfprev2=dfprev[1:100000]
dfprev3=dfprev[100000:200000]
dfprev4=dfprev[200000:300000]
dfprev5=dfprev[300000:400000]
dfprev6=dfprev[400000:]
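Hand-written slice bounds are easy to get wrong by one row, so a small generator may be safer. This is a minimal sketch, not anything taken from ckanapi: `iter_chunks` is a hypothetical helper name, and it assumes the input supports `len()` and slicing (a list does, and so does a DataFrame with the default index, where `df[a:b]` selects rows by position).

```python
def iter_chunks(rows, chunk_size):
    """Yield consecutive slices of `rows`, each at most `chunk_size` long.

    `rows` can be anything that supports len() and slicing, such as a list
    or a pandas DataFrame with the default integer index.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(rows), chunk_size):
        # Slices are half-open: the next chunk starts exactly at the
        # previous stop, so no row is skipped or repeated.
        yield rows[start:start + chunk_size]


# Small demonstration with a plain list standing in for the dataframe.
demo_chunks = list(iter_chunks(list(range(10)), 4))
```

Because it is a generator, only one slice needs to be materialised at a time when the chunks are consumed in a loop.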
This is some of the code I'm using to transform the dataframe object in pandas into something equivalent to jsonl. I referenced this SO thread: http://stackoverflow.com/questions/20639631/how-to-convert-pandas-dataframe-to-the-desired-json-format
import StringIO  # Python 2 module, matching the 2.7 environment in the traceback below
import pandas as pd

output = StringIO.StringIO()  # in-memory buffer to receive the JSON text
dfprev6.to_json(path_or_buf=output, date_format='iso', orient='records')  # dfprev6 is a slice of the larger dataframe
contents = output.getvalue()  # the JSON text as a single string
records_new = pd.json.loads(contents)  # parse it into a list of record dicts
mysite.action.datastore_upsert(resource_id='0a8462d3-4c81-474a-bf84-3f2941ac67c0',
                               records=records_new,
                               force=True, primary_key=['ID_BB_GLOBAL'])
Presumably some sort of streaming method that feeds the large dataframe to ckanapi in chunks would work better, but I'm not sure of the best approach. I was wondering if you might have some sample code that would achieve this.
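For illustration, the kind of loop in question might look like the sketch below. It is only a sketch under assumptions: `push_in_chunks` and its `upsert` argument are hypothetical names, the default batch size of 5000 is an untested guess, and the function takes any callable so that it does not depend on a particular ckanapi signature.

```python
def push_in_chunks(records, upsert, chunk_size=5000):
    """Send `records` to `upsert` in batches and return how many were sent.

    `upsert` is any callable that accepts a list of record dicts, for
    example a thin wrapper around a datastore action call.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    sent = 0
    for start in range(0, len(records), chunk_size):
        batch = records[start:start + chunk_size]
        # The return value is deliberately discarded so that a notebook
        # does not echo every stored record back into the output cell.
        upsert(batch)
        sent += len(batch)
    return sent
```

Wired to the call used above, this could be invoked as `sent = push_in_chunks(records_new, lambda batch: mysite.action.datastore_upsert(resource_id=RESOURCE_ID, records=batch, force=True))`, where `RESOURCE_ID` is a placeholder. Assigning the result to a variable also stops the notebook from displaying it. Note that this still holds the full record list in memory; converting each dataframe slice to records just before sending it would keep only one batch of dicts alive at a time.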
In your readme, you have some command-line examples of feeding a jsonl file. It's not clear to me if I can use this for datastore data ... if so, would I include a resource_id at the beginning of the file? I couldn't find a sample jsonl file in the repo to see the structure that would include ckan meta-data. I can see from jsonl.org that the format I generate and assign to records (described above) aligns. Would I still have to find some mechanism to chunk the data to overcome the memory issues?
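On the format question, `to_json(orient='records')` produces a single JSON array, whereas JSON Lines puts one JSON object on each line. The sketch below shows that layout and a lazy reader that yields fixed-size batches, so only one batch is held in memory at a time. The helper names and the `PX` field are made up for illustration, and nothing here says whether the ckanapi command line accepts such a file for datastore data.

```python
import io
import json


def write_jsonl(records, fileobj):
    """Write one JSON object per line (the JSON Lines layout)."""
    for record in records:
        fileobj.write(json.dumps(record) + "\n")


def read_jsonl_batches(fileobj, batch_size):
    """Lazily yield lists of up to `batch_size` records from a JSON Lines file."""
    batch = []
    for line in fileobj:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        batch.append(json.loads(line))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly shorter, batch


# In-memory demonstration with made-up rows.
buf = io.StringIO()
write_jsonl([{"ID_BB_GLOBAL": "A1", "PX": 1.5},
             {"ID_BB_GLOBAL": "B2", "PX": 2.0},
             {"ID_BB_GLOBAL": "C3", "PX": 2.5}], buf)
buf.seek(0)
demo_batches = list(read_jsonl_batches(buf, 2))
```

Recent pandas versions also accept `lines=True` together with `orient='records'` in `to_json`, which may remove the need for a hand-written writer.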
Thanks for your input on this.
Colum
This is the 504 error I get back when trying to push 100k records in increments using datastore_upsert.
CKANAPIError Traceback (most recent call last)
in ()
1 mysite.action.datastore_upsert(resource_id='0a8462d3-4c81-474a-bf84-3f2941ac67c0',
2 records=records_new,
----> 3 force=True, primary_key=['ID_BB_GLOBAL'])
/usr/local/lib/python2.7/dist-packages/ckanapi-3.3_dev-py2.7.egg/ckanapi/common.pyc in action(**kwargs)
48 data_dict=nonfiles,
49 files=files)
---> 50 return self._ckan.call_action(name, data_dict=kwargs)
51 return action
52
/usr/local/lib/python2.7/dist-packages/ckanapi-3.3_dev-py2.7.egg/ckanapi/remoteckan.pyc in call_action(self, action, data_dict, context, apikey, files)
80 else:
81 status, response = self._request_fn(url, data, headers, files)
---> 82 return reverse_apicontroller_action(url, status, response)
83
84 def _request_fn(self, url, data, headers, files):
/usr/local/lib/python2.7/dist-packages/ckanapi-3.3_dev-py2.7.egg/ckanapi/common.pyc in reverse_apicontroller_action(url, status, response)
104
105 # don't recognize the error
--> 106 raise CKANAPIError(repr([url, status, response]))
CKANAPIError: ['http://172.17.0.2/api/action/datastore_upsert', 504, u'\r\n<title>504 Gateway Time-out</title>\r\n\r\n
504 Gateway Time-out
\r\n
nginx/1.1.19\r\n\r\n\r\n']
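A 504 from nginx means the proxy stopped waiting for the application behind it, so smaller requests are one plausible way around it. The sketch below halves a failing batch and retries each half. It is an assumption-laden illustration: `upsert_with_split` is a hypothetical helper, it catches every exception for brevity (in practice one would probably catch only the `CKANAPIError` shown in the traceback), and it relies on the upsert being safe to repeat, since CKAN may still complete a request after the proxy has given up on it.

```python
def upsert_with_split(batch, upsert, min_size=1):
    """Try to send `batch`; on failure, halve it and retry each half.

    Returns the number of records sent. Re-raises the original error if a
    batch of `min_size` records or fewer still fails.
    """
    try:
        upsert(batch)
        return len(batch)
    except Exception:
        # Assumption: a failure here is a timeout caused by batch size.
        if len(batch) <= min_size:
            raise
        mid = len(batch) // 2
        return (upsert_with_split(batch[:mid], upsert, min_size)
                + upsert_with_split(batch[mid:], upsert, min_size))
```

Starting from a moderate batch size and letting this fall back to smaller ones avoids having to guess in advance the largest request the server can finish before the gateway timeout.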