meltwaterarchive / datasift-python
Python client to interface with DataSift
Home Page: http://datasift.com/
License: MIT License
Exception in thread Thread-3:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
self.run()
File "/usr/local/lib/python2.7/dist-packages/datasift-0.5.1-py2.7.egg/datasift/streamconsumer_http.py", line 127, in run
self._sock = resp.fp._sock.fp._sock
AttributeError: addinfourl instance has no attribute '_sock'
Any ideas?
A customer was using Python 3.4 in a Windows environment and getting a traceback ending with
from twisted.python import lockfile, failure
File "C:\Python34\lib\site-packages\twisted\python\lockfile.py", line 52, in <module>
_open = file
NameError: name 'file' is not defined
when importing the datasift module.
Apparently this is due to a known problem on Windows with Python 3.X (glyph is a core maintainer of twisted): http://www.scriptscoop.net/t/7d436f5544a8/twisted-work-with-python-3-3.html
Can we update the README advising Windows users to use Python 2.7?
Here's the diff.
--- init.py 2013-03-20 00:14:48.000000000 -0700
+++ init.py.patch 2013-03-20 00:20:18.000000000 -0700
@@ -1458,7 +1458,7 @@
         if self._user.use_ssl():
             protocol = 'https'
         if isinstance(self._hashes, list):
-            return "%s://%smulti?hashes=%s" % (protocol, self._user.stream_base_url, ','.join(self._hashes))
+            return "%s://%smulti?hashes=%s" % (protocol, self._user._stream_base_url, ','.join(self._hashes))
         else:
             return "%s://%s%s" % (protocol, self._user._stream_base_url, self._hashes)
When getting the error: "The rate limit for twitter has been exceeded" the library repeatedly attempts to reconnect without regard for the connection delay.
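One way to respect a connection delay is exponential backoff between reconnection attempts. The following is a minimal sketch, not the library's code: `reconnect_with_backoff`, `connect`, `base_delay` and `max_delay` are hypothetical names, and treating `IOError` as the connection failure is an assumption.

```python
import time


def reconnect_with_backoff(connect, max_attempts=5, base_delay=1.0,
                           max_delay=60.0, sleep=time.sleep):
    """Call connect() until it succeeds, doubling the wait after each failure.

    Returns the list of delays that were waited, so callers can log them.
    All names here are illustrative, not part of the DataSift client.
    """
    delays = []
    delay = base_delay
    for attempt in range(max_attempts):
        try:
            connect()
            return delays
        except IOError:
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt
            sleep(delay)
            delays.append(delay)
            delay = min(delay * 2, max_delay)  # cap the wait
    return delays
```

Passing `sleep` as a parameter keeps the helper easy to exercise without real waiting.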
To call validate, the params passed into the PushDefinition must already carry the "output_params." prefix, whereas when PushDefinition.subscribe is called, the API adds the "output_params." prefix itself.
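Until the two calls behave the same way, a caller could normalise the keys before passing them on. This is a small sketch under assumptions: `prefix_output_params` is a hypothetical helper, and only the prefix string itself comes from the report above.

```python
OUTPUT_PARAMS_PREFIX = 'output_params.'  # the prefix string named in the issue


def prefix_output_params(params, prefix=OUTPUT_PARAMS_PREFIX):
    """Return a copy of params in which every key carries the prefix exactly once."""
    prefixed = {}
    for key, value in params.items():
        if not key.startswith(prefix):
            key = prefix + key  # add the prefix only when it is missing
        prefixed[key] = value
    return prefixed
```

Because keys that already have the prefix are left alone, the same dict can be used for both validate and subscribe.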
See https://travis-ci.org/datasift/datasift-python/jobs/60543799 for details
There does not currently seem to be a way to cleanly stop Client once start_stream_subscriber() has been called. Ctrl+C causes it to throw KeyboardInterrupt exceptions even if handled in the calling code. It looks like this might be because Client uses twisted by calling reactor.run() in client.py but contains no code to call reactor.stop().
It is currently impossible to update the hash of an active push subscription (perhaps because of the API).
I think this code:
    subscription = datasift_user.get_push_subscription(subscription_id)
    subscription._hash = new_hash
    subscription.save()
would work if the save method in the PushSubscription class:
    def save(self):
        """
        Save changes to the name and output parameters of this subscription.
        """
        params = {
            'id': self.get_id(),
            'name': self.get_name()
        }
        for key in self.get_output_params():
            params['%s%s' % (self.OUTPUT_PARAMS_PREFIX, key)] = self.get_output_param(key)
        self._init(self._user.call_api('push/update', params))
were changed to add hash to params:
    def save(self):
        """
        Save changes to the name and output parameters of this subscription.
        """
        params = {
            'id': self.get_id(),
            'name': self.get_name(),
            'hash': self._hash
        }
        for key in self.get_output_params():
            params['%s%s' % (self.OUTPUT_PARAMS_PREFIX, key)] = self.get_output_param(key)
        self._init(self._user.call_api('push/update', params))
Thanks
On Python 3 (3.4.3) a simple stream client like this:
    from __future__ import print_function
    from datasift import Client

    ds = Client("Username", "API Key")

    @ds.on_delete
    def on_delete(interaction):
        print('Deleted interaction %s ' % interaction)

    @ds.on_open
    def on_open():
        print('Streaming ready, can start subscribing')
        csdl = 'interaction.content contains "music"'
        stream = ds.compile(csdl)['hash']

        @ds.subscribe(stream)
        def subscribe_to_hash(msg):
            print(msg)

    @ds.on_closed
    def on_close(wasClean, code, reason):
        print('Streaming connection closed')

    @ds.on_ds_message
    def on_ds_message(msg):
        print('DS Message %s' % msg)

    ds.start_stream_subscriber()
throws the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.4/multiprocessing/process.py", line 254, in _bootstrap
self.run()
File "/usr/local/lib/python3.4/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/datasift-python/datasift/client.py", line 237, in _stream
options = ssl.optionsForClientTLS(hostname=WEBSOCKET_HOST.decode("utf-8"))
AttributeError: 'str' object has no attribute 'decode'
The python client hardcodes a socket timeout of 5s when it runs the initial request (L75 of streamconsumer_http.py at the time of writing):
resp = urllib2.urlopen(req, None, 5)
This appears to cause problems later on with low-volume streams. Once a socket timeout has occurred, the stream appears to never receive any more data.
Removing that socket timeout fixes the problem, but then the stream thread is difficult to shut down without terminating the whole process.
How often does the DataSift streaming endpoint send its 'connected' keepalives? 2 x that value is probably a sensible value for the timeout, even if it does mean that a stream may take a long time to shut down.
➜ ~ sudo pip install datasift --upgrade
The directory '/Users/jason/Library/Caches/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
The directory '/Users/jason/Library/Caches/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Collecting datasift
Downloading datasift-2.6.0.tar.gz
Requirement already up-to-date: requests<3.0.0,>=2.8.0 in /Library/Python/2.7/site-packages (from datasift)
Requirement already up-to-date: autobahn<0.10.0,>=0.9.4 in /Library/Python/2.7/site-packages (from datasift)
Collecting six<2.0.0,>=1.6.0 (from datasift)
Downloading six-1.10.0-py2.py3-none-any.whl
Collecting twisted<16.0.0,>=14.0.0 (from datasift)
Collecting pyopenssl<0.16.0,>=0.15.1 (from datasift)
Downloading pyOpenSSL-0.15.1-py2.py3-none-any.whl (102kB)
100% |████████████████████████████████| 106kB 1.3MB/s
Collecting python-dateutil<3,>=2.1 (from datasift)
Downloading python_dateutil-2.4.2-py2.py3-none-any.whl (188kB)
100% |████████████████████████████████| 192kB 1.3MB/s
Requirement already up-to-date: service-identity>=14.0.0 in /Library/Python/2.7/site-packages (from datasift)
Requirement already up-to-date: requests-futures>=0.9.5 in /Library/Python/2.7/site-packages (from datasift)
Collecting ndg-httpsclient>=0.4.0 (from datasift)
Downloading ndg_httpsclient-0.4.0.tar.gz
Collecting zope.interface>=3.6.0 (from twisted<16.0.0,>=14.0.0->datasift)
Collecting cryptography>=0.7 (from pyopenssl<0.16.0,>=0.15.1->datasift)
Downloading cryptography-1.1.1-cp27-none-macosx_10_6_intel.whl (1.3MB)
100% |████████████████████████████████| 1.3MB 359kB/s
Requirement already up-to-date: characteristic>=14.0.0 in /Library/Python/2.7/site-packages (from service-identity>=14.0.0->datasift)
Collecting pyasn1-modules (from service-identity>=14.0.0->datasift)
Downloading pyasn1_modules-0.0.8-py2.py3-none-any.whl
Collecting pyasn1 (from service-identity>=14.0.0->datasift)
Downloading pyasn1-0.1.9-py2.py3-none-any.whl
Requirement already up-to-date: futures>=2.1.3 in /Library/Python/2.7/site-packages (from requests-futures>=0.9.5->datasift)
Collecting setuptools (from zope.interface>=3.6.0->twisted<16.0.0,>=14.0.0->datasift)
Downloading setuptools-18.6.1-py2.py3-none-any.whl (462kB)
100% |████████████████████████████████| 462kB 1.0MB/s
Collecting enum34 (from cryptography>=0.7->pyopenssl<0.16.0,>=0.15.1->datasift)
Collecting ipaddress (from cryptography>=0.7->pyopenssl<0.16.0,>=0.15.1->datasift)
Downloading ipaddress-1.0.15-py27-none-any.whl
Collecting idna>=2.0 (from cryptography>=0.7->pyopenssl<0.16.0,>=0.15.1->datasift)
Downloading idna-2.0-py2.py3-none-any.whl (61kB)
100% |████████████████████████████████| 61kB 2.4MB/s
Collecting cffi>=1.1.0 (from cryptography>=0.7->pyopenssl<0.16.0,>=0.15.1->datasift)
Downloading cffi-1.3.1-cp27-none-macosx_10_10_intel.whl (192kB)
100% |████████████████████████████████| 196kB 1.8MB/s
Collecting pycparser (from cffi>=1.1.0->cryptography>=0.7->pyopenssl<0.16.0,>=0.15.1->datasift)
Installing collected packages: six, setuptools, zope.interface, twisted, enum34, ipaddress, pyasn1, idna, pycparser, cffi, cryptography, pyopenssl, python-dateutil, ndg-httpsclient, datasift, pyasn1-modules
Found existing installation: six 1.6.1
Uninstalling six-1.6.1:
Successfully uninstalled six-1.6.1
Found existing installation: setuptools 18.0.1
Uninstalling setuptools-18.0.1:
Successfully uninstalled setuptools-18.0.1
Found existing installation: zope.interface 4.1.2
Uninstalling zope.interface-4.1.2:
Successfully uninstalled zope.interface-4.1.2
Found existing installation: Twisted 14.0.2
Uninstalling Twisted-14.0.2:
Successfully uninstalled Twisted-14.0.2
Found existing installation: pyasn1 0.1.8
Uninstalling pyasn1-0.1.8:
Successfully uninstalled pyasn1-0.1.8
Found existing installation: pyOpenSSL 0.13.1
DEPRECATION: Uninstalling a distutils installed project (pyopenssl) has been deprecated and will be removed in a future version. This is due to the fact that uninstalling a distutils project will only partially uninstall the project.
Uninstalling pyOpenSSL-0.13.1:
Exception:
Traceback (most recent call last):
File "/Library/Python/2.7/site-packages/pip/basecommand.py", line 211, in main
status = self.run(options, args)
File "/Library/Python/2.7/site-packages/pip/commands/install.py", line 311, in run
root=options.root_path,
File "/Library/Python/2.7/site-packages/pip/req/req_set.py", line 640, in install
requirement.uninstall(auto_confirm=True)
File "/Library/Python/2.7/site-packages/pip/req/req_install.py", line 716, in uninstall
paths_to_remove.remove(auto_confirm)
File "/Library/Python/2.7/site-packages/pip/req/req_uninstall.py", line 125, in remove
renames(path, new_path)
File "/Library/Python/2.7/site-packages/pip/utils/__init__.py", line 315, in renames
shutil.move(old, new)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shutil.py", line 302, in move
copy2(src, real_dst)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shutil.py", line 131, in copy2
copystat(src, dst)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shutil.py", line 103, in copystat
os.chflags(dst, st.st_flags)
OSError: [Errno 1] Operation not permitted: '/tmp/pip-awPwpa-uninstall/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/pyOpenSSL-0.13.1-py2.7.egg-info'
The error I get is -
Traceback (most recent call last):
  File "./run-historic.py", line 6, in <module>
    user.list_historics()
  File "/usr/local/lib/python2.7/dist-packages/datasift-0.5.3-py2.7.egg/datasift/__init__.py", line 198, in list_historics
    return Historic.list(self, page, per_page)
  File "/usr/local/lib/python2.7/dist-packages/datasift-0.5.3-py2.7.egg/datasift/__init__.py", line 523, in list
    retval['historics'].append(Historic(user, historic))
  File "/usr/local/lib/python2.7/dist-packages/datasift-0.5.3-py2.7.egg/datasift/__init__.py", line 541, in __init__
    self._init(hash)
  File "/usr/local/lib/python2.7/dist-packages/datasift-0.5.3-py2.7.egg/datasift/__init__.py", line 612, in _init
    raise InvalidDataError('The volume info is missing')
datasift.InvalidDataError: The volume info is missing
With the combination
target top_level: li.all.mentions.company_name
target child_1: li.subtype
target child_2: li.user.type
(in fact, with every combination that uses that top-level target) the library raises an exception:
DataSiftApiException: The analysis configuration contains an invalid target: li.all.mentions.company_name
Our request json/dict is the following
{
  "parameters": {
    "child": {
      "child": {
        "parameters": {
          "threshold": 5,
          "target": "li.user.type"
        },
        "analysis_type": "freqDist"
      },
      "parameters": {
        "threshold": 10,
        "target": "li.subtype"
      },
      "analysis_type": "freqDist"
    },
    "parameters": {
      "threshold": 200,
      "target": "li.all.mentions.company_name"
    },
    "analysis_type": "freqDist"
  }
}
The service is linkedin.
Yet the same call works when made through the Pylon web interface.
What's happening?
My pull method is returning:
"twitter": {
"created_at": "Mon, 17 Mar 2014 14:29:24 +0000",
"filter_level": "medium", ...
output_mapper expects "created_at" to be in the date_handler_short format "%Y-%m-%d %H:%M:%S", but my results are in the long format "%a, %d %b %Y %H:%M:%S +0000".
This raises: ValueError: time data 'Mon, 17 Mar 2014 15:01:08 +0000' does not match format '%Y-%m-%d %H:%M:%S'
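A tolerant parser could try both formats in turn. The sketch below is an illustration rather than the library's output_mapper: `parse_created_at` and `DATE_FORMATS` are hypothetical names, and only the two format strings come from the report. Note that `%a` and `%b` depend on the locale, so this assumes an English or C locale.

```python
from datetime import datetime

# The two formats mentioned in the report: short first, then long.
DATE_FORMATS = (
    '%Y-%m-%d %H:%M:%S',
    '%a, %d %b %Y %H:%M:%S +0000',
)


def parse_created_at(value):
    """Parse a created_at string, trying each known format in order."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue  # try the next format
    raise ValueError('unrecognised date format: %r' % value)
```

For example, `parse_created_at('Mon, 17 Mar 2014 15:01:08 +0000')` and `parse_created_at('2014-03-17 15:01:08')` yield the same naive datetime.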
According to
http://dev.datasift.com/docs/platform/api/rest-api/endpoints
the endpoints currently used in
https://github.com/datasift/datasift-python/blob/master/datasift/pylon_task.py
need to be updated: the task {type} is not taken into account.
The User.create_definition() method requires a bytestring: there's an explicit isinstance check for str. Ideally it should also accept unicode and encode it itself, so that the application can focus on the data rather than on encoding and decoding.
If I get time, I'll prepare a small patch.
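A patch along those lines might coerce the input before the isinstance check. This is only a sketch: `to_bytes` is a hypothetical helper, and UTF-8 as the default encoding is an assumption.

```python
def to_bytes(csdl, encoding='utf-8'):
    """Return csdl as a bytestring, encoding it only if it is text.

    On Python 2, bytes is str, so existing bytestrings pass through unchanged.
    """
    if isinstance(csdl, bytes):
        return csdl
    return csdl.encode(encoding)
```

With this in place the caller can pass either type and the method sees a bytestring in both cases.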
Originally raised in http://dev.datasift.com/discussions/python-api-windows
I have started to use the Python API and it seems to be working fine. However, I have tried to run some of the examples in the documentation with little success.
I am especially interested in the Live-Stream example.
When I run it, I get this error:
RuntimeError:
Attempt to start a new process before the current process
has finished its bootstrapping phase.
This probably means that you are on Windows and you have
forgotten to use the proper idiom in the main module:
if __name__ == '__main__':
freeze_support()
...
The "freeze_support()" line can be omitted if the program
is not going to be frozen to produce a Windows executable.
I know that I managed to connect to my account and fetch the status data, so it is not a connectivity problem.
I am a bit new so I might be missing something basic, but the documentation is scarce for Python users. Any hint or direction would be really appreciated.
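The idiom the error message asks for looks like the sketch below. It uses only the standard library; `worker` and `main` are hypothetical stand-ins for the streaming work that the client runs in a child process.

```python
from multiprocessing import Process, Queue


def worker(queue):
    """Stand-in for the work done in the child process."""
    queue.put('started')


def main():
    queue = Queue()
    child = Process(target=worker, args=(queue,))
    child.start()
    message = queue.get()  # wait for the child to report in
    child.join()
    return message


if __name__ == '__main__':
    # On Windows the child process re-imports this module, so anything
    # that starts a process must sit behind this guard.
    print(main())
```

Applied to the Live-Stream example, this would mean moving the call that starts the stream subscriber under the `if __name__ == '__main__':` guard.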
Hello!
I can't install the package due to a conflict with my openssl version and the version of pyopenssl you've pinned. Would it be possible to bump it to a version around 0.15, as that installs fine? :)
I managed to get the package installed by using --no-deps and then installing pyopenssl before datasift :)
Here's the related issue: pyca/pyopenssl#276
Automated builds triggered by pull requests fail with the error:
Please export a github OAUTH token as GITHUB_TOKEN to run these tests
The successful builds from master all export GITHUB_TOKEN as part of the step:
Setting environment variables from .travis.yml
Hi,
This is really a server-side issue, but I figured that if the changes go through on the server side, the client would need some modifications as well (hopefully in the near future).
Currently the validate and compile endpoints accept a POST request with URL parameters specifying the CSDL to be validated or compiled. This works well in normal cases. However, if the CSDL is too large for URL parameters to handle, I get a 414 Request-URI Too Long error (which, by the way, is not handled by DataSift's API endpoints: I still get a 200 response code in the header). The real solution, imho, would be to accept an HTTP body payload on the server side, which is what POST is mainly used for anyway.
Kindly,
woozyking
If a user tries to access an endpoint they do not have access to, the Python client throws an error along the lines of:
Traceback (most recent call last):
File "historics.py", line 15, in <module>
print(datasift.historics.status(start, end_time))
File "/Library/Python/2.7/site-packages/datasift/historics.py", line 102, in status
return self.request.get('status', params=params)
File "/Library/Python/2.7/site-packages/datasift/request.py", line 39, in get
return self.build_response(self('get', path, params=params, headers=headers), path=path)
File "/Library/Python/2.7/site-packages/datasift/request.py", line 84, in build_response
if int(response.headers.get("x-ratelimit-cost")) > int(response.headers.get("x-ratelimit-remaining")):
TypeError: int() argument must be a string or a number, not 'NoneType'
The client should raise a dedicated exception type when the response contains '"error":"You do not have permission to access this endpoint"'.
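A possible shape for such a check is sketched below. It is not the code in request.py: `PermissionDeniedError`, `header_as_int` and `check_response` are hypothetical names, and only the header names and the error text come from the traceback and message above.

```python
class PermissionDeniedError(Exception):
    """Hypothetical exception for endpoints the user may not access."""


def header_as_int(headers, name):
    """Return the header as an int, or None when the header is absent."""
    value = headers.get(name)
    return int(value) if value is not None else None


def check_response(headers, body):
    """Raise on a permission error, then compare rate-limit headers safely."""
    error = body.get('error') or ''
    if 'do not have permission' in error:
        raise PermissionDeniedError(error)
    cost = header_as_int(headers, 'x-ratelimit-cost')
    remaining = header_as_int(headers, 'x-ratelimit-remaining')
    if cost is not None and remaining is not None and cost > remaining:
        return 'rate-limited'
    return 'ok'
```

Checking the body first means a missing rate-limit header can no longer surface as a TypeError.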
With the code from the quickstart tutorial I get this error:
Process Process-1:
Traceback (most recent call last):
  File "/usr/lib/python3.4/multiprocessing/process.py", line 254, in _bootstrap
    self.run()
  File "/usr/lib/python3.4/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.4/dist-packages/datasift/client.py", line 237, in _stream
    options = ssl.optionsForClientTLS(hostname=WEBSOCKET_HOST.decode("utf-8"))
AttributeError: 'str' object has no attribute 'decode'
If, in the tutorial's code, I replace
client = datasift.Client('DATASIFT_USERNAME', 'DATASIFT_API_KEY')
with
client = datasift.Client(b'DATASIFT_USERNAME', b'DATASIFT_API_KEY')
I get this error instead:
Traceback (most recent call last):
File "tutorial_datasift.py", line 9, in <module>
fltr = client.compile(csdl)
File "/usr/local/lib/python3.4/dist-packages/datasift/client.py", line 256, in compile
return self.request.post('compile', data=dict(csdl=csdl))
File "/usr/local/lib/python3.4/dist-packages/datasift/request.py", line 46, in post
return self.build_response(self('post', path, params=params, headers=headers, data=data), path=path)
File "/usr/local/lib/python3.4/dist-packages/datasift/request.py", line 86, in build_response
raise AuthException(data)
datasift.exceptions.AuthException: {'error': 'Authorization failed'}
If I replace line 237 of client.py
options = ssl.optionsForClientTLS(hostname=WEBSOCKET_HOST.decode("utf-8"))
with
options = ssl.optionsForClientTLS(hostname=WEBSOCKET_HOST)
it seems to run.
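A change that works on both Python 2 and Python 3 would decode only when the value is actually bytes. This is a sketch with a hypothetical helper name, `as_text`; the host value in the usage note is a placeholder, not the real WEBSOCKET_HOST.

```python
def as_text(value, encoding='utf-8'):
    """Return value as text, decoding it only if it is a bytestring.

    On Python 3 a str has no decode(), which is the AttributeError above;
    on Python 2 a plain str is bytes and still gets decoded.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value
```

The call site could then read `hostname=as_text(WEBSOCKET_HOST)`, and `as_text(b'websocket.example.com')` and `as_text(u'websocket.example.com')` return the same text.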
After subscribing to a stream I get this error: "Stream subscriber shutting down because connection was closed uncleanly (peer dropped the TCP connection without previous WebSocket closing handshake)".
Warning while importing datasift
c:\Python27\lib\site-packages\zope.interface-4.1.2-py2.7-win32.egg\zope\__init__.py:3: UserWarning: Module twisted was already imported from c:\Python27\lib\site-packages\twisted\__init__.pyc, but c:\python27\lib\site-packages\autobahn-0.9.6-py2.7.egg is being added to sys.path
import pkg_resources
c:\Python27\lib\site-packages\twisted\internet\win32eventreactor.py:64: UserWarning: Reliable disconnection notification requires pywin32 215 or later
category=UserWarning)
Can someone suggest a fix for this?
The dictionary that we get from client.account.identity.list() has an updated_at value that is an integer, not a datetime object as would be expected.
{
"api_key": "dff990e42c14ef5d5aa280b0e9fea9e2",
"created_at": "Wed, 13 May 2015 10:46:05 GMT",
"expires_at": null,
"id": "5dbb799eea004fcb3e2d999d767e0a20",
"label": "DataSift",
"master": true,
"status": "active",
"updated_at": 1440604653
}
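Until the client converts the field itself, a caller could normalise it. The sketch below assumes the integer is a Unix timestamp in seconds (an assumption, though the value shown is consistent with it) and needs Python 3 for `timezone.utc`; `normalise_timestamps` is a hypothetical helper.

```python
from datetime import datetime, timezone


def normalise_timestamps(identity, keys=('updated_at',)):
    """Return a copy of identity with integer timestamps turned into datetimes."""
    result = dict(identity)
    for key in keys:
        value = result.get(key)
        if isinstance(value, int):
            # Assumption: the integer is seconds since the Unix epoch, in UTC.
            result[key] = datetime.fromtimestamp(value, tz=timezone.utc)
    return result
```

Values that are not integers, such as a null expires_at, are left untouched.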
On receiving the following error:
{"status":"failure","message":"You have insufficient credits available to consume the stream"}
the Python library keeps trying to reconnect. It should recognise this message and stop its reconnection attempts. We should also check whether the same happens when invalid auth credentials are sent.
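The decision could be isolated in a small predicate, as in the sketch below. `should_reconnect` and `FATAL_MESSAGES` are hypothetical names, and matching on a fragment of the message text is an assumption; only the JSON shape comes from the error above.

```python
import json

# Assumption: fragments of failure messages after which retrying is pointless.
FATAL_MESSAGES = ('insufficient credits',)


def should_reconnect(raw_message):
    """Return False when the server reports a failure that retrying cannot fix."""
    try:
        payload = json.loads(raw_message)
    except ValueError:
        return True  # not JSON, so not a recognised fatal message
    if not isinstance(payload, dict) or payload.get('status') != 'failure':
        return True
    message = str(payload.get('message', '')).lower()
    return not any(fragment in message for fragment in FATAL_MESSAGES)
```

An entry for the invalid-credentials message could be added to `FATAL_MESSAGES` once its exact wording is confirmed.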
A large try/except captures much of the reading code, with the except line here:
https://github.com/datasift/datasift-python/blob/develop/datasift/streamconsumer_http.py#L110
This catches everything, including errors in client's handler code, making debugging much harder.
The try/except should be more targeted in terms of the code that it surrounds, and the exception type should be a lot more specific (perhaps just socket exceptions, if that's what it's trying to catch, which is what the error message implies).
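The narrower structure suggested here might look like the following sketch. It is not the code in streamconsumer_http.py: `read_and_dispatch`, `read_chunk`, `handler` and `on_socket_error` are hypothetical names.

```python
import socket


def read_and_dispatch(read_chunk, handler, on_socket_error):
    """Read one chunk and pass it to the handler.

    Only the read is guarded, and only socket errors are caught, so a bug
    in the client's handler propagates with its own traceback.
    """
    try:
        chunk = read_chunk()
    except socket.error as exc:
        on_socket_error(exc)
        return False
    handler(chunk)  # deliberately outside the try block
    return True
```

With this split, an exception raised inside the handler is no longer reported as a connection problem.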
When running python setup.. the library throws an authentication error: operator not validate, shutting down to avoid lockup or fatal exception.
I've been unable to update a subscription's output params by calling client.push.update. I receive a response with status code 200 but the output_params of the subscription remain unchanged.
I'm passing the following dict to push.update (sensitive values edited away):
output_params: {
    'host': 'host.example.com',
    'port': 22,
    'auth': {
        'username': 'my_username',
        'password': 'my_password'
    },
    'directory': '/path/to/datasift/files',
    'file_prefix': 'datasift',
    'format': 'json_meta',
    'delivery_frequency': 300,
    'max_size': 10485760,
    'mark_in_progress': 0
}
A prior call to push.validate with this same dict returns Validation Successful.