trifle / twitterresearch
A starter kit with code for data collection, preparation, and analysis of digital trace data collected on Twitter
License: Other
Currently, when querying Twitter's Streaming API, the process stops immediately upon manual cancellation. This can be problematic: the resulting .json files may end with a final, incomplete tweet, which breaks subsequent analytical steps.
A safeguard would be excellent here, for example by checking the last tweet in any .json file for completeness, or by automatically deleting the last line of any .json file.
Develop clearer and more prominently positioned error messages.
Text export of top retweets from the database breaks off because of Unicode encoding issues. This cannot be fixed directly on user machines.
Track usage of credentials to warn users if they attempt to use the same set multiple times.
I have the following error:
TypeError Traceback (most recent call last)
in ()
----> 1 examples.print_user_archive()
/home/ubuntu/twitterresearch/examples.pyc in print_user_archive()
111 """
112 archive_generator = rest.fetch_user_archive("lessig")
--> 113 for page in archive_generator:
114 for tweet in page:
115 print_tweet(tweet)
/home/ubuntu/twitterresearch/rest.pyc in fetch_user_archive(user, **kwargs)
261 # If we have a valid max_id, use that; else do a simple normal request
262 result, tweets = fetch_user_tweets(
--> 263 user, max_id=max_id, **kwargs) if max_id else fetch_user_tweets(user)
264 # Set the status variable - if it's not 200, that's an error and the loop exits
265 status = result.status_code
/home/ubuntu/twitterresearch/rest.pyc in fetch_user_tweets(user, **kwargs)
237 elif isinstance(user, str):
238 kwargs['screen_name'] = user
--> 239 result = throttled_call(USER_TIMELINE_URL, params=kwargs)
240 # Decode JSON
241 return (result, json.loads(result.text))
/home/ubuntu/twitterresearch/rest.pyc in wrapper(*args, **kwargs)
141 patched_data = lengthen_text(response_json)
142 # Monkey patching since .text and .content are read-only
--> 143 response._content = bytes(json.dumps(patched_data), encoding='utf-8')
144 return response
145 return wrapper
TypeError: str() takes at most 1 argument (2 given)
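The `.pyc` paths and the message `str() takes at most 1 argument (2 given)` suggest the examples were run under Python 2, where `bytes` is an alias for `str` and does not accept an `encoding` argument. A version-agnostic sketch of the failing line (the helper name is hypothetical):

```python
import json

def encode_patched(patched_data):
    """Serialize patched JSON data to UTF-8 bytes in a way that
    works on both Python 2 (where bytes is str) and Python 3:
    str.encode() exists in both versions."""
    return json.dumps(patched_data).encode("utf-8")
```

Alternatively, running the tutorial under Python 3 (as the repo's examples assume) avoids the issue entirely.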
Thanks for the tutorial!
I'm having problems getting started. I've entered my tokens in keys.yaml in the main project directory, but I keep getting an IrrecoverableStreamException. The error message is below. (By the way, keys.yaml refers to the tokens differently from the Twitter developer site; I've tried different combinations but still can't get it to work.)
client_key: is this the same as the consumer key?
client_secret:
resource_owner_key: is this the same as the access token?
resource_owner_secret:
Very grateful for your help!
In [25]: examples.track_keywords()
ERROR:root:Connection HTTP error 401
-----------------------------------------------------------------------
IrrecoverableStreamException Traceback (most recent call last)
<ipython-input-25-e0f81e39dd68> in <module>()
----> 1 examples.track_keywords()
~/Desktop/DS/Twitter/twitterresearch/examples.py in track_keywords()
159 keywords = ["politics", "election"]
160 stream = streaming.stream(
--> 161 on_tweet=print_tweet, on_notification=print_notice, track=keywords)
162
163
~/Desktop/DS/Twitter/twitterresearch/streaming.py in stream(on_tweet, on_notification, track, follow)
167 if stream.status_code != 200:
168 stream.close()
--> 169 backoff(int(stream.status_code))
170 try:
171 for line in stream.iter_lines():
~/Desktop/DS/Twitter/twitterresearch/streaming.py in backoff(errorcode)
117 elif errorcode in [401, 403, 404, 406, 413, 416]:
118 logging.error(u"Connection HTTP error {0}".format(errorcode))
--> 119 raise IrrecoverableStreamException
120 # We don't handle any other errors
121 else:
IrrecoverableStreamException:
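The field names in keys.yaml follow OAuth1 terminology (as used by libraries like requests-oauthlib) rather than the labels on the Twitter developer dashboard. Assuming that convention applies here, the mapping would be:

```yaml
# keys.yaml                  # Twitter developer dashboard label
client_key: "..."            # API key (consumer key)
client_secret: "..."         # API secret key (consumer secret)
resource_owner_key: "..."    # access token
resource_owner_secret: "..." # access token secret
```

An HTTP 401 from the streaming endpoint typically means one of these four values is wrong, swapped, or copied with stray whitespace.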
Following up on the issue "Persistent Error in the use of import.json() #8", a new error emerged when trying to write data into the database. This time the function deduplicate_lowercase(tags) raises an error:
187 valid = filter(None, l)
188 lowercase = [e.lower() for e in valid]
189 if len(valid) != len(lowercase):
190 logging.warning("The input file had {0} empty lines, skipping those. Please verify that it is complete and valid.".format(len(lowercase) - len(valid)))
191 deduplicated = list(set(valid))
TypeError: object of type 'filter' has no len()
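In Python 3, filter() returns a lazy iterator, which has no len(); it must be materialized with list() first. The original comparison (len(valid) != len(lowercase)) also can never differ, and set(valid) discards the lowercasing; a sketch of the likely intent, under the assumption that the warning should count dropped empty entries:

```python
import logging

def deduplicate_lowercase(l):
    """Drop empty entries, lowercase the rest, and deduplicate.
    filter() must be wrapped in list() under Python 3 before
    len() can be applied."""
    valid = list(filter(None, l))
    lowercase = [e.lower() for e in valid]
    if len(valid) != len(l):
        logging.warning(
            "The input file had {0} empty lines, skipping those. "
            "Please verify that it is complete and valid.".format(
                len(l) - len(valid)))
    deduplicated = list(set(lowercase))
    return deduplicated
```

The same lazy-iterator pitfall applies to map() and zip() anywhere else the codebase was ported from Python 2.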
Dear developers,
it would be very helpful if you could add the option to query the REST search API for certain keywords, and enable users of your script package to also extract and save those results.
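As a rough illustration of what such a feature might look like (not the package's API): the legacy v1.1 standard search endpoint accepts a `q` parameter, so given an OAuth1-authenticated requests session, a keyword search could be sketched as:

```python
# Hypothetical helper; `session` is assumed to be an
# OAuth1-authenticated requests session (e.g. OAuth1Session).
SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"

def search_tweets(session, query, count=100):
    """Fetch recent keyword matches from the v1.1 standard search
    endpoint and return the list of tweet dicts."""
    response = session.get(SEARCH_URL, params={
        "q": query,
        "count": count,
        "result_type": "recent",
    })
    response.raise_for_status()
    return response.json().get("statuses", [])
```

The returned tweet dicts could then be passed to the package's existing save/import machinery like any streamed tweet.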
Date and time are crucial to almost all kinds of data analysis, but at the same time they are notoriously difficult to handle. There are two major sources of confusion and errors, which correspond to the layers of modification applied to get from a universal reference time to local time: (a) a timezone offset is added to a reference time to get local time, and (b) if applicable, a daylight saving offset yields the seasonally correct time. (This presentation on date and time may be helpful: http://www.youtube.com/watch?v=ZroB-e4RXmo.)
When working with Twitter, it is crucial to keep two things in mind and separate: the Twitter API always returns date/time in UTC, the universal reference time (no timezone offset, no daylight saving time!). However, depending on your object of study, users will see and use their local time. In many cases, we need to convert the native UTC dates into some other format. Some best practices for handling this are:
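As a concrete illustration: the v1.1 API delivers `created_at` strings such as "Wed Oct 10 20:19:24 +0000 2018", always with a +0000 (UTC) offset. In Python 3 these parse directly into timezone-aware datetimes (assuming an English locale for the weekday and month abbreviations):

```python
from datetime import datetime

def parse_twitter_date(created_at):
    """Parse Twitter's created_at format, e.g.
    'Wed Oct 10 20:19:24 +0000 2018', into an aware UTC datetime.
    %z consumes the +0000 offset, so the result carries tzinfo."""
    return datetime.strptime(created_at, "%a %b %d %H:%M:%S %z %Y")

# Store and compute in UTC; convert to a local timezone only for display.
dt = parse_twitter_date("Wed Oct 10 20:19:24 +0000 2018")
```

Keeping all stored timestamps aware and in UTC, and converting only at display time, avoids both the timezone-offset and the daylight-saving pitfalls described above.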
Currently, retweet_links in network.py is quite slow.
This may be due both to the costly (and possibly suboptimal) SQL query and to the subsequent mapping/writing stages.
Hi, I'm having issues running the examples part of twitter_auth.py. It always raises a StopIteration error when I attempt to 'import examples' to run the demonstrations on the Republican debates.
More specifically, it fails when rest.py is imported and calls the authorize method from twitter_auth.py (i.e. auth = twitter_auth.authorize()). Putting the path of my local copy of keys.yaml as a string before (and in place of) the 'find_keyfile' function didn't work either. It is always a StopIteration error.
I don't understand, because the code shouldn't even be reaching the next() operator. Very confused.
Hi - Thanks for your help on the last issue. I have made a few steps forward but I am now having issues with the examples.save_track_keywords() function:
When run through IPython in the command line, I receive the pre-formatted error message suggested by the exception handling in the streaming.stream() :
In [2]: import examples
In [3]: examples.save_track_keywords()
ERROR:root:Error! Encountered Exception write() argument must be str, not dict but continuing in order not to drop stream,
.... (1 for each tweet until keyboard interrupt)
KeyboardInterrupt
I'm assuming this refers to the save_tweet() function. I've tried tinkering with the format of the 'tweet' parameter to put it in string format, but that hasn't worked, and neither has changing the write function to json.dump().
How to proceed?
When using the function import.json(), a specific error appears repeatedly across various users. The function fails when trying to add data from json files to the database field "url": during deduplication of the input it raises "'NoneType' object has no attribute 'lower'".
As a quick fix, we simply stopped filling the respective database field. After that, the problem disappeared and the rest of the data loads perfectly. But since this prevents the analysis of hyperlinks in tweets, a better solution would be ideal.
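The likely cause is that tweets without links yield None for the url field, and .lower() is then called on None. A better fix than dropping the field entirely would be to filter out the None entries before lowercasing (hypothetical helper, not the repo's code):

```python
def deduplicate_urls(urls):
    """Drop None entries (tweets without links) before calling
    .lower(), so real URLs survive for hyperlink analysis."""
    return sorted({u.lower() for u in urls if u is not None})
```

This keeps the url field populated for tweets that do contain links while skipping those that don't.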
I'm going through the tutorial, but the export_retweet_text function is throwing an error in peewee. As I'm not familiar with SQL syntax or the peewee library, could someone give me a hint on debugging or resolving this issue?
here's my input. All the other functions so far have worked fine.
examples.export_retweet_text()
and here's the error message I get:
---------------------------------------------------------------------------
StopIteration Traceback (most recent call last)
~/anaconda3/lib/python3.6/site-packages/peewee.py in get(self)
3219 try:
-> 3220 return next(clone.execute())
3221 except StopIteration:
playhouse/_speedups.pyx in playhouse._speedups._QueryResultWrapper.__next__()
playhouse/_speedups.pyx in playhouse._speedups._QueryResultWrapper.iterate()
StopIteration:
During handling of the above exception, another exception occurred:
TweetDoesNotExist Traceback (most recent call last)
<ipython-input-31-1934918c595d> in <module>()
----> 1 examples.export_retweet_text()
2 #retweet_text = pd.read_csv("retweet_text.csv")
~/Desktop/DS/Twitter/twitterresearch/examples.py in export_retweet_text(n)
496 database.Tweet.retweet.is_null(False)).group_by(database.Tweet.retweet)
497 for tweet in retweets:
--> 498 rt_counts[tweet.retweet.id] = tweet.retweet.retweets.count()
499 from collections import Counter
500 c = Counter(rt_counts)
~/anaconda3/lib/python3.6/site-packages/peewee.py in __get__(self, instance, instance_type)
1384 def __get__(self, instance, instance_type=None):
1385 if instance is not None:
-> 1386 return self.get_object_or_id(instance)
1387 return self.field
1388
~/anaconda3/lib/python3.6/site-packages/peewee.py in get_object_or_id(self, instance)
1375 if rel_id is not None or self.att_name in instance._obj_cache:
1376 if self.att_name not in instance._obj_cache:
-> 1377 obj = self.rel_model.get(self.field.to_field == rel_id)
1378 instance._obj_cache[self.att_name] = obj
1379 return instance._obj_cache[self.att_name]
~/anaconda3/lib/python3.6/site-packages/peewee.py in get(cls, *query, **kwargs)
4986 if kwargs:
4987 sq = sq.filter(**kwargs)
-> 4988 return sq.get()
4989
4990 @classmethod
~/anaconda3/lib/python3.6/site-packages/peewee.py in get(self)
3222 raise self.model_class.DoesNotExist(
3223 'Instance matching query does not exist:\nSQL: %s\nPARAMS: %s'
-> 3224 % self.sql())
3225
3226 def peek(self, n=1):
TweetDoesNotExist: Instance matching query does not exist:
SQL: SELECT "t1"."id", "t1"."user_id", "t1"."text", "t1"."date", "t1"."language_id", "t1"."reply_to_user_id", "t1"."reply_to_tweet", "t1"."retweet_id" FROM "tweet" AS t1 WHERE ("t1"."id" = ?)
PARAMS: [0]
Dear everybody.
I attended a course at university and came into contact with twitterresearch there; I have since done some small analyses for the homework. For my master's thesis I would like to collect all tweets with #nobillag and #probillag. In Switzerland there is a vote about national media funding on March 4th. I wanted to collect everything until then, which is not possible with my laptop, so I bought a Raspberry Pi 3 for that.
Do you know why the same script that works on the MacBook Pro does not work on the Raspberry Pi? I have IPython installed and prepared the workspace as described in the tutorial.
Thank you for your help!
Here you find the nobillag.py file:
https://www.dropbox.com/s/mfenuty0rswpdly/nobillag.py?dl=0
Dear all
It's me again with an urgent matter. I was successfully downloading tweets about the World Economic Forum in Davos with the keywords 'wef2018' and 'davos2018'.
After almost one week and approximately 1 GB of data, the files stopped growing. I am using an Ubuntu web server, and two other scripts are running alongside on the same server. Is that the problem?
Could it have to do with the increasing volume of tweets during the WEF? I would be happy for help soon, because the WEF is in its second day today.
Many thanks!