
trifle / twitterresearch

43.0 43.0 27.0 4.65 MB

A starter kit with code for data collection, preparation, and analysis of digital trace data collected on Twitter

License: Other

Python 87.55% R 12.45%

twitterresearch's People

Contributors

millesimus, trifle

twitterresearch's Issues

Avoid incomplete tweets when querying Twitter's Streaming APIs

Currently, when querying Twitter's Streaming APIs, the process stops immediately upon manual cancellation. This can be problematic, as the resulting .json files may end with an incomplete final tweet, which then breaks subsequent analytical steps.

It would be excellent to have a safeguard here, for example by checking the last tweet in each .json file for completeness, or by automatically deleting the last line of each .json file.

examples.print_user_archive()

I have the following error:

In [5]: examples.print_user_archive()

TypeError                                 Traceback (most recent call last)
in ()
----> 1 examples.print_user_archive()

/home/ubuntu/twitterresearch/examples.pyc in print_user_archive()
    111     """
    112     archive_generator = rest.fetch_user_archive("lessig")
--> 113     for page in archive_generator:
    114         for tweet in page:
    115             print_tweet(tweet)

/home/ubuntu/twitterresearch/rest.pyc in fetch_user_archive(user, **kwargs)
    261         # If we have a valid max_id, use that; else do a simple normal request
    262         result, tweets = fetch_user_tweets(
--> 263             user, max_id=max_id, **kwargs) if max_id else fetch_user_tweets(user)
    264         # Set the status variable - if it's not 200, that's an error and the loop exits
    265         status = result.status_code

/home/ubuntu/twitterresearch/rest.pyc in fetch_user_tweets(user, **kwargs)
    237     elif isinstance(user, str):
    238         kwargs['screen_name'] = user
--> 239     result = throttled_call(USER_TIMELINE_URL, params=kwargs)
    240     # Decode JSON
    241     return (result, json.loads(result.text))

/home/ubuntu/twitterresearch/rest.pyc in wrapper(*args, **kwargs)
    141         patched_data = lengthen_text(response_json)
    142         # Monkey patching since .text and .content are read-only
--> 143         response._content = bytes(json.dumps(patched_data), encoding='utf-8')
    144         return response
    145     return wrapper

TypeError: str() takes at most 1 argument (2 given)
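The message "str() takes at most 1 argument (2 given)" indicates the code is running under Python 2, where bytes is just an alias for str and does not accept an encoding argument (the .pyc paths in the traceback also point to Python 2). If the package targets Python 3, running it under python3 should fix this; alternatively, a version-portable rewrite of the failing line would use str.encode, which works on both major versions. A sketch (the wrapper name is hypothetical):

```python
import json

def patch_content(response, patched_data):
    # Portable version of the failing line: .encode('utf-8') works under
    # both Python 2 and 3, whereas bytes(s, encoding=...) is Python-3-only.
    response._content = json.dumps(patched_data).encode('utf-8')
    return response
```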

IrrecoverableStreamException error

Thanks for the tutorial!

I'm having problems getting started. I've entered my tokens in keys.yaml in the main project directory, but I keep getting an IrrecoverableStreamException. The error message is below. (By the way, keys.yaml refers to the tokens differently than the Twitter developer site does; I've tried different combinations but still can't get it to work.)

client_key: is this the same as the consumer key?
client_secret:
resource_owner_key: is this the same as the access token?
resource_owner_secret:

Very grateful for your help!


In [25]: examples.track_keywords()
ERROR:root:Connection HTTP error 401
-----------------------------------------------------------------------
IrrecoverableStreamException          Traceback (most recent call last)
<ipython-input-25-e0f81e39dd68> in <module>()
----> 1 examples.track_keywords()

~/Desktop/DS/Twitter/twitterresearch/examples.py in track_keywords()
    159     keywords = ["politics", "election"]
    160     stream = streaming.stream(
--> 161         on_tweet=print_tweet, on_notification=print_notice, track=keywords)
    162 
    163 

~/Desktop/DS/Twitter/twitterresearch/streaming.py in stream(on_tweet, on_notification, track, follow)
    167         if stream.status_code != 200:
    168             stream.close()
--> 169             backoff(int(stream.status_code))
    170         try:
    171             for line in stream.iter_lines():

~/Desktop/DS/Twitter/twitterresearch/streaming.py in backoff(errorcode)
    117     elif errorcode in [401, 403, 404, 406, 413, 416]:
    118         logging.error(u"Connection HTTP error {0}".format(errorcode))
--> 119         raise IrrecoverableStreamException
    120     # We don't handle any other errors
    121     else:

IrrecoverableStreamException: 
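On the naming question: client_key, client_secret, resource_owner_key, and resource_owner_secret are the OAuth1 parameter names used by requests-oauthlib, so the mapping onto the developer portal's labels is presumably (placeholder values; the portal's exact wording changes over time):

```yaml
client_key: "YOUR_CONSUMER_KEY"                    # portal: API key / consumer key
client_secret: "YOUR_CONSUMER_SECRET"              # portal: API secret / consumer secret
resource_owner_key: "YOUR_ACCESS_TOKEN"            # portal: access token
resource_owner_secret: "YOUR_ACCESS_TOKEN_SECRET"  # portal: access token secret
```

If the mapping is correct and you still get HTTP 401, the usual suspects are regenerated or revoked tokens, stray whitespace around the values, or a system clock that is far off (OAuth1 signatures are timestamped).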

New Error Arising from deduplicating input in database import

After the issue "Persistent Error in the use of import.json() #8", a new error emerged when trying to write data into a database. This time the function deduplicate_lowercase(tags) raises an error:

187     valid = filter(None, l)
188     lowercase = [e.lower() for e in valid]
189     if len(valid) != len(lowercase):
190         logging.warning("The input file had {0} empty lines, skipping those. Please verify that it is complete and valid.".format(len(lowercase) - len(valid)))
191     deduplicated = list(set(valid))

TypeError: object of type 'filter' has no len()
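This is a Python 2 vs. 3 issue: in Python 3, filter() returns a lazy iterator, which has no len(). A sketch of a compatible rewrite; note that the warning condition is also changed to compare against the raw input, since valid and lowercase in the original always have the same length:

```python
import logging

def deduplicate_lowercase(l):
    # list() materializes the filter object, which has no len() in Python 3
    valid = list(filter(None, l))
    if len(valid) != len(l):
        logging.warning(
            "The input file had {0} empty lines, skipping those. "
            "Please verify that it is complete and valid.".format(len(l) - len(valid)))
    lowercase = [e.lower() for e in valid]
    deduplicated = list(set(lowercase))
    return deduplicated
```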

Rest API keyword search and extraction

Dear developers,
it would be very helpful if you could add the ability to search the REST API for certain keywords and let users of your package extract and save these results.

Youtube link seems to be the wrong video: has nothing to do with time zones

`Date and time are crucial to almost all kinds of data analysis, but at the same time they are notoriously difficult to handle. There are two major sources of confusion and errors, which correspond to the layers of modifications that are performed to get from a universal reference time to local time: (a) a timezone offset is added to a reference time to get local time and (b) if applicable, a daylight saving offset yields the seasonally correct time. (This presentation on date and time may be helpful: [http://www.youtube.com/watch?v=ZroB-e4RXmo]).
When working with Twitter, it is crucial to keep two things in mind and separate: the Twitter API always returns date/time in UTC, the universal reference time (no timezone offset, no daylight saving time!). However, depending on your object of study, users will see and use their local time. In many cases, we need to convert the native UTC dates into some other format. Some best practices for handling this are:

  • Always store and/or explicitly declare the timezone information of your data
  • Convert at the latest possible opportunity`
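The quoted advice can be sketched with the standard library (zoneinfo needs Python 3.9+; the Berlin timezone is an arbitrary example):

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+; earlier versions can use pytz

def parse_created_at(s):
    # The REST API's created_at field looks like "Wed Aug 27 13:08:45 +0000 2008"
    # and is always UTC (+0000); %z keeps that offset attached to the datetime.
    return datetime.strptime(s, "%a %b %d %H:%M:%S %z %Y")

utc = parse_created_at("Wed Aug 27 13:08:45 +0000 2008")
local = utc.astimezone(ZoneInfo("Europe/Berlin"))  # convert as late as possible
```

Storing the timezone-aware UTC value and converting only at display or analysis time follows both best practices above.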

retweet_links is slow

Currently, retweet_links in network.py is quite slow.
This might be due both to the costly (and possibly non-optimal) SQL query and to the subsequent mapping and writing stages.

find_keyfile() not working

Hi, I'm having an issue running the examples part of twitter_auth.py. It always raises a StopIteration error when I attempt to 'import examples' to run the demonstrations on the Republican Debates.

More specifically, it fails when rest.py is imported and calls the authorize method from twitter_auth.py (i.e. auth = twitter_auth.authorize()). Putting the path of my local copy of keys.yaml as a string before (and in place of) the find_keyfile function didn't work either. It is always a StopIteration error...

I don't understand, because the code shouldn't even be reaching the next() operator... very confused.

examples.save_track_keywords() write() argument error

Hi, thanks for your help on the last issue. I have made some progress, but I am now having trouble with the examples.save_track_keywords() function:

When run through IPython on the command line, I receive the pre-formatted error message produced by the exception handling in streaming.stream():

In [2]: import examples
In [3]: examples.save_track_keywords()
ERROR:root:Error! Encountered Exception write() argument must be str, not dict but continuing in order not to drop stream,

.... (1 for each tweet until keyboard interrupt)

ERROR:root:User stopped program, exiting!

KeyboardInterrupt

I'm assuming this refers to the save_tweet() function. I've tried tinkering with the format of the 'tweet' parameter to put it into string format, but that hasn't worked, and neither has changing the write function to json.dump().

How to proceed?

Persistent Error in the use of import.json()

When using the function import.json(), a specific error appears repeatedly across various users. The function fails when trying to add data from JSON files to the database field "url". The error occurs while deduplicating the input and reads: "NoneType" object has no attribute "lower".

As a quick fix, we simply stopped filling the respective database field. After that, the problem disappeared and the rest of the data loads perfectly. But since this hinders the analysis of hyperlinks in tweets, it would be ideal to find a better solution.

examples.export_retweet_text() is throwing a peewee error. Unsure how to debug.

I'm going through the tutorial, but the export_retweet_text function is throwing an error in peewee. As I'm not familiar with SQL syntax or the peewee library, could someone give me a hint on debugging or resolving this issue?

Here's my input; all the other functions so far have worked fine.

examples.export_retweet_text()

and here's the error message I get:

---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
~/anaconda3/lib/python3.6/site-packages/peewee.py in get(self)
   3219         try:
-> 3220             return next(clone.execute())
   3221         except StopIteration:

playhouse/_speedups.pyx in playhouse._speedups._QueryResultWrapper.__next__()

playhouse/_speedups.pyx in playhouse._speedups._QueryResultWrapper.iterate()

StopIteration: 

During handling of the above exception, another exception occurred:

TweetDoesNotExist                         Traceback (most recent call last)
<ipython-input-31-1934918c595d> in <module>()
----> 1 examples.export_retweet_text()
      2 #retweet_text = pd.read_csv("retweet_text.csv")

~/Desktop/DS/Twitter/twitterresearch/examples.py in export_retweet_text(n)
    496         database.Tweet.retweet.is_null(False)).group_by(database.Tweet.retweet)
    497     for tweet in retweets:
--> 498         rt_counts[tweet.retweet.id] = tweet.retweet.retweets.count()
    499     from collections import Counter
    500     c = Counter(rt_counts)

~/anaconda3/lib/python3.6/site-packages/peewee.py in __get__(self, instance, instance_type)
   1384     def __get__(self, instance, instance_type=None):
   1385         if instance is not None:
-> 1386             return self.get_object_or_id(instance)
   1387         return self.field
   1388 

~/anaconda3/lib/python3.6/site-packages/peewee.py in get_object_or_id(self, instance)
   1375         if rel_id is not None or self.att_name in instance._obj_cache:
   1376             if self.att_name not in instance._obj_cache:
-> 1377                 obj = self.rel_model.get(self.field.to_field == rel_id)
   1378                 instance._obj_cache[self.att_name] = obj
   1379             return instance._obj_cache[self.att_name]

~/anaconda3/lib/python3.6/site-packages/peewee.py in get(cls, *query, **kwargs)
   4986         if kwargs:
   4987             sq = sq.filter(**kwargs)
-> 4988         return sq.get()
   4989 
   4990     @classmethod

~/anaconda3/lib/python3.6/site-packages/peewee.py in get(self)
   3222             raise self.model_class.DoesNotExist(
   3223                 'Instance matching query does not exist:\nSQL: %s\nPARAMS: %s'
-> 3224                 % self.sql())
   3225 
   3226     def peek(self, n=1):

TweetDoesNotExist: Instance matching query does not exist:
SQL: SELECT "t1"."id", "t1"."user_id", "t1"."text", "t1"."date", "t1"."language_id", "t1"."reply_to_user_id", "t1"."reply_to_tweet", "t1"."retweet_id" FROM "tweet" AS t1 WHERE ("t1"."id" = ?)
PARAMS: [0]
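The PARAMS: [0] line suggests a tweet row whose retweet_id is 0, i.e. a reference to a retweeted tweet that is not in the database. One possible workaround is to skip such dangling references. This is a sketch with stub classes so it is self-contained; in the real code the except clause would catch peewee's database.Tweet.DoesNotExist, and the per-target counting would stay as the .retweets.count() aggregation:

```python
import logging

class TweetDoesNotExist(Exception):
    """Stand-in for peewee's database.Tweet.DoesNotExist."""

class Tweet:
    # Minimal stub: a tweet may reference a retweet target (retweet_id)
    # that was never imported, mirroring the PARAMS: [0] case above.
    def __init__(self, tweet_id, retweet_id, store):
        self.id = tweet_id
        self.retweet_id = retweet_id
        self._store = store

    @property
    def retweet(self):
        try:
            return self._store[self.retweet_id]
        except KeyError:
            raise TweetDoesNotExist(self.retweet_id)

def count_retweet_targets(tweets):
    # Tally resolvable retweet targets, skipping dangling references
    # instead of letting the DoesNotExist exception abort the export.
    rt_counts = {}
    for tweet in tweets:
        try:
            target = tweet.retweet
        except TweetDoesNotExist:
            logging.warning("Skipping tweet %s: retweet target %s not in database",
                            tweet.id, tweet.retweet_id)
            continue
        rt_counts[target.id] = rt_counts.get(target.id, 0) + 1
    return rt_counts
```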

Setup on a Raspberry Pi 3

Dear everybody,

I took a course at university and came into contact with twitterresearch; meanwhile, I have done some small analyses for the homework. For my master's thesis I would like to collect all tweets with #nobillag and #probillag. In Switzerland there is a vote about national media support on March 4th, and I want to collect everything until then, which is not possible with my laptop. That's why I bought a Raspberry Pi 3.

Do you know why the same script that works on the MacBook Pro does not work on the RPi? I have IPython installed and prepared the workspace as described in the tutorial.

Thank you for your help!

Here you find the nobillag.py file:
https://www.dropbox.com/s/mfenuty0rswpdly/nobillag.py?dl=0

[screenshot: error_rpi_twitterressearch]

Error after one successful week on downloading

Dear all,

It's me again with an urgent matter. I was successfully downloading tweets about the World Economic Forum in Davos with the keywords 'wef2018' and 'davos2018'.

After almost one week and approximately 1 GB, the data stopped growing. I am using an Ubuntu web server, and two other scripts are running on the same server. Is that the problem?

Could it have to do with the increasing volume of tweets during the WEF? I would be glad for quick help, because the WEF is in its second day today.

Big thanks!

[screenshot: screen_error_wef2018]
