docnow / twarc
A command line tool (and Python library) for archiving Twitter JSON
Home Page: https://twarc-project.readthedocs.io
License: MIT License
I'm wondering if you can help me. I'm encountering a syntax error while attempting to install. I'm running Python 3.4 on Mac OS X 10.7.5.
Any help would be appreciated. Thanks!
Here is the log from the terminal:
vpn166047:~ tristandahn$ pip install twarc
Downloading/unpacking twarc
Downloading twarc-0.1.2.tar.gz
Running setup.py (path:/private/var/folders/v2/vv94gk4n6ns3s3ywf8bjvmw40000gn/T/pip_build_tristandahn/twarc/setup.py) egg_info for package twarc
Downloading/unpacking oauth2 (from twarc)
Downloading oauth2-1.5.211.tar.gz
Running setup.py (path:/private/var/folders/v2/vv94gk4n6ns3s3ywf8bjvmw40000gn/T/pip_build_tristandahn/oauth2/setup.py) egg_info for package oauth2
Traceback (most recent call last):
File "<string>", line 17, in <module>
File "/private/var/folders/v2/vv94gk4n6ns3s3ywf8bjvmw40000gn/T/pip_build_tristandahn/oauth2/setup.py", line 18
print "unable to find version in %s" % (VERSIONFILE,)
^
SyntaxError: invalid syntax
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 17, in <module>
File "/private/var/folders/v2/vv94gk4n6ns3s3ywf8bjvmw40000gn/T/pip_build_tristandahn/oauth2/setup.py", line 18
print "unable to find version in %s" % (VERSIONFILE,)
^
SyntaxError: invalid syntax
Cleaning up...
Command python setup.py egg_info failed with error code 1 in /private/var/folders/v2/vv94gk4n6ns3s3ywf8bjvmw40000gn/T/pip_build_tristandahn/oauth2
Storing debug log for failure in /Users/tristandahn/.pip/pip.log
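The root cause is visible in the traceback: the oauth2 dependency's setup.py uses the Python 2 print statement, which Python 3 can no longer parse. A minimal illustration (the VERSIONFILE value here is a placeholder, not taken from the oauth2 package):

```python
# Python 2 statement form -- a SyntaxError under Python 3:
#   print "unable to find version in %s" % (VERSIONFILE,)
# Python 3 function form, which also parses under Python 2.6+:
VERSIONFILE = "oauth2/_version.py"  # placeholder value for illustration
print("unable to find version in %s" % (VERSIONFILE,))
```

Until the dependency is fixed or replaced, installing under Python 2 avoids the error entirely.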
Travis CI has a ~10-minute timeout for builds, which twarc often hits:
https://travis-ci.org/edsu/twarc/builds
https://travis-ci.org/edsu/twarc/jobs/44435453
============================= test session starts ==============================
platform linux2 -- Python 2.6.9 -- py-1.4.26 -- pytest-2.6.4
collected 6 items
test.py .....
No output has been received in the last 10 minutes, this potentially indicates a stalled build or something wrong with the build itself.
The build has been terminated
Any ideas what's causing this?
Is this something that might be useful for twarc?
Some users would like to have the archive.py utility as part of the pip install of twarc.
There has been an officially acknowledged bug for a couple of weeks: Twitter API calls (on all major platforms, and even in Twitter's own apps) frequently return 404s in an unknown state. It's also discussed here:
https://twittercommunity.com/t/intermittent-404-responses-from-rest-api/46712
Despite an apparent resolution by Twitter, it has started happening again.
twarc users have probably been hit recently too (requests.exceptions.HTTPError: 404 Client Error: Not Found); it looks like this:
2015-07-18 16:48:49,156 INFO archived 622089779308593152
Traceback (most recent call last):
File "../../twarc/utils/archive.py", line 141, in <module>
main()
File "../../twarc/utils/archive.py", line 107, in main
for tweet in tweets:
File "build/bdist.macosx-10.6-intel/egg/twarc.py", line 235, in search
File "build/bdist.macosx-10.6-intel/egg/twarc.py", line 188, in new_f
return None
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/models.py", line 851, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found
If it happens, retry your search until you get it without errors. It's not a permanent failure but an intermittent one on Twitter's side.
@edsu proposed introducing a catch-and-retry. I'm sharing this because I think it will keep happening, since they killed the fail whale.
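A minimal sketch of what such a catch-and-retry could look like (the with_retry helper and its parameters are hypothetical, not part of twarc):

```python
import time

def with_retry(fetch, tries=5, wait=2.0):
    """Call fetch() until it succeeds, retrying intermittent failures.

    fetch should raise an exception on failure (e.g. a 404 HTTPError);
    the last attempt's exception propagates if every try fails.
    """
    for attempt in range(tries):
        try:
            return fetch()
        except Exception:
            if attempt == tries - 1:
                raise  # give up after the final attempt
            time.sleep(wait)  # back off before retrying
```

Wrapping each search-page request in something like this would turn the intermittent 404s into a short delay instead of a crash.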
Hello,
I am using your script for school, and I noticed that when running it as a library for streaming it sometimes (about once per minute) raises an error (ERROR:root:json parse error: No JSON object could be decoded). Can I ask what to do to make it work properly?
Thanks a lot.
Filip Hadac
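One likely cause (an assumption, not confirmed from the report) is the blank keep-alive newlines Twitter's streaming API sends, which are not JSON. A sketch of tolerant line parsing:

```python
import json

def parse_stream_lines(lines):
    """Yield parsed tweets, skipping blank keep-alive lines and any
    malformed fragments instead of raising a JSON parse error."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # keep-alive newline, not JSON
        try:
            yield json.loads(line)
        except ValueError:
            continue  # malformed fragment; log and skip in real code
```

With this filtering in place the once-a-minute parse errors should disappear rather than being logged for every keep-alive.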
Instead of trying to load different config modules for different versions twarc should use six.
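For context, this is the per-version import dance that six.moves would replace (shown here with the config-parser module, whose name changed between Python 2 and 3):

```python
# Manual version handling, which six.moves makes unnecessary:
try:
    import configparser                       # Python 3 name
except ImportError:
    import ConfigParser as configparser       # Python 2 name

# With six installed, the equivalent single import would be:
#   from six.moves import configparser
```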
It would be useful if twarc could be used in hydrate mode. Since Twitter's ToS frowns on sharing the bulk JSON, people tend to share tweet ids that need to be hydrated by going back to the Twitter API.
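Hydration would mean POSTing ids to statuses/lookup, which accepts up to 100 ids per request; the core of it is just batching (a sketch, not twarc's implementation):

```python
def batches(ids, size=100):
    """Split a list of tweet ids into chunks sized for statuses/lookup,
    which accepts at most 100 ids per request."""
    for i in range(0, len(ids), size):
        yield ids[i:i + size]
```

Each batch would then be sent to the API and the returned tweets written out in order.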
It would be handy if twarc had a notion of profiles, similar to amazon's aws-cli. This way the various keys could be saved in a file, and there could be a default.
When started without any keys, twarc could prompt for the consumer keys, give the user a URL to visit in their browser to get the access keys, and then save all the keys in their default profile ($HOME/.twarcrc).
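Reading such a profile file could be a thin wrapper around the standard config parser (the $HOME/.twarcrc layout and key names here are assumptions; the Python 3 module name is shown):

```python
import configparser

def load_profile(path, profile="default"):
    """Load the keys stored under one named profile in an
    aws-cli-style INI file such as $HOME/.twarcrc."""
    parser = configparser.ConfigParser()
    parser.read(path)
    return dict(parser[profile])
```

Multiple sections in the same file would then give the aws-cli-style named profiles, with [default] used when none is specified.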
It stops after a limited number of results, no matter whether --scrape is used or not.
The debug logs do not contain any error. twarc.py exits saying something like this:
2014-09-05 11:44:09,695 INFO no new tweets with id < 505020989466771457
But compared with previous results it misses a lot of results.
How can I debug this more deeply?
--query has the ability to stop where it left off. However, it was not working for me. In testing, this was not a problem when the filename had only alphanumeric characters, but when I used a query that included punctuation, in this case a hash, the functionality would not work.
This is because the filenames are based on the string that is sent to Twitter, and that string has the quote() function applied to it first. So if the query q = "#somekeyword", then the query sent to Twitter is "%23somekeyword".
However, in the last_archive function, when --query looks up previous filenames for a stop id, it matches the filename against q before the quote() function has been applied. It's matching q, which is "#somekeyword", against a file that begins with "%23somekeyword".
I'm not sure this is the best solution, or if it is, the best way to implement it, but in the last_archive function, matching quote(q, safe='') against the filename instead of just q on line 218 fixed the problem.
see below:
import os
import re
from urllib import quote

def last_archive(q):
    other_archive_files = []
    for filename in os.listdir("."):
        if re.match(r"^%s-\d+\.json$" % quote(q, safe=''), filename):
            other_archive_files.append(filename)
    other_archive_files.sort()
    while len(other_archive_files) != 0:
        f = other_archive_files.pop()
        if os.path.getsize(f) > 0:
            return f
    return None
I think it would simplify the code quite a bit if twarc simply wrote tweets to stdout and let the user decide which file they should go to.
When run repeatedly twarc tries to determine the since_id to use when talking to the Twitter API based on data that has already been archived. But this functionality is dependent on twarc being run in the same directory as the other archive files, and the filenames matching a particular pattern (which can get ugly). The determination of the since_id isn't working properly with files created with --stream since they are ordered differently.
I propose this logic is removed and we add a --min_id option to match --max_id. The user can then control what they want to do, and where the data goes.
What would be the best way to archive a single user's tweets?
One way is to search for their @ username, and then filter by creator, but is there a cleaner way?
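For reference, the REST API's statuses/user_timeline endpoint returns a single user's tweets directly (up to roughly the 3,200 most recent), which avoids search-and-filter entirely. A request sketch with auth omitted (the helper function is hypothetical):

```python
def user_timeline_request(screen_name, max_id=None, count=200):
    """Build the URL and parameters for one statuses/user_timeline page.

    Paging backwards works like search: pass the lowest id seen so far
    as max_id on the next request.
    """
    params = {"screen_name": screen_name, "count": count}
    if max_id is not None:
        params["max_id"] = max_id
    url = "https://api.twitter.com/1.1/statuses/user_timeline.json"
    return url, params
```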
19:32 <anarchivist> edsu: so, how often should i run twarc? if there are no
new tweets, it writes a 0 byte file, and does a new fetch
of everything the next time it gets called.
It would be nice if it were possible to have twarc read a stream of ids from stdin rather than requiring them to be in an uncompressed file:
zcat ids.txt.gz | twarc.py --hydrate | gzip - > tweets.json.gz
This would make it similar to the various scripts in utils.
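The reading side of this is small; a sketch (read_ids is a hypothetical helper, with the stream made injectable so it isn't tied to stdin):

```python
import sys

def read_ids(stream=None):
    """Yield one tweet id per line from stdin (or any file object),
    skipping blank lines, so compressed files can simply be piped in:
        zcat ids.txt.gz | twarc.py --hydrate
    """
    stream = stream if stream is not None else sys.stdin
    for line in stream:
        line = line.strip()
        if line:
            yield line
```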
On the first start of twarc --scrape, after fetching for a while, it stops with this error while looping over ids to fetch:
2014-10-07 16:35:49,958 ERROR unable to fetch https://api.twitter.com/1.1/statuses/show.json?id=519514618390392832 - too many tries!
but not at the next relaunch. If you drop all the JSON and run the same query again, it repeats all the failures.
It looks like it is trying to fetch a missing id.
During high volume events Twitter's streaming API can send warnings when you are falling behind. It would be useful to ask for these, log them, and act accordingly. It might be important to decouple json parsing from writing the data so that things move faster. Also, maybe we could accept gzip from the API?
We also need to make sure that we guard against stalls, which are periods of more than 90 seconds in which no data is received.
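A stall guard could be as simple as tracking the time of the last received chunk (StallGuard is a hypothetical sketch; the 90-second limit is the one mentioned above):

```python
import time

class StallGuard:
    """Flag a streaming connection as stalled when no data has
    arrived for longer than the limit (Twitter suggests 90 seconds)."""

    def __init__(self, limit=90, clock=time.monotonic):
        self.limit = limit
        self.clock = clock          # injectable for testing
        self.last = clock()

    def received(self):
        """Record that a chunk of data just arrived."""
        self.last = self.clock()

    def stalled(self):
        return self.clock() - self.last > self.limit
```

The read loop would call received() on every chunk and reconnect whenever stalled() becomes true.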
I'm getting this error while trying to remove RTs from a json file:
File "twarc/utils/noretweets.py", line 17
if not 'retweeted_status' in tweet
^
SyntaxError: invalid syntax
I had been using it for a couple of weeks, but after a git pull a few minutes ago it broke.
Cleaning, upgrading, reinstalling the requirements, and uninstalling/reinstalling using ENVs or config.py doesn't solve anything.
I get this every time:
Traceback (most recent call last):
File "./twarc.py", line 256, in <module>
archive(args.query, search(args.query, since_id=since_id, max_id=args.max_id, scrape=args.scrape))
File "./twarc.py", line 180, in archive
for status in statuses:
File "./twarc.py", line 113, in search
results, max_id = search_result(q, since_id, max_id)
File "./twarc.py", line 127, in search_result
client = TwitterClient()
File "./twarc.py", line 37, in __init__
self.ping()
File "./twarc.py", line 100, in ping
self.reset = int(response["x-rate-limit-reset"])
KeyError: 'x-rate-limit-reset'
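One defensive fix (a sketch, not twarc's actual code) is to stop assuming the rate-limit headers are always present and fall back to a short wait when they are missing:

```python
import time

def rate_limit_reset(headers, default_wait=60):
    """Read x-rate-limit-reset without raising KeyError.

    Some responses arrive without the rate-limit headers; in that case
    assume the window resets default_wait seconds from now.
    """
    value = headers.get("x-rate-limit-reset")
    if value is None:
        return int(time.time()) + default_wait  # assumed fallback
    return int(value)
```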
It appears that --scrape isn't grabbing everything that is available on the search screen. For example, I'm trying to grab everything from this query, and this is what I am receiving. Log is here. It looks like it is not grabbing anything before March.
I'm more than happy to try and hack on this. Just need some advice where to start 😉 Any thoughts?
Hello Team,
Can I achieve this? I want to get all the followers of any account and store them in a file.
Please help!
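The v1.1 API exposes this through the followers/ids endpoint, which is paged with cursors. A cursoring sketch (fetch_page is a hypothetical stand-in for the authenticated HTTP call):

```python
def all_follower_ids(fetch_page):
    """Walk followers/ids cursor pages until the API returns cursor 0.

    fetch_page(cursor) is expected to return a dict shaped like the
    API response: {"ids": [...], "next_cursor": n}.
    """
    cursor = -1  # -1 requests the first page
    while cursor != 0:
        page = fetch_page(cursor)
        for follower_id in page["ids"]:
            yield follower_id
        cursor = page["next_cursor"]
```

Each yielded id could then be written to a file, one per line, or hydrated into full user objects with users/lookup.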
It's been a few months since I used twarc, and upon returning to it I see that the feature of "picking up where you left off" has been moved to archive.py. If I'm understanding the reasoning, moving it elsewhere leaves room for people to send their data where they want.
That makes sense, but I cannot think of an instance in which I would not want this feature. If the program shuts down for any reason, you will need it. Why would I use the --search option, given that all the features of search are contained in archive.py with the added benefit of being able to start where I left off should the program fail? It should be noted that I do not think of all things, so reasons may exist.
My limited experience aside, the only other places I think one would want to write data are a database or a program that does something with the data first and then passes it on. While doing so might be preferable to a single JSON file, what I love about twarc is its simplicity: it writes to a JSON file, and you can manage the JSON file afterwards. Using twarc in any other capacity would take enough rewriting on the part of the user that it's not worth losing this as a default feature. I think it would be preferable to have a flag that sets the output to a different location, rather than assuming you want it to go somewhere else by default.
Hi Ed,
So I got the consumer secret etc. from https://dev.twitter.com/apps/new correctly set up. I assumed that I just put these into twarc.py at lines 33-36, but when I run it, I still get the error message from lines 39-41. So, not being overly familiar with Python, can you talk me through putting these into my environment? Both Mac and PC, please, as I'd like to get my students exploring the possibilities and I have to be able to talk them through both.
Thanks!
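On Mac (and Linux) the keys go into environment variables with export; on a Windows command prompt the equivalent is set. The exact variable names twarc reads depend on the version, so treat the names below as assumptions:

```shell
# Mac/Linux: run in the terminal (or add to ~/.bash_profile),
# replacing the placeholder values with your real keys.
# On Windows use:  set CONSUMER_KEY=...  instead of export.
export CONSUMER_KEY="replace-me"
export CONSUMER_SECRET="replace-me"
export ACCESS_TOKEN="replace-me"
export ACCESS_TOKEN_SECRET="replace-me"
```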
It should be possible to run twarc in stream mode:
https://dev.twitter.com/streaming/reference/post/statuses/filter
Twitter's ToS pretty explicitly forbid what the --scrape option does. I'm removing this functionality so we play nicely:
access or search or attempt to access or search the Services by any means (automated or otherwise) other than through our currently available, published interfaces that are provided by Twitter (and only pursuant to those terms and conditions), unless you have been specifically allowed to do so in a separate agreement with Twitter (NOTE: crawling the Services is permissible if done in accordance with the provisions of the robots.txt file, however, scraping the Services without the prior consent of Twitter is expressly prohibited);
Plus it never seemed to work terribly well, and complicated the code a bit.
I don't know if the API has changed, but it seems that when twarc reaches the end of the results it repeats lookups for the last tweet it found. So you end up seeing something like this in your log:
2015-04-14 21:40:28,978 INFO archived 585331966372872192
2015-04-14 21:40:29,271 INFO archived 585331966372872192
2015-04-14 21:40:29,729 INFO archived 585331966372872192
2015-04-14 21:40:30,066 INFO archived 585331966372872192
2015-04-14 21:40:30,363 INFO archived 585331966372872192
2015-04-14 21:40:30,453 INFO archived 585331966372872192
2015-04-14 21:40:30,543 INFO archived 585331966372872192
2015-04-14 21:40:30,628 INFO archived 585331966372872192
2015-04-14 21:40:30,767 INFO archived 585331966372872192
2015-04-14 21:40:30,886 INFO archived 585331966372872192
2015-04-14 21:40:30,985 INFO archived 585331966372872192
2015-04-14 21:40:31,057 INFO archived 585331966372872192
2015-04-14 21:40:31,148 INFO archived 585331966372872192
2015-04-14 21:40:31,239 INFO archived 585331966372872192
2015-04-14 21:40:31,391 INFO archived 585331966372872192
2015-04-14 21:40:31,545 INFO archived 585331966372872192
It has been happening for the last 24 hours. Am I getting a new return code from the Twitter API?
twarc.py "#AnyHashtagsWithLotsOfTweets"
Traceback (most recent call last):
File "/usr/local/bin/twarc.py", line 5, in <module>
pkg_resources.run_script('twarc==0.0.7', 'twarc.py')
File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 499, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 1235, in run_script
execfile(script_filename, namespace, namespace)
File "/usr/local/lib/python2.7/dist-packages/twarc-0.0.7-py2.7.egg/EGG-INFO/scripts/twarc.py", line 310, in <module>
archive(args.query, tweets)
File "/usr/local/lib/python2.7/dist-packages/twarc-0.0.7-py2.7.egg/EGG-INFO/scripts/twarc.py", line 214, in archive
for status in statuses:
File "/usr/local/lib/python2.7/dist-packages/twarc-0.0.7-py2.7.egg/EGG-INFO/scripts/twarc.py", line 128, in search
results, max_id = search_result(q, since_id, max_id)
File "/usr/local/lib/python2.7/dist-packages/twarc-0.0.7-py2.7.egg/EGG-INFO/scripts/twarc.py", line 161, in search_result
client = TwitterClient()
File "/usr/local/lib/python2.7/dist-packages/twarc-0.0.7-py2.7.egg/EGG-INFO/scripts/twarc.py", line 43, in __init__
self.ping()
File "/usr/local/lib/python2.7/dist-packages/twarc-0.0.7-py2.7.egg/EGG-INFO/scripts/twarc.py", line 115, in ping
self.reset = int(response.headers["x-rate-limit-reset"])
File "/usr/local/lib/python2.7/dist-packages/requests/structures.py", line 54, in __getitem__
return self._store[key.lower()][1]
KeyError: 'x-rate-limit-reset'
Not sure whether it's related to #55, because I'm hydrating a previous twarc session while analyzing series with lots of spam. I started with 28,141 tweet ids and obtained 27,641 tweets. I lost 500, but I found only 5 dropped connections. Looking at the numbers, it seems that on an error the entire batch of 100 is lost.
2015-02-20 12:48:18,320 INFO Starting new HTTPS connection (1): api.twitter.com
2015-02-20 12:48:32,347 INFO Resetting dropped connection: api.twitter.com
2015-02-20 12:48:47,241 INFO Resetting dropped connection: api.twitter.com
2015-02-20 12:49:02,603 INFO Resetting dropped connection: api.twitter.com
2015-02-20 13:03:33,392 INFO Resetting dropped connection: api.twitter.com
2015-02-20 13:03:48,445 INFO Resetting dropped connection: api.twitter.com
I repeated the hydration a few hours later and got the same 5 dropped connections at the same points as the previous run (I checked the previous and next tweet ids too).
Could it be some kind of new API moderation?
My use case involves relatively small sets of tweets (~10k), from which I want to extract data to feed into various D3 visualizations: timelines, graphs, etc. I'll therefore be putting in a couple of PRs shortly (I hope), but some of this will stretch my python skills to the limit or beyond. I've got some work under way for parts of it but I don't lay claim to any of it.
Comments and suggestions (and code!) welcome.
Saw this on a long running stream process after millions of tweets were archived:
(twarc)ubuntu@ip-10-39-110-115:/mnt/iran/data$ twarc.py --stream "Iran,Иран,Իրան,ﺈﻳﺭﺎﻧ,איראן,İran,ईरान,ইরান,Эрон,อิ อิหร่าน,इरान,イ ,이란,Іран" | gzip - > stream5.json.gz
Traceback (most recent call last):
File "/home/ubuntu/.virtualenvs/twarc/bin/twarc.py", line 8, in <module>
execfile(__file__)
File "/home/ubuntu/twarc/twarc.py", line 228, in <module>
main()
File "/home/ubuntu/twarc/twarc.py", line 77, in main
for tweet in tweets:
File "/home/ubuntu/twarc/twarc.py", line 181, in stream
for line in resp.iter_lines(chunk_size=512):
File "/home/ubuntu/.virtualenvs/twarc/local/lib/python2.7/site-packages/requests/models.py", line 663, in iter_lines
for chunk in self.iter_content(chunk_size=chunk_size, decode_unicode=decode_unicode):
File "/home/ubuntu/.virtualenvs/twarc/local/lib/python2.7/site-packages/requests/models.py", line 630, in generate
raise ChunkedEncodingError(e)
requests.exceptions.ChunkedEncodingError: IncompleteRead(333 bytes read)
We probably need a try/except block to catch things like this when they happen.
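A sketch of wrapping the stream iteration in a reconnect loop (the helper, retry count, and lack of backoff are all assumptions, not twarc's code):

```python
import requests

def stream_forever(open_stream, max_retries=10):
    """Iterate a tweet stream, reconnecting when the connection drops.

    open_stream() should open a fresh connection and return an iterable
    of tweets; a real implementation would also back off between retries.
    """
    for attempt in range(max_retries):
        try:
            for tweet in open_stream():
                yield tweet
            return  # stream ended cleanly
        except requests.exceptions.ChunkedEncodingError:
            continue  # IncompleteRead etc.: reopen the connection
```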
It would be useful if twarc, or some similar command line tool, could only emit the tweet ids for a particular query. Apparently this is a popular-ish way for researchers to share twitter data sets without worrying about the ability to republish twitter data.
Since --query was added in f6ccea9 the examples in the README need updating.
When the volume on the Twitter filter stream gets too high they cannot deliver all tweets. We should log this information when Twitter provides it.
It would be nice if archive.py reported a friendly error message when not all arguments are supplied, instead of this:
Natalies-MacBook-Air:Library nataliebaur$ utils/archive.py
-bash: utils/archive.py: No such file or directory
Natalies-MacBook-Air:Library nataliebaur$ utils/archive.py semanaNT /mnt/semanaNT/semanaNT
-bash: utils/archive.py: No such file or directory
Natalies-MacBook-Air:Library nataliebaur$
I'm testing the current release with a new hashtag, used on Twitter only yesterday. I get different quantities for the same query when repeating it every 30 minutes, both with and without --scrape.
With summarize.py I verified that the timing differs as well.
The ping method in twarc.py works for search but appears not to work during hydration. I think the search/tweets rate limits are different from the statuses/lookup ones. This results in hydrate returning tweets that are just the string "errors" when the API response is a JSON document like this:
{
"errors": { ... }
}
After running twarc in stream mode for a few days it died with this error:
(twarc)ubuntu@ip-10-39-110-115:/mnt/cuba$ ~/twarc/twarc.py --stream --query "#cuba,#cubapolicy"
Traceback (most recent call last):
File "/home/ubuntu/twarc/twarc.py", line 368, in <module>
File "/home/ubuntu/twarc/twarc.py", line 241, in archive
"""
KeyError: 'id_str'
I have daily runs dumping lots of keywords & hashtags, but all runs querying for simple keywords return only the last 8-12 days of Twitter. I tested it on Debian and OS X too.
In my case it happens with all searches.
In the past months I was able to get both deltas and the entire history. Since a few days ago (and before the last release too), I have no longer been able to get complete deltas (if the batch hadn't run for a while), nor the entire history.
Some of these daily runs are for checking & comparing history, to verify whether the releases or Twitter are working fine.
There are no errors, so at first I again suspected Twitter search API issues. Anyone else?
The avatar images can change, or be deleted. wall.py should really pull them down into an images directory, and adjust the img src as appropriate in the HTML.
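Mapping each profile image URL to a local path is the small piece of this; a sketch (the images/ directory name and filename scheme are assumptions, not wall.py's actual layout):

```python
import os

def local_avatar_path(profile_image_url, images_dir="images"):
    """Return the local path an avatar should be saved to, derived
    from the last segment of its URL.

    wall.py would download the image to this path and rewrite the
    corresponding img src in the generated HTML to point at it.
    """
    filename = profile_image_url.rsplit("/", 1)[-1]
    return os.path.join(images_dir, filename)
```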
It happens with any parameters:
~# ../../twarc/utils/summarize.py noscrape/*
noscrape/%23GTAC2014-20141028000122.json
start: 526141358911021056 [Sat Oct 25 22:40:49 +0000 2014]
end: 526884008718659584 [Mon Oct 27 23:51:51 +0000 2014]
total: 29
noscrape/%23GTAC2014-20141028140453.json
start: 526141358911021056 [Sat Oct 25 22:40:49 +0000 2014]
end: 527095935399366656 [Tue Oct 28 13:53:58 +0000 2014]
total: 36
~# ../../twarc/utils/summarize.py scrape/*
scrape/%23GTAC2014-20141027235924.json
start: 489167543257403392 [Tue Jul 15 22:00:05 +0000 2014]
end: 526884008718659584 [Mon Oct 27 23:51:51 +0000 2014]
total: 95
scrape/%23GTAC2014-20141028140507.json
start: 489167543257403392 [Tue Jul 15 22:00:05 +0000 2014]
end: 527095935399366656 [Tue Oct 28 13:53:58 +0000 2014]
total: 104
I got this for a while on some Debian boxes, and now the same on my clean OS X box. On the first run it takes a while but then stops, as if the requests package weren't installed, even though 119 API attempts still remain. All subsequent runs work fine, but the first failure limits the tweet results.
It happens with and without --scrape. The example has lots of tweets; if the results are limited it seems to work fine.
$ twarc.py --scrape "#moncler #report"
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/2.7/bin/twarc.py", line 283, in <module>
archive(args.query, tweets)
File "/Library/Frameworks/Python.framework/Versions/2.7/bin/twarc.py", line 197, in archive
for status in statuses:
File "/Library/Frameworks/Python.framework/Versions/2.7/bin/twarc.py", line 123, in search
for status in scrape_tweets(q, max_id=max_id):
File "/Library/Frameworks/Python.framework/Versions/2.7/bin/twarc.py", line 210, in scrape_tweets
for tweet_id in scrape_tweet_ids(query, max_id, sleep=1):
File "/Library/Frameworks/Python.framework/Versions/2.7/bin/twarc.py", line 233, in scrape_tweet_ids
r = requests.get(url, params=q)
NameError: global name 'requests' is not defined
The example is getting tweets, so you can relaunch it and it will work. But if you drop all the *.json files and start again, it fails in the same way.
I am unable to get utils/archive.py to work anymore. I can replicate this on two machines and will continue checking to make sure it isn't pilot error. In the meantime:
Run twarc.py --search yoursearch > tweets.json
Then utils/archive.py yoursearch /path/to/save
produces (in my case):
Traceback (most recent call last):
File "twarc/utils/archive.py", line 110, in <module>
main()
File "twarc/utils/archive.py", line 62, in main
t = twarc.Twarc()
TypeError: __init__() takes exactly 5 arguments (1 given)
Cheers,
./fxk
In the documentation on the front page of the repo, it uses the flag --rehydrate, but I believe it should be --hydrate. --rehydrate does not work for me, and as far as I can tell the flag in the code is --hydrate.
Would it be useful to collect from the sample stream if --stream is used with no argument?
No errors in the logs. With the same hashtag, I frequently get older tweets without --scrape, but fewer results when adding --scrape:
without --scrape
start: 518128367799787520 [Fri Oct 03 20:00:03 +0000 2014]
end: 521062912107216897 [Sat Oct 11 22:20:53 +0000 2014]
total: 5540
with --scrape
start: 520990845244563457 [Sat Oct 11 17:34:31 +0000 2014]
end: 521058075718189056 [Sat Oct 11 22:19:40 +0000 2014]
total: 5709
Am I missing how the "sync points" differ?
I confirmed a collection's validity with validate.py prior to running unshorten.py. After running unshorten.py on a collection and checking its validity with validate.py, I get lots of errors.
Sample:
uhoh, we got a problem on line: 9376769
No JSON object could be decoded
uhoh, we got a problem on line: 9413314
No JSON object could be decoded
uhoh, we got a problem on line: 9457029
No JSON object could be decoded
uhoh, we got a problem on line: 9470191
No JSON object could be decoded
uhoh, we got a problem on line: 9474397
No JSON object could be decoded
uhoh, we got a problem on line: 9500591
No JSON object could be decoded
uhoh, we got a problem on line: 9506738
No JSON object could be decoded
uhoh, we got a problem on line: 9517267
No JSON object could be decoded
uhoh, we got a problem on line: 9545542
No JSON object could be decoded
uhoh, we got a problem on line: 9567288
No JSON object could be decoded
uhoh, we got a problem on line: 9632298
No JSON object could be decoded
uhoh, we got a problem on line: 9676049
No JSON object could be decoded
uhoh, we got a problem on line: 9689651
No JSON object could be decoded
uhoh, we got a problem on line: 9761634
No JSON object could be decoded
uhoh, we got a problem on line: 9773360
No JSON object could be decoded
uhoh, we got a problem on line: 9943500
No JSON object could be decoded
uhoh, we got a problem on line: 9967734
No JSON object could be decoded
uhoh, we got a problem on line: 10024047
No JSON object could be decoded
uhoh, we got a problem on line: 10063945
No JSON object could be decoded
The invalid JSON then prevents me from getting a list of the top urls in a collection with urls.py, because there are many invalid JSON objects in the file.
$ cat JeSuisCharlie-JeSuisAhmed-JeSuisJuif-CharlieHebdo-tweets-unshortened-urls-20150129.json | ~/git/twarc/utils/urls.py | sort | uniq -c | sort -n > JeSuisCharlie-JeSuisAhmed-JeSuisJuif-CharlieHebdo-urls-20150129.txt
Traceback (most recent call last):
File "/home/nruest/git/twarc/utils/urls.py", line 11, in <module>
tweet = json.loads(line)
File "/usr/lib/python2.7/json/__init__.py", line 338, in loads
return _default_decoder.decode(s)
File "/usr/lib/python2.7/json/decoder.py", line 366, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python2.7/json/decoder.py", line 384, in raw_decode
raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
Is it trivial to have unshorten.py write back valid JSON?
For a few weeks now, archive.py and twarc --hydrate have very often failed unnoticed when launched with &. They don't write any traceback, even when forwarding output to files. But launching either interactively, I get this:
Traceback (most recent call last):
File "/usr/local/bin/twarc.py", line 4, in <module>
__import__('pkg_resources').run_script('twarc==0.3.0', 'twarc.py')
File "/usr/local/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 729, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/usr/local/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 1649, in run_script
exec(script_code, namespace, namespace)
File "/usr/local/lib/python2.7/dist-packages/twarc-0.3.0-py2.7.egg/EGG-INFO/scripts/twarc.py", line 335, in <module>
File "/usr/local/lib/python2.7/dist-packages/twarc-0.3.0-py2.7.egg/EGG-INFO/scripts/twarc.py", line 109, in main
File "/usr/local/lib/python2.7/dist-packages/twarc-0.3.0-py2.7.egg/EGG-INFO/scripts/twarc.py", line 298, in hydrate
File "/usr/local/lib/python2.7/dist-packages/twarc-0.3.0-py2.7.egg/EGG-INFO/scripts/twarc.py", line 172, in new_f
File "/usr/local/lib/python2.7/dist-packages/twarc-0.3.0-py2.7.egg/EGG-INFO/scripts/twarc.py", line 323, in post
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 507, in post
return self.request('POST', url, data=data, json=json, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 464, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 576, in send
r = adapter.send(request, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/adapters.py", line 370, in send
timeout=timeout
File "/usr/local/lib/python2.7/dist-packages/requests/packages/urllib3/connectionpool.py", line 544, in urlopen
body=body, headers=headers)
File "/usr/local/lib/python2.7/dist-packages/requests/packages/urllib3/connectionpool.py", line 372, in _make_request
httplib_response = conn.getresponse(buffering=True)
File "/usr/lib/python2.7/httplib.py", line 1034, in getresponse
response.begin()
File "/usr/lib/python2.7/httplib.py", line 407, in begin
version, status, reason = self._read_status()
File "/usr/lib/python2.7/httplib.py", line 365, in _read_status
line = self.fp.readline()
File "/usr/lib/python2.7/socket.py", line 447, in readline
data = self._sock.recv(self._rbufsize)
File "/usr/local/lib/python2.7/dist-packages/requests/packages/urllib3/contrib/pyopenssl.py", line 188, in recv
data = self.connection.recv(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/pyOpenSSL-0.14-py2.7.egg/OpenSSL/SSL.py", line 995, in recv
self._raise_ssl_error(self._ssl, result)
File "/usr/local/lib/python2.7/dist-packages/pyOpenSSL-0.14-py2.7.egg/OpenSSL/SSL.py", line 862, in _raise_ssl_error
raise SysCallError(errno, errorcode[errno])
OpenSSL.SSL.SysCallError: (104, 'ECONNRESET')
The systems involved have no I/O or network issues; they are all fully updated Debian & OS X boxes, and twarc is the current v0.3.0. How can I help find a solution?
Should you be able to pass the twitter keys on the command line, and to the Twarc constructor?
Either push load_config functionality down into Twarc constructor, or make it into a function that can easily be called from elsewhere. This way tests and other utilities can use it too.
After running twarc for two days I analyzed the output and found that it downloads the same tweets over and over again. The script should hold a set of known tweet ids and only emit tweets that have not been written before.
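The proposed dedup can be sketched as a generator holding the set of seen ids (a sketch of the idea, not twarc's implementation; memory grows with the number of distinct tweets):

```python
def dedupe(tweets):
    """Yield each tweet only the first time its id_str is seen,
    so repeated downloads are not written to the archive again."""
    seen = set()
    for tweet in tweets:
        if tweet["id_str"] not in seen:
            seen.add(tweet["id_str"])
            yield tweet
```

For very long runs the set could be swapped for something bounded, or persisted, but the in-memory version captures the behavior being requested.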