docnow / twarc

A command line tool (and Python library) for archiving Twitter JSON

Home Page: https://twarc-project.readthedocs.io

License: MIT License

Python 100.00%

twarc's Introduction

docnow

The web is a big and rapidly changing place, so it can be challenging to discover what resources related to a particular event or topic are in need of archiving. Appraisal is an umbrella term for the many processes by which archivists identify records of enduring value for preservation in an archive. DocNow is an appraisal tool for the social web that uses Twitter.

DocNow allows archivists to tap into conversations on Twitter to help them discover which web resources to collect and preserve. It also connects archivists with content creators in order to make the process of archiving web content more collaborative and consentful. The purpose of DocNow is to help ensure ethical practices in web archiving by building conversations between archivists and the communities they are documenting.

The DocNow application has been developed with generous support from the Mellon Foundation.

Architecture

This repository houses the complete DocNow application, which is composed of a few components:

  • a client side application (React)
  • a server side REST API (Node)
  • a database (PostgreSQL)
  • a messaging queue database (Redis)

Production

If you are running DocNow in production you will want to check out docnow-ansible which allows you to provision and configure DocNow in the cloud.

Development

The main branch of this repository represents the latest tested features of the DocNow application following the trunk based development model. Tagged version releases can be used for production deployments. Development usually happens on short lived branches which are merged into main once they have been reviewed and approved. If you'd like to contribute to the DocNow project please fork this repository, create a branch for your feature or bug fix, and then send a pull request to have it reviewed and merged.

To set up DocNow locally on your workstation you will need to install Git and Docker. Once you've got them installed open a terminal window and follow these instructions:

  1. git clone https://github.com/docnow/docnow
  2. cd docnow
  3. cp .env.dev .env
  4. docker-compose build --no-cache
  5. docker-compose up
  6. make some ☕️
  7. open http://localhost:3000

If you run into an error above and want to clean out all your docker containers and images you can run this:

  1. sh clean-up.sh

Testing

The test suite runs automatically via a GitHub Action. If you want to run the tests yourself you will need to:

cp .env.test-sample .env.test

Replace the CHANGE_ME values in .env.test with your Twitter API credentials. Then run the tests.

npm run test

Do not commit .env.test to git since it contains your Twitter API keys!

twarc's People

Contributors

betsybookwyrm, chosak, cosmicoptima, dchud, decapstrike, dshahrokhian, edsu, eggplants, hauselin, hugovk, ianmilligan1, igorbrigadir, kerchner, leezu, lxcode, mariosanz92, melaniewalsh, milesmcc, mirkolenz, nwoodward, pbinkley, py3in, quiet27, recrm, rongpenl, ruebot, samhames, segerberg, steko, tinafigueroa

twarc's Issues

load_config

Either push load_config functionality down into Twarc constructor, or make it into a function that can easily be called from elsewhere. This way tests and other utilities can use it too.
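
A minimal sketch of what a standalone load_config might look like, assuming the keys live in an ini-style ~/.twarc file; the file layout and helper name here are illustrative, not twarc's actual API:

import os
try:
    from configparser import ConfigParser  # Python 3
except ImportError:
    from ConfigParser import ConfigParser  # Python 2

def load_config(path=os.path.join(os.path.expanduser("~"), ".twarc")):
    """Return a dict of Twitter API credentials read from an ini-style file."""
    parser = ConfigParser()
    parser.read(path)
    keys = ("consumer_key", "consumer_secret",
            "access_token", "access_token_secret")
    return dict((k, parser.get("main", k))
                for k in keys if parser.has_option("main", k))

# both the Twarc constructor and the tests could then share it:
# t = Twarc(**load_config())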

Archive & Hydrate failures getting OpenSSL.SSL.SysCallError: (104, 'ECONNRESET') by Twitter

For a few weeks now, archive.py and twarc --hydrate have frequently been failing unnoticed when launched in the background with &. No traceback is written and no output is forwarded to the files, but when I launch either of them interactively I get this:

Traceback (most recent call last):
  File "/usr/local/bin/twarc.py", line 4, in <module>
    __import__('pkg_resources').run_script('twarc==0.3.0', 'twarc.py')
  File "/usr/local/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 729, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/local/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 1649, in run_script
    exec(script_code, namespace, namespace)
  File "/usr/local/lib/python2.7/dist-packages/twarc-0.3.0-py2.7.egg/EGG-INFO/scripts/twarc.py", line 335, in <module>

  File "/usr/local/lib/python2.7/dist-packages/twarc-0.3.0-py2.7.egg/EGG-INFO/scripts/twarc.py", line 109, in main

  File "/usr/local/lib/python2.7/dist-packages/twarc-0.3.0-py2.7.egg/EGG-INFO/scripts/twarc.py", line 298, in hydrate

  File "/usr/local/lib/python2.7/dist-packages/twarc-0.3.0-py2.7.egg/EGG-INFO/scripts/twarc.py", line 172, in new_f

  File "/usr/local/lib/python2.7/dist-packages/twarc-0.3.0-py2.7.egg/EGG-INFO/scripts/twarc.py", line 323, in post

  File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 507, in post
    return self.request('POST', url, data=data, json=json, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 464, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 576, in send
    r = adapter.send(request, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/adapters.py", line 370, in send
    timeout=timeout
  File "/usr/local/lib/python2.7/dist-packages/requests/packages/urllib3/connectionpool.py", line 544, in urlopen
    body=body, headers=headers)
  File "/usr/local/lib/python2.7/dist-packages/requests/packages/urllib3/connectionpool.py", line 372, in _make_request
    httplib_response = conn.getresponse(buffering=True)
  File "/usr/lib/python2.7/httplib.py", line 1034, in getresponse
    response.begin()
  File "/usr/lib/python2.7/httplib.py", line 407, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python2.7/httplib.py", line 365, in _read_status
    line = self.fp.readline()
  File "/usr/lib/python2.7/socket.py", line 447, in readline
    data = self._sock.recv(self._rbufsize)
  File "/usr/local/lib/python2.7/dist-packages/requests/packages/urllib3/contrib/pyopenssl.py", line 188, in recv
    data = self.connection.recv(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pyOpenSSL-0.14-py2.7.egg/OpenSSL/SSL.py", line 995, in recv
    self._raise_ssl_error(self._ssl, result)
  File "/usr/local/lib/python2.7/dist-packages/pyOpenSSL-0.14-py2.7.egg/OpenSSL/SSL.py", line 862, in _raise_ssl_error
    raise SysCallError(errno, errorcode[errno])
OpenSSL.SSL.SysCallError: (104, 'ECONNRESET')

The systems involved have no I/O or network issues, the Debian and OS X machines are all fully updated, and twarc is the current v0.3.0. How can I help find a solution?

unable to fetch more than 100 tweets per run

It stops after a limited number of results, whether or not --scrape is used.
The debug logs don't contain any errors. twarc.py exits saying something like this:

2014-09-05 11:44:09,695 INFO no new tweets with id < 505020989466771457

But compared with previous runs, a lot of results are missing.
How can I debug this more deeply?

Syntax Error During Installation (Python3)

I'm wondering if you can help me. I'm encountering a syntax error while attempting to install. I'm running Python 3.4 on Mac OSX 10.7.5.

Any help would be appreciated. Thanks!

Here is the log from the terminal:

vpn166047:~ tristandahn$ pip install twarc
Downloading/unpacking twarc
Downloading twarc-0.1.2.tar.gz
Running setup.py (path:/private/var/folders/v2/vv94gk4n6ns3s3ywf8bjvmw40000gn/T/pip_build_tristandahn/twarc/setup.py) egg_info for package twarc

Downloading/unpacking oauth2 (from twarc)
Downloading oauth2-1.5.211.tar.gz
Running setup.py (path:/private/var/folders/v2/vv94gk4n6ns3s3ywf8bjvmw40000gn/T/pip_build_tristandahn/oauth2/setup.py) egg_info for package oauth2
Traceback (most recent call last):
File "", line 17, in
File "/private/var/folders/v2/vv94gk4n6ns3s3ywf8bjvmw40000gn/T/pip_build_tristandahn/oauth2/setup.py", line 18
print "unable to find version in %s" % (VERSIONFILE,)
^
SyntaxError: invalid syntax
Complete output from command python setup.py egg_info:
Traceback (most recent call last):

File "", line 17, in

File "/private/var/folders/v2/vv94gk4n6ns3s3ywf8bjvmw40000gn/T/pip_build_tristandahn/oauth2/setup.py", line 18

print "unable to find version in %s" % (VERSIONFILE,)

                                   ^

SyntaxError: invalid syntax


Cleaning up...
Command python setup.py egg_info failed with error code 1 in /private/var/folders/v2/vv94gk4n6ns3s3ywf8bjvmw40000gn/T/pip_build_tristandahn/oauth2
Storing debug log for failure in /Users/tristandahn/.pip/pip.log

syntax error noretweets.py

I'm getting this error while trying to remove RTs from a json file:

 File "twarc/utils/noretweets.py", line 17
    if not 'retweeted_status' in tweet
                                     ^
SyntaxError: invalid syntax

scrape mode should not duplicate tweets

After running twarc for two days I analyzed the output and found that it downloads the same tweets over and over again. The script should hold a set of known tweet ids and only emit tweets that have not been written before.
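
A minimal sketch of the proposed de-duplication, reading line-oriented tweet JSON on stdin and emitting each id only once (assumes the v1.1 id_str field):

import json
import sys

seen = set()
for line in sys.stdin:
    tweet = json.loads(line)
    tweet_id = tweet["id_str"]
    if tweet_id in seen:
        continue  # already written, don't emit it again
    seen.add(tweet_id)
    sys.stdout.write(json.dumps(tweet) + "\n")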

falling behind & stalls during streaming

During high volume events Twitter's streaming API can send warnings when you are falling behind. It would be useful to ask for these, log them, and act accordingly. It might be important to decouple json parsing from writing the data so that things move faster. Also, maybe we could accept gzip from the API?

We also need to make sure that we guard against stalls, which are periods of more than 90 seconds in which no data is received.
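
One possible stall guard, sketched here under the assumption that the stream is read with requests: a 90 second read timeout makes the client notice when no bytes (not even keep-alive newlines) arrive, so it can reconnect.

import logging
import requests

def stream_with_stall_guard(url, auth, params):
    """Yield raw lines from a streaming endpoint, reconnecting on stalls."""
    while True:
        try:
            resp = requests.get(url, auth=auth, params=params,
                                stream=True, timeout=90)
            for line in resp.iter_lines():
                if line:  # skip the keep-alive blank lines
                    yield line
        except requests.exceptions.Timeout:
            logging.warning("no data for 90 seconds, reconnecting")
        except requests.exceptions.ConnectionError as e:
            logging.warning("connection error (%s), reconnecting", e)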

secrets and keys

Hi Ed,

So I got consumer secret etc from https://dev.twitter.com/apps/new correctly set up. I assumed that I just put these into twarc.py at lines 33-36, but when I run it, I still get the error message from lines 39-41. So, not being overly familiar with python, can you talk me through putting these into my environment? Mac, PC, as I'd like to get my students exploring the possibilities and I have to be able to talk them through both.

Thanks!

What if after running for a while I get "KeyError: 'x-rate-limit-reset'"

It has been happening for the last 24 hours; did the Twitter API start returning a new response code?

twarc.py "#AnyHashtagsWithLotsOfTweets"
Traceback (most recent call last):
  File "/usr/local/bin/twarc.py", line 5, in <module>
    pkg_resources.run_script('twarc==0.0.7', 'twarc.py')
  File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 499, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 1235, in run_script
    execfile(script_filename, namespace, namespace)
  File "/usr/local/lib/python2.7/dist-packages/twarc-0.0.7-py2.7.egg/EGG-INFO/scripts/twarc.py", line 310, in <module>
    archive(args.query, tweets)
  File "/usr/local/lib/python2.7/dist-packages/twarc-0.0.7-py2.7.egg/EGG-INFO/scripts/twarc.py", line 214, in archive
    for status in statuses:
  File "/usr/local/lib/python2.7/dist-packages/twarc-0.0.7-py2.7.egg/EGG-INFO/scripts/twarc.py", line 128, in search
    results, max_id = search_result(q, since_id, max_id)
  File "/usr/local/lib/python2.7/dist-packages/twarc-0.0.7-py2.7.egg/EGG-INFO/scripts/twarc.py", line 161, in search_result
    client = TwitterClient()
  File "/usr/local/lib/python2.7/dist-packages/twarc-0.0.7-py2.7.egg/EGG-INFO/scripts/twarc.py", line 43, in __init__
    self.ping()
  File "/usr/local/lib/python2.7/dist-packages/twarc-0.0.7-py2.7.egg/EGG-INFO/scripts/twarc.py", line 115, in ping
    self.reset = int(response.headers["x-rate-limit-reset"])
  File "/usr/local/lib/python2.7/dist-packages/requests/structures.py", line 54, in __getitem__
    return self._store[key.lower()][1]
KeyError: 'x-rate-limit-reset'

archive.py & pip

Some users would like to have the archive.py utility as part of the pip install of twarc.

archive avatar images

The avatar images can change, or be deleted. wall.py should really pull them down into an images directory, and adjust the img src as appropriate in the HTML.
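
A rough sketch of what wall.py could do; the helper name and directory layout are hypothetical:

import os
import requests

def localize_avatar(url, images_dir="images"):
    """Download an avatar and return the local path to use as the img src."""
    if not os.path.isdir(images_dir):
        os.makedirs(images_dir)
    filename = url.rstrip("/").split("/")[-1] or "avatar"
    path = os.path.join(images_dir, filename)
    if not os.path.isfile(path):
        resp = requests.get(url)
        resp.raise_for_status()
        with open(path, "wb") as fh:
            fh.write(resp.content)
    return path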

hydrate

It would be useful if twarc could be used in hydrate mode. Since Twitter's ToS frown on sharing the bulk JSON, people tend to share tweet IDs that need to be hydrated by going back to the Twitter API.
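
A hedged sketch of what hydrate mode could look like against the v1.1 statuses/lookup endpoint, which accepts up to 100 ids per request; authentication is omitted and session is assumed to be an OAuth-signed requests session.

LOOKUP_URL = "https://api.twitter.com/1.1/statuses/lookup.json"

def hydrate(session, tweet_ids):
    """Yield full tweet JSON for an iterable of tweet ids, 100 at a time."""
    batch = []
    for tweet_id in tweet_ids:
        batch.append(str(tweet_id).strip())
        if len(batch) == 100:
            for tweet in _lookup(session, batch):
                yield tweet
            batch = []
    if batch:
        for tweet in _lookup(session, batch):
            yield tweet

def _lookup(session, batch):
    resp = session.post(LOOKUP_URL, data={"id": ",".join(batch)})
    resp.raise_for_status()
    return resp.json()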

sample stream

Would it be useful to collect from the sample stream if --stream is used with no argument?

ping during hydration

The ping method in twarc.py works for search but appears not to work during hydration. I think search/tweets rate limits are different from statuses/lookup. This results in hydrate returning tweets that are just the string "errors" when the API response is a JSON document like this:

{
  "errors": { ... }
}
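
A small sketch of the kind of guard hydrate could use, treating a top-level "errors" document as a signal to back off rather than yielding it as a tweet:

def tweets_or_error(response_json):
    """Return the list of tweets, or None if the API sent an error document."""
    if isinstance(response_json, dict) and "errors" in response_json:
        # caller should sleep until the statuses/lookup rate limit window resets
        return None
    return response_json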

search doesn't complete

I don't know if the API has changed, but it seems like when twarc reaches the end of the results it repeats lookups for the last tweet it found. So you end up seeing something like this in your log:

2015-04-14 21:40:28,978 INFO archived 585331966372872192
2015-04-14 21:40:29,271 INFO archived 585331966372872192
2015-04-14 21:40:29,729 INFO archived 585331966372872192
2015-04-14 21:40:30,066 INFO archived 585331966372872192
2015-04-14 21:40:30,363 INFO archived 585331966372872192
2015-04-14 21:40:30,453 INFO archived 585331966372872192
2015-04-14 21:40:30,543 INFO archived 585331966372872192
2015-04-14 21:40:30,628 INFO archived 585331966372872192
2015-04-14 21:40:30,767 INFO archived 585331966372872192
2015-04-14 21:40:30,886 INFO archived 585331966372872192
2015-04-14 21:40:30,985 INFO archived 585331966372872192
2015-04-14 21:40:31,057 INFO archived 585331966372872192
2015-04-14 21:40:31,148 INFO archived 585331966372872192
2015-04-14 21:40:31,239 INFO archived 585331966372872192
2015-04-14 21:40:31,391 INFO archived 585331966372872192
2015-04-14 21:40:31,545 INFO archived 585331966372872192

Builds often timeout

Travis CI has a ~10 minute timeout for builds, which twarc often hits:

https://travis-ci.org/edsu/twarc/builds

https://travis-ci.org/edsu/twarc/jobs/44435453

============================= test session starts ==============================
platform linux2 -- Python 2.6.9 -- py-1.4.26 -- pytest-2.6.4
collected 6 items 
test.py .....
No output has been received in the last 10 minutes, this potentially indicates a stalled build or something wrong with the build itself.
The build has been terminated

Any ideas what's causing this?

emit twitter ids only

It would be useful if twarc, or some similar command line tool, could only emit the tweet ids for a particular query. Apparently this is a popular-ish way for researchers to share twitter data sets without worrying about the ability to republish twitter data.
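
A minimal sketch of such an ids-only filter over line-oriented tweet JSON (assumes the v1.1 id_str field):

import json
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    print(json.loads(line)["id_str"])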

stream error

Saw this on a long running stream process after millions of tweets were archived:

(twarc)ubuntu@ip-10-39-110-115:/mnt/iran/data$ twarc.py --stream "Iran,Иран,Իրան,ﺈﻳﺭﺎﻧ,איראן,İran,ईरान,ইরান,Эрон,อิ อิหร่าน,इरान,イ ,이란,Іран" | gzip - > stream5.json.gz
Traceback (most recent call last):
  File "/home/ubuntu/.virtualenvs/twarc/bin/twarc.py", line 8, in <module>
    execfile(__file__)
  File "/home/ubuntu/twarc/twarc.py", line 228, in <module>
    main()
  File "/home/ubuntu/twarc/twarc.py", line 77, in main
    for tweet in tweets:
  File "/home/ubuntu/twarc/twarc.py", line 181, in stream
    for line in resp.iter_lines(chunk_size=512):
  File "/home/ubuntu/.virtualenvs/twarc/local/lib/python2.7/site-packages/requests/models.py", line 663, in iter_lines
    for chunk in self.iter_content(chunk_size=chunk_size, decode_unicode=decode_unicode):
  File "/home/ubuntu/.virtualenvs/twarc/local/lib/python2.7/site-packages/requests/models.py", line 630, in generate
    raise ChunkedEncodingError(e)
requests.exceptions.ChunkedEncodingError: IncompleteRead(333 bytes read)

Probably need a try/except block to catch stuff like this when it happens.
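
A hedged sketch of that try/except, reconnecting whenever the chunked response breaks mid-stream; connect is assumed to be a callable that returns a fresh streaming requests Response.

import logging
import requests

def resilient_stream(connect):
    """Iterate over stream lines, reconnecting when the response breaks."""
    while True:
        resp = connect()
        try:
            for line in resp.iter_lines(chunk_size=512):
                if line:
                    yield line
        except (requests.exceptions.ChunkedEncodingError,
                requests.exceptions.ConnectionError) as e:
            logging.error("stream interrupted (%s), reconnecting", e)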

I use cron for it, but it stopped evaluating the last saved IDs

It happens with any parameters:

~# ../../twarc/utils/summarize.py noscrape/*

noscrape/%23GTAC2014-20141028000122.json
  start: 526141358911021056 [Sat Oct 25 22:40:49 +0000 2014]
  end:   526884008718659584 [Mon Oct 27 23:51:51 +0000 2014]
  total: 29

noscrape/%23GTAC2014-20141028140453.json
  start: 526141358911021056 [Sat Oct 25 22:40:49 +0000 2014]
  end:   527095935399366656 [Tue Oct 28 13:53:58 +0000 2014]
  total: 36

~# ../../twarc/utils/summarize.py scrape/*

scrape/%23GTAC2014-20141027235924.json
  start: 489167543257403392 [Tue Jul 15 22:00:05 +0000 2014]
  end:   526884008718659584 [Mon Oct 27 23:51:51 +0000 2014]
  total: 95

scrape/%23GTAC2014-20141028140507.json
  start: 489167543257403392 [Tue Jul 15 22:00:05 +0000 2014]
  end:   527095935399366656 [Tue Oct 28 13:53:58 +0000 2014]
  total: 104

unshorten.py produces invalid json

I confirmed a collection's validity with validate.py prior to running unshorten.py. After running unshorten.py on the collection and checking its validity with validate.py, I get lots of errors.

Sample:

uhoh, we got a problem on line: 9376769
No JSON object could be decoded
uhoh, we got a problem on line: 9413314
No JSON object could be decoded
uhoh, we got a problem on line: 9457029
No JSON object could be decoded
uhoh, we got a problem on line: 9470191
No JSON object could be decoded
uhoh, we got a problem on line: 9474397
No JSON object could be decoded
uhoh, we got a problem on line: 9500591
No JSON object could be decoded
uhoh, we got a problem on line: 9506738
No JSON object could be decoded
uhoh, we got a problem on line: 9517267
No JSON object could be decoded
uhoh, we got a problem on line: 9545542
No JSON object could be decoded
uhoh, we got a problem on line: 9567288
No JSON object could be decoded
uhoh, we got a problem on line: 9632298
No JSON object could be decoded
uhoh, we got a problem on line: 9676049
No JSON object could be decoded
uhoh, we got a problem on line: 9689651
No JSON object could be decoded
uhoh, we got a problem on line: 9761634
No JSON object could be decoded
uhoh, we got a problem on line: 9773360
No JSON object could be decoded
uhoh, we got a problem on line: 9943500
No JSON object could be decoded
uhoh, we got a problem on line: 9967734
No JSON object could be decoded
uhoh, we got a problem on line: 10024047
No JSON object could be decoded
uhoh, we got a problem on line: 10063945
No JSON object could be decoded

The invalid json then prevents me from getting a list of the top urls in a collection with urls.py because there are many invalid json objects in the file.

$ cat JeSuisCharlie-JeSuisAhmed-JeSuisJuif-CharlieHebdo-tweets-unshortened-urls-20150129.json | ~/git/twarc/utils/urls.py | sort | uniq -c | sort -n > JeSuisCharlie-JeSuisAhmed-JeSuisJuif-CharlieHebdo-urls-20150129.txt
Traceback (most recent call last):
  File "/home/nruest/git/twarc/utils/urls.py", line 11, in <module>
    tweet = json.loads(line)
  File "/usr/lib/python2.7/json/__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.7/json/decoder.py", line 384, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded

Is it trivial to have unshorten.py write back valid json?
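
One way to make that robust, sketched here: parse each line, modify it, and re-serialize with json.dumps so every output line is a complete JSON object again (the unshortening step itself is elided).

import json
import sys

for line in sys.stdin:
    try:
        tweet = json.loads(line)
    except ValueError:
        continue  # skip lines that are already corrupt instead of propagating them
    # ... unshorten the urls in tweet["entities"]["urls"] here ...
    sys.stdout.write(json.dumps(tweet) + "\n")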

last git pull broke twarc

I have been using it for a couple of weeks, but after a git pull a few minutes ago it broke.
Cleaning, upgrading, reinstalling the requirements, uninstalling and reinstalling, and using either environment variables or config.py doesn't solve anything.
I get this every time:

Traceback (most recent call last):
  File "./twarc.py", line 256, in <module>
    archive(args.query, search(args.query, since_id=since_id, max_id=args.max_id, scrape=args.scrape))
  File "./twarc.py", line 180, in archive
    for status in statuses:
  File "./twarc.py", line 113, in search
    results, max_id = search_result(q, since_id, max_id)
  File "./twarc.py", line 127, in search_result
    client = TwitterClient()
  File "./twarc.py", line 37, in __init__
    self.ping()
  File "./twarc.py", line 100, in ping
    self.reset = int(response["x-rate-limit-reset"])
KeyError: 'x-rate-limit-reset'

remove archive file naming logic

I think it would simplify the code quite a bit if twarc simply wrote tweets to stdout and let the user decide which file they should go to.

When run repeatedly twarc tries to determine the since_id to use when talking to the Twitter API based on data that has already been archived. But this functionality is dependent on twarc being run in the same directory as the other archive files, and the filenames matching a particular pattern (which can get ugly). The determination of the since_id isn't working properly with files created with --stream since they are ordered differently.

I propose this logic is removed and we add a --min_id option to match --max_id. The user can then control what they want to do, and where the data goes.
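
A hypothetical sketch of the proposed interface; option names follow the proposal above:

import argparse

parser = argparse.ArgumentParser(prog="twarc.py")
parser.add_argument("query")
parser.add_argument("--max_id", type=int, help="only tweets with id <= max_id")
parser.add_argument("--min_id", type=int, help="only tweets with id > min_id")
args = parser.parse_args()
# tweets are simply written to stdout; the user redirects them wherever they like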

return "pick up where you left off" feature back to search

It's been a few months since I used Twarc and upon returning to it, I see that the feature of "picking up where you left off" has been moved to archive.py. If I'm understanding the reasoning behind this, it's that moving it elsewhere leaves room for people to send their data where they want to.

That makes sense, but I cannot think of an instance in which I would not want this feature. If the program shuts down for any reason, you will need it. Why would I use the --search option, given that all the features of search are contained in archive.py with the added benefit of being able to start where I left off should the program fail for some reason? I don't think of everything, though, so reasons may exist.

My limited experience aside, the only other places I think one would want to write data are either a database or a program that does something with the data first and then passes it on. While doing so might be preferable to a single json file, what I love about twarc is its simplicity. It writes to a json file, and you can manage the json file afterwards. Using twarc in any other capacity would take enough re-writing on the part of the user to make it work that it's not worth losing this as a default feature. I think it would be preferable to have a flag that sets the output to a different location than to assume that you want it to go to a different location by default.

use six

Instead of trying to load different config modules for different versions twarc should use six.
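
A sketch of what that could look like with six; the config module names here are illustrative:

import six

if six.PY2:
    from ConfigParser import ConfigParser
else:
    from configparser import ConfigParser

# or equivalently via the moves machinery:
# from six.moves.configparser import ConfigParser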

Archive a single user's timeline?

What would be the best way to archive a single user's tweets?

One way is to search for their @ username, and then filter by creator, but is there a cleaner way?
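
One cleaner option might be the v1.1 statuses/user_timeline endpoint, which returns up to roughly the 3,200 most recent tweets for an account. A hedged sketch, with session assumed to be an OAuth-signed requests session:

TIMELINE_URL = "https://api.twitter.com/1.1/statuses/user_timeline.json"

def user_timeline(session, screen_name):
    """Yield a user's tweets, newest first, paging backwards with max_id."""
    max_id = None
    while True:
        params = {"screen_name": screen_name, "count": 200}
        if max_id:
            params["max_id"] = max_id
        resp = session.get(TIMELINE_URL, params=params)
        resp.raise_for_status()
        tweets = resp.json()
        if not tweets:
            break
        for tweet in tweets:
            yield tweet
        max_id = tweets[-1]["id"] - 1  # step past the oldest tweet already seen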

Incomplete results since recent releases

I have daily runs that dump lots of keywords and hashtags, but all runs querying for simple keywords now return only the last 8-12 days of Twitter. I tested this on Debian and OS X as well.
In my case it happens with all searches.

In past months I was able to get both deltas and the entire history. For the last few days, and before the latest release too, I have no longer been able to get complete deltas (if the batch hadn't run for a while), nor the entire history.

Some of these daily runs exist to check and compare history, to verify whether the releases or Twitter are working fine.
There are no errors, so at first I again suspected Twitter search API issues. Anyone else?

D3-friendly outputs and templates

My use case involves relatively small sets of tweets (~10k), from which I want to extract data to feed into various D3 visualizations: timelines, graphs, etc. I'll therefore be putting in a couple of PRs shortly (I hope), but some of this will stretch my python skills to the limit or beyond. I've got some work under way for parts of it but I don't lay claim to any of it.

  • refactor the force-directed graph code to use a two-step process: a task-specific step to generate a json/csv data output, and a generic step to embed that output in a specified html template so that it's easy to get a quick look at your data. The data outputs would conform to the styles used most commonly in D3 examples, to make it easy to connect a given body of tweets to a given D3 example.
  • (more speculative) refactor some of the current utilities to clarify the distinction between those that filter a tweet file (outputting a tweet file) and those that produce some other output, to make it easier to think in terms of pipelines. Call them filters and analyzers?
  • add a filter to store a new field in each tweet of a tweet file. For one project I have a requirement to work with a local timezone rather than UTC, and it will be convenient to add the local time as a new field for further processing by other filters.
  • add an analyzer to generate counts of co-occurrences of values in arbitrary fields. I've written one that will work specifically on hashtags (how many tweets in this set have #a and #b), which I feed into a D3 force-directed graph, but there's no reason not to make it generic to allow e.g. co-occurrence of mentioned users, or hashtags and mentioned users (a rough sketch follows below this list).
  • some day: a core group of a few D3 visualizations that would be useful for any set of tweets to show (say) the temporal dimensions, histograms of users and hashtags, etc., that could be run easily to get an overview of a harvested body of tweets.

Comments and suggestions (and code!) welcome.
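
A rough sketch of the co-occurrence analyzer mentioned above, counting pairs of hashtags that appear together in each tweet of a line-oriented JSON file (assumes the v1.1 entities.hashtags structure):

from collections import Counter
from itertools import combinations
import json
import sys

pairs = Counter()
for line in sys.stdin:
    tweet = json.loads(line)
    tags = sorted({h["text"].lower() for h in tweet["entities"]["hashtags"]})
    for a, b in combinations(tags, 2):
        pairs[(a, b)] += 1

for (a, b), count in pairs.most_common(20):
    print("%6d  #%s  #%s" % (count, a, b))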

Issue at every first run

I have seen this for a while on some Debian boxes, and now the same thing on my clean OS X box. On the first run it takes some time and then stops, as if the requests package were not installed, even though 119 API attempts still remain. All subsequent runs work fine, but the failed first run limits the tweet results.
It happens with and without --scrape. The example below involves lots of tweets; if the results are limited it seems to work fine.

$ twarc.py --scrape "#moncler #report"

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/2.7/bin/twarc.py", line 283, in <module>
    archive(args.query, tweets)
  File "/Library/Frameworks/Python.framework/Versions/2.7/bin/twarc.py", line 197, in archive
    for status in statuses:
  File "/Library/Frameworks/Python.framework/Versions/2.7/bin/twarc.py", line 123, in search
    for status in scrape_tweets(q, max_id=max_id):
  File "/Library/Frameworks/Python.framework/Versions/2.7/bin/twarc.py", line 210, in scrape_tweets
    for tweet_id in scrape_tweet_ids(query, max_id, sleep=1):
  File "/Library/Frameworks/Python.framework/Versions/2.7/bin/twarc.py", line 233, in scrape_tweet_ids
    r = requests.get(url, params=q)
NameError: global name 'requests' is not defined

The example does retrieve tweets, so you can relaunch it and it will work. But if you drop all the *.json files and start again, it fails in the same way.

Documentation should be --hydrate, not --rehydrate

In the documentation on the front page of the repo, it uses the flag --rehydrate, but I believe it should be --hydrate. --rehydrate does not work for me, and as far as I can tell, the flag in the code is --hydrate.

streaming error

Hello,

I am using your script for school, and I noticed that when running it as a library for streaming it sometimes (about once per minute) raises an error (ERROR:root:json parse error: No JSON object could be decoded - ). Can I ask what to do to make it work properly?

Thanks a lot.

Filip Hadac

--scrape not grabbing everything

It appears that --scrape isn't grabbing everything that is available on the search screen. For example, I'm trying to grab everything from this query, and this is what I am receiving. Log is here. It looks like it is not grabbing anything before March.

I'm more than happy to try and hack on this. Just need some advice where to start 😉 Any thoughts?

log info about dropped tweets

When the volume on the Twitter filter stream gets too high they cannot deliver all tweets. We should log this information when Twitter provides it.
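
A sketch of how those notices could be logged, assuming the filter stream's limit messages look like {"limit": {"track": <undelivered count>}}:

import json
import logging

def handle_stream_message(line):
    """Return a tweet dict, or None after logging a limit notice."""
    msg = json.loads(line)
    if "limit" in msg:
        logging.warning("stream fell behind: %s matching tweets were not delivered",
                        msg["limit"].get("track"))
        return None
    return msg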

--query will not detect previous file if query has non-alpha characters

--query has the ability to stop where it left off. However, it was not working for me. In testing, this was not a problem when the filename only had alphanumeric characters, but when I used a query that included punctuation, in this case a hash, the functionality would not work.

This is because the filenames are based on the string that is sent to Twitter, and that string has had the quote() function applied to it first. So if the query q = "#somekeyword", then the query that is sent to Twitter is "%23somekeyword".

However, in the last_archive function, when --query looks up previous filenames for a stop ID, it is matching the filename with q prior to having the quote() function performed on it. It's matching q which is "#somekeyword" with a file that begins with "%23somekeyword".

I'm not sure this is the best solution, or, if it is, the best way to implement it, but in the last_archive function, matching quote(q, safe='') against the filename instead of just q on line 218 fixed the problem.

see below:

def last_archive(q):
     other_archive_files = []
     for filename in os.listdir("."):
          if re.match("^%s-\d+\.json$" % quote(q, safe=''), filename):
               other_archive_files.append(filename)
     other_archive_files.sort()
     while len(other_archive_files) != 0:
          f = other_archive_files.pop()
          if os.path.getsize(f) > 0:
               return f
     return None

difference about oldest tweet with/without --scrape

There are no errors in the logs. For the same hashtag I frequently get an older start without --scrape, but fewer results than when adding --scrape:

  without --scrape
  start: 518128367799787520 [Fri Oct 03 20:00:03 +0000 2014]
  end:   521062912107216897 [Sat Oct 11 22:20:53 +0000 2014]
  total: 5540
  with --scrape
  start: 520990845244563457 [Sat Oct 11 17:34:31 +0000 2014]
  end:   521058075718189056 [Sat Oct 11 22:19:40 +0000 2014]
  total: 5709

Am I misunderstanding how the "sync points" differ?

Remove --scrape mode

Twitter's ToS pretty explicitly forbid what the --scrape option does. I'm removing this functionality so we play nicely:

access or search or attempt to access or search the Services by any means (automated or otherwise) other than through our currently available, published interfaces that are provided by Twitter (and only pursuant to those terms and conditions), unless you have been specifically allowed to do so in a separate agreement with Twitter (NOTE: crawling the Services is permissible if done in accordance with the provisions of the robots.txt file, however, scraping the Services without the prior consent of Twitter is expressly prohibited);

Plus it never seemed to work terribly well, and complicated the code a bit.

hydrate read from stdin

It would be nice if it were possible to have twarc read a stream of ids from stdin rather than requiring them to be in an uncompressed file:

zcat ids.txt.gz | twarc.py --hydrate | gzip - > tweets.json.tz

This would make it similar to the various scripts in utils.

Official Twitter bug affects twarc users too (error 404)

There has been an official bug for a couple of weeks now: Twitter API calls on all major platforms (and in their own apps too) frequently get a 404 with an unknown state. It's also discussed here:
https://twittercommunity.com/t/intermittent-404-responses-from-rest-api/46712
Despite an apparent resolution by Twitter, it has started happening again.
Twarc users have probably hit it recently too (requests.exceptions.HTTPError: 404 Client Error: Not Found); it looks like this:

2015-07-18 16:48:49,156 INFO archived 622089779308593152
Traceback (most recent call last):
  File "../../twarc/utils/archive.py", line 141, in <module>
    main()
  File "../../twarc/utils/archive.py", line 107, in main
    for tweet in tweets:
  File "build/bdist.macosx-10.6-intel/egg/twarc.py", line 235, in search
  File "build/bdist.macosx-10.6-intel/egg/twarc.py", line 188, in new_f
    return None
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/models.py", line 851, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found

If it happens, retry your search until it completes without errors. It's not a permanent failure but an intermittent one on Twitter's side.
@edsu proposed introducing a catch-and-retry. I'm sharing this because, in my opinion, it will keep happening in the future, since they killed the Whale.
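
A hedged sketch of that catch-and-retry idea, re-issuing the request a few times with a short pause when an intermittent 404 comes back; session is assumed to be an OAuth-signed requests session.

import logging
import time

def get_with_retry(session, url, params, tries=5, wait=5):
    """Retry a GET a few times when Twitter returns an intermittent 404."""
    for attempt in range(tries):
        resp = session.get(url, params=params)
        if resp.status_code == 404 and attempt < tries - 1:
            logging.warning("intermittent 404, retry %s of %s", attempt + 1, tries)
            time.sleep(wait)
            continue
        resp.raise_for_status()
        return resp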

stream failure

After running twarc in stream mode for a few days it died with this error:

(twarc)ubuntu@ip-10-39-110-115:/mnt/cuba$ ~/twarc/twarc.py --stream --query "#cuba,#cubapolicy"
Traceback (most recent call last):
  File "/home/ubuntu/twarc/twarc.py", line 368, in <module>

  File "/home/ubuntu/twarc/twarc.py", line 241, in archive
    """
KeyError: 'id_str'

archive.py usage?

It would be nice if archive.py reported a friendly error message when not all arguments are supplied, instead of this:

Natalies-MacBook-Air:Library nataliebaur$ utils/archive.py

-bash: utils/archive.py: No such file or directory

Natalies-MacBook-Air:Library nataliebaur$ utils/archive.py semanaNT /mnt/semanaNT/semanaNT

-bash: utils/archive.py: No such file or directory

Natalies-MacBook-Air:Library nataliebaur$ 

too many retries fail

On the first run of twarc --scrape, after fetching for a while, it stops with this error, looping while trying to fetch an ID.

2014-10-07 16:35:49,958 ERROR unable to fetch https://api.twitter.com/1.1/statuses/show.json?id=519514618390392832 - too many tries!

It doesn't happen on the next relaunch. But if you drop all the json files and rerun the same query, it repeats all the failures.
It looks like it is trying to fetch a missing ID.

TypeError on archive.py

I am unable to get utils/archive.py to work anymore. I can replicate this on two machines and will keep checking to make sure this isn't pilot error. In the meantime:

Run twarc.py --search yoursearch > tweets.json

Then utils/archive.py yoursearch /path/to/save

produces (in my case)

Traceback (most recent call last):
  File "twarc/utils/archive.py", line 110, in <module>
    main()
  File "twarc/utils/archive.py", line 62, in main
    t = twarc.Twarc()
TypeError: __init__() takes exactly 5 arguments (1 given)

Cheers,
./fxk

profiles

It would be handy if twarc had a notion of profiles, similar to amazon's aws-cli. This way the various keys could be saved in a file, and there could be a default.

When started without any keys twarc could prompt for the consumer keys, give the user a URL to visit in their browser to get the access keys, and then save all the keys in their default profile ($HOME/.twarcrc).
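
A hypothetical sketch of what a $HOME/.twarcrc with profiles could look like and how it might be read; section and key names are illustrative only.

# [default]
# consumer_key = ...
# consumer_secret = ...
# access_token = ...
# access_token_secret = ...
#
# [research]
# consumer_key = ...
import os
try:
    from configparser import ConfigParser  # Python 3
except ImportError:
    from ConfigParser import ConfigParser  # Python 2

def load_profile(name="default", path=os.path.expanduser("~/.twarcrc")):
    """Return the keys stored under one named profile as a dict."""
    parser = ConfigParser()
    parser.read(path)
    return dict(parser.items(name))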

same query results with mismatches

I'm testing the current release with a new hashtag, used on Twitter only since yesterday. I get different quantities for the same query when repeating it every 30 minutes, both with and without --scrape.
With summarize.py I can verify that the timings differ too.

Hydrate logging "Resetting dropped connection: api.twitter.com"

Not sure whether it's related to #55, because I am hydrating a previous twarc session while analyzing a series with lots of spam. I started with 28,141 tweet ids and obtained 27,641 tweets. I lost 500, and I found 5 dropped connections. Looking at the numbers, it seems that when an error occurs the entire batch of 100 is lost.

2015-02-20 12:48:18,320 INFO Starting new HTTPS connection (1): api.twitter.com
2015-02-20 12:48:32,347 INFO Resetting dropped connection: api.twitter.com
2015-02-20 12:48:47,241 INFO Resetting dropped connection: api.twitter.com
2015-02-20 12:49:02,603 INFO Resetting dropped connection: api.twitter.com
2015-02-20 13:03:33,392 INFO Resetting dropped connection: api.twitter.com
2015-02-20 13:03:48,445 INFO Resetting dropped connection: api.twitter.com

I repeated the hydrate a few hours later and got the same 5 dropped connections at the same points as in the previous run (I checked the previous and next tweet ids too).
Could it be some kind of new API moderation?
