
twitterscraper's Introduction



Synopsis

A simple script to scrape tweets, using the Python package requests to retrieve the content and BeautifulSoup4 to parse it.

1. Motivation

Twitter provides REST APIs that developers can use to access and read Twitter data. It also provides a Streaming API that can be used to access Twitter data in real time.

Most software written to access Twitter data provides a library that functions as a wrapper around Twitter's Search and Streaming APIs, and is therefore constrained by the limitations of those APIs.

With Twitter's Search API you can only send 180 requests every 15 minutes. With a maximum of 100 tweets per request, you can mine 72,000 tweets per hour (4 x 180 x 100 = 72,000). With TwitterScraper you are not limited by this number, but only by your internet speed/bandwidth and the number of instances of TwitterScraper you are willing to start.

One of the bigger disadvantages of the Search API is that you can only access Tweets written in the past 7 days. This is a major bottleneck for anyone looking for older data. With TwitterScraper there is no such limitation.

Per Tweet it scrapes the following information (a short Python sketch showing how to access these fields follows the two lists below):
  • Tweet-id
  • Tweet-url
  • Tweet text
  • Tweet html
  • Links inside Tweet
  • Hashtags inside Tweet
  • Image URLs inside Tweet
  • Video URL inside Tweet
  • Tweet timestamp
  • Tweet Epoch timestamp
  • Tweet No. of likes
  • Tweet No. of replies
  • Tweet No. of retweets
  • Username
  • User Full Name / Screen Name
  • User ID
  • Tweet is a reply to
  • Tweet is replied to
  • List of users the Tweet is a reply to
  • Tweet ID of parent tweet
In addition, it can scrape the following user information:
  • Date user joined
  • User location (if filled in)
  • User blog (if filled in)
  • User No. of tweets
  • User No. of following
  • User No. of followers
  • User No. of likes
  • User No. of lists
  • User is verified
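
As a minimal sketch of accessing these fields from Python (the attribute names text, timestamp, likes and retweets are assumptions based on the lists above; check the Tweet class of your installed version):

from twitterscraper import query_tweets

# Scrape a handful of tweets and print a few of the fields listed above.
# Attribute names are assumptions based on this README; verify them
# against the Tweet class of the version you have installed.
for tweet in query_tweets("Trump", limit=20):
    print(tweet.timestamp, tweet.user)
    print(tweet.text)
    print("likes: {}, retweets: {}".format(tweet.likes, tweet.retweets))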

2. Installation and Usage

To install twitterscraper:

(sudo) pip install twitterscraper

or you can clone the repository and run the following in the folder containing setup.py:

python setup.py install

If you prefer more isolation, you can build a Docker image:

docker build -t twitterscraper:build .

and run your container with:

docker run --rm -it -v/<PATH_TO_SOME_SHARED_FOLDER_FOR_RESULTS>:/app/data twitterscraper:build <YOUR_QUERY>
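
For example, with a hypothetical results folder ~/twitterscraper_results and the query Trump (both are placeholders; substitute your own):

docker run --rm -it -v ~/twitterscraper_results:/app/data twitterscraper:build Trump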

2.2 The CLI

You can use the command line application to get your tweets stored to JSON right away. Twitterscraper takes several arguments:

  • -h or --help Print the help message and exit.
  • -l or --limit TwitterScraper stops scraping when at least the number of tweets indicated with --limit has been scraped. Since tweets are retrieved in batches of 20, this will always be a multiple of 20. Omit the limit to retrieve all tweets. You can abort the scraping at any time by pressing Ctrl+C; the tweets scraped so far will be stored safely in your JSON file.
  • --lang Retrieves tweets written in a specific language. Currently 30+ languages are supported. For a full list of the languages print out the help message.
  • -bd or --begindate Set the date from which TwitterScraper should start scraping for your query. Format is YYYY-MM-DD. The default value is set to 2006-03-21. This does not work in combination with --user.
  • -ed or --enddate Set the date at which TwitterScraper should stop scraping for your query. Format is YYYY-MM-DD. The default value is set to today. This does not work in combination with --user.
  • -u or --user Scrapes the tweets from that user's profile page. This also includes all retweets by that user. See section 2.2.3 in the examples below for more information.
  • --profiles: In addition to the tweets, TwitterScraper will also scrape the profile information of the users who have written them. The results will be saved in the file userprofiles<filename>.
  • -p or --poolsize Set the number of parallel processes TwitterScraper should initiate while scraping for your query. Default value is set to 20. Depending on the computational power you have, you can increase this number. It is advised to keep this number below the number of days you are scraping. For example, if you are scraping from 2017-01-10 to 2017-01-20, you can set this number to a maximum of 10. If you are scraping from 2016-01-01 to 2016-12-31, you can increase this number to a maximum of 150, if you have the computational resources. Does not work in combination with --user.
  • -o or --output Gives the name of the output file. If no output filename is given, the default filename 'tweets.json' or 'tweets.csv' will be used.
  • -c or --csv Write the result to a CSV file instead of a JSON file.
  • -d or --dump: With this argument, the scraped tweets will be printed to the screen instead of written to an output file. If you use this argument, the --output argument does not need to be used.
  • -ow or --overwrite: With this argument, if the output file already exists it will be overwritten. If this argument is not set (default) twitterscraper will exit with the warning that the output file already exists.
  • -dp or --disableproxy: With this argument, proxy servers are not used when scraping tweets or user profiles from Twitter.

2.2.1 Examples of simple queries

Below is an example of how twitterscraper can be used:

twitterscraper Trump --limit 1000 --output=tweets.json

twitterscraper Trump -l 1000 -o tweets.json

twitterscraper Trump -l 1000 -bd 2017-01-01 -ed 2017-06-01 -o tweets.json

2.2.2 Examples of advanced queries

You can use any advanced query Twitter supports. An advanced query should be placed within quotes, so that twitterscraper can recognize it as one single query.

Here are some examples:

  • search for the occurrence of 'Bitcoin' or 'BTC': twitterscraper "Bitcoin OR BTC" -o bitcoin_tweets.json -l 1000
  • search for the occurrence of 'Bitcoin' and 'BTC': twitterscraper "Bitcoin AND BTC" -o bitcoin_tweets.json -l 1000
  • search for tweets from a specific user: twitterscraper "Blockchain from:VitalikButerin" -o blockchain_tweets.json -l 1000
  • search for tweets to a specific user: twitterscraper "Blockchain to:VitalikButerin" -o blockchain_tweets.json -l 1000
  • search for tweets written from a location: twitterscraper "Blockchain near:Seattle within:15mi" -o blockchain_tweets.json -l 1000

You can construct an advanced query on Twitter Advanced Search, or use one of the operators shown on that page. Also see Twitter's standard search operators.

2.2.3 Examples of scraping user pages

You can also scrape all tweets written or retweeted by a specific user. This is done by adding the boolean -u / --user argument. If this argument is used, the search term should be equal to the username.

Here is an example of scraping a specific user:

twitterscraper realDonaldTrump --user -o tweets_username.json

This does not work in combination with -p, -bd, or -ed.

The main difference from the example "search for tweets from a specific user" in section 2.2.2 is that this method really scrapes all tweets from a profile page (including retweets), while the example in 2.2.2 scrapes the results from the search page (excluding retweets).

2.3 From within Python

You can easily use TwitterScraper from within Python:

from twitterscraper import query_tweets

if __name__ == '__main__':
    list_of_tweets = query_tweets("Trump OR Clinton", 10)

    # Print the retrieved tweets to the screen:
    for tweet in list_of_tweets:
        print(tweet)

    # Or save the retrieved tweets to a file:
    with open('output.txt', 'w', encoding='utf-8') as output:
        for tweet in list_of_tweets:
            output.write(tweet.text + '\n')

2.3.1 Examples of Python Queries

  • Query tweets from a given URL:
    Parameters:
    • query: The search query part of the URL
    • lang: Language of the queried URL
    • pos: Position in the result stream from which to continue scraping
    • retry: Number of times to retry on error
    query_single_page(query, lang, pos, retry=50, from_user=False, timeout=60)
  • Query all tweets that match query (a sketch using this signature follows this list):
    Parameters:
    • query: The query search parameter
    • limit: Number of tweets returned
    • begindate: Start date of query
    • enddate: End date of query
    • poolsize: Number of parallel processes to use
    • lang: Language of query
    query_tweets('query', limit=None, begindate=dt.date.today(), enddate=dt.date.today(), poolsize=20, lang='')
  • Query tweets from a specific user:
    Parameters:
    • user: Twitter username
    • limit: Number of tweets returned
    query_tweets(user, limit=None)
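
Putting the query_tweets signature above to work, here is a minimal sketch that scrapes a date-bounded query and saves a few fields per tweet as JSON (the attribute names user, timestamp and text are assumptions based on the output example in section 3):

import datetime as dt
import json

from twitterscraper import query_tweets

if __name__ == '__main__':
    tweets = query_tweets("Bitcoin", limit=200,
                          begindate=dt.date(2017, 1, 1),
                          enddate=dt.date(2017, 6, 1),
                          poolsize=20, lang='en')

    # Attribute names follow the output example in section 3 and may
    # differ between versions of twitterscraper.
    with open('bitcoin_tweets.json', 'w', encoding='utf-8') as f:
        json.dump([{'user': t.user,
                    'timestamp': str(t.timestamp),
                    'text': t.text} for t in tweets],
                  f, ensure_ascii=False)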

2.4 Scraping for retweets

A regular search within Twitter will not show you any retweets, so the output of TwitterScraper will not contain any retweets either.

To give an example: if user1 has written a tweet containing #trump2020 and user2 has retweeted this tweet, a search for #trump2020 will only show the original tweet.

The only way you can scrape for retweets is if you scrape for all tweets of a specific user with the -u / --user argument.

2.5 Scraping for User Profile information

By adding the argument --profiles, twitterscraper will, in addition to the tweets, also scrape the profile information of the users who have written these tweets. The results will be saved in the file "userprofiles<filename>".

Try not to use this argument too much. If you have already scraped profile information for a set of users, there is no need to do it again :) It is also possible to scrape for profile information without scraping for tweets. Examples of this can be found in the examples folder.
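
For example, to scrape tweets together with the profile information of their authors in one run (query and filename are just placeholders):

twitterscraper Trump --limit 100 --profiles --output=tweets.json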

3. Output

All of the retrieved Tweets are stored in the indicated output file. The contents of the output file will look like:

[{"fullname": "Rupert Meehl", "id": "892397793071050752", "likes": "1", "replies": "0", "retweets": "0", "text": "Latest: Trump now at lowest Approval and highest Disapproval ratings yet. Oh, we're winning bigly here ...\n\nhttps://projects.fivethirtyeight.com/trump-approval-ratings/?ex_cid=rrpromo\u00a0\u2026", "timestamp": "2017-08-01T14:53:08", "user": "Rupert_Meehl"}, {"fullname": "Barry Shapiro", "id": "892397794375327744", "likes": "0", "replies": "0", "retweets": "0", "text": "A former GOP Rep quoted this line, which pretty much sums up Donald Trump. https://twitter.com/davidfrum/status/863017301595107329\u00a0\u2026", "timestamp": "2017-08-01T14:53:08", "user": "barryshap"}, (...)
]

3.1 Opening the output file

In order to correctly handle all possible characters in the tweets (think of Japanese or Arabic characters), the output is saved as utf-8 encoded bytes. That is why you could see text like "\u30b1 \u30f3 \u3055 \u307e ..." in the output file.

What you should do is open the file with the proper encoding:

[Example of output with Japanese characters]
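
A minimal sketch of reading the output back with the correct encoding (assuming the default tweets.json output file):

import codecs
import json

# Open the scraper output with an explicit utf-8 encoding so that
# Japanese, Arabic, etc. characters are decoded correctly.
with codecs.open('tweets.json', 'r', 'utf-8') as f:
    tweets = json.load(f)

print(tweets[0]['text'])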

3.1.2 Opening into a pandas dataframe

After the file has been opened, it can easily be converted into a `pandas` DataFrame:

import pandas as pd
df = pd.read_json('tweets.json', encoding='utf-8')
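
From there the usual pandas operations apply; for example, counting tweets per user (the column names follow the output example in section 3):

import pandas as pd

df = pd.read_json('tweets.json', encoding='utf-8')

# Make sure timestamps are datetimes, then count tweets per user.
df['timestamp'] = pd.to_datetime(df['timestamp'])
print(df.groupby('user').size().sort_values(ascending=False).head())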

twitterscraper's People

Contributors

0xmilly, adtac, adupuis2, attalakheireddine, b3ql, bizso09, calclavia, cenguix, danp1925, dzautner, educatorsrlearners, haidyi, hdnl, im-n1, isaacimholt, kanihal, linqlover, nearlyeveryone, nukopy, patrickdundas, petrbel, rachadabichahine, samirchar, sils, taspinar, twollnik, wildgarden, yitongl, ylijokic, yvelkram


twitterscraper's Issues

FakeUserAgentError: Error occurred during getting browser

Running twitterscraper, I ran into this error using the example given in the readme twitterscraper Trump%20since%3A2017-01-03%20until%3A2017-01-04 -o tweets.json

I was running a version from March and then upgraded to the latest master.zip but I still got the same error... Any ideas on how to resolve this? I'm running Ubuntu 16.04...

Traceback (most recent call last):
  File "/usr/local/bin/twitterscraper", line 9, in <module>
    load_entry_point('twitterscraper==0.3.1', 'console_scripts', 'twitterscraper')()
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 542, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 2569, in load_entry_point
    return ep.load()
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 2229, in load
    return self.resolve()
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 2235, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "build/bdist.linux-x86_64/egg/twitterscraper/__init__.py", line 13, in <module>
  File "build/bdist.linux-x86_64/egg/twitterscraper/query.py", line 14, in <module>
  File "/usr/local/lib/python2.7/dist-packages/fake_useragent/fake.py", line 139, in __getattr__
    raise FakeUserAgentError('Error occurred during getting browser')  # noqa
fake_useragent.errors.FakeUserAgentError: Error occurred during getting browser

Python 3 support

Would be nice to be able to use this in python 3.

(pythonclock.org :))

Issues with since and until in commandline

twitterscraper "%24PEP"%20since%3A2017-10-05 -o pep.out

this works, but when running it

twitterscraper "%24PEP"%20since%3A2017-10-05%20until%3A2017-10-05 -o pep.out

it doesn't work.

I.e., I want to limit the results to only one single day, but it won't work.

ImportError: No module named 'tweet'

I get the following error when trying to use this.
Installed in a venv via pip

Traceback (most recent call last):
  File "collector.py", line 1, in <module>
    import twitterscraper
  File "/home/m0hawk/Documents/dev/TUHH/testvenv/lib/python3.5/site-packages/twitterscraper/__init__.py", line 13, in <module>
    from twitterscraper.query import query_tweets
  File "/home/m0hawk/Documents/dev/TUHH/testvenv/lib/python3.5/site-packages/twitterscraper/query.py", line 14, in <module>
    from tweet import Tweet
ImportError: No module named 'tweet'

Use proper dates

We're currently extracting human-readable timestamps; however, there is a data-time-ms property in the span (inside the a element) which contains the exact time: <span class="_timestamp js-short-timestamp " data-aria-label-part="last" data-time="1476057559" data-time-ms="1476057559000" data-long-form="true">Oct 9</span>. Parsing the human-readable string into proper date objects is almost impossible: it sometimes contains AM/PM, sometimes not, sometimes dots here and there, and occasionally I get localized months.
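
Converting that attribute once scraped is straightforward; a sketch (data-time-ms is milliseconds since the epoch, per the HTML snippet above):

from datetime import datetime

data_time_ms = 1476057559000  # value of the data-time-ms attribute above
print(datetime.utcfromtimestamp(data_time_ms / 1000.0))  # 2016-10-09 23:59:19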

Encoding issue when applying the script to non-English language

Dear author,
Thanks very much for your kind work! I am a beginner at Python programming and hope this will not trouble you too much.
The problem is that I am applying the script to mine non-English text (via the Twitter advanced search page), such as "戦う", but non-English text in the output file is always displayed as escaped bytes like "\xe7\x8e\xb2\xe9 ...".
Even when typing the command "print(tweet.text.encode('utf-8'))" (or with another encoding), the output is still the same.
I am wondering if there are specific measures to display the non-English text correctly?
Thanks!
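
One hedged workaround (assuming Python 2, which the escaped byte strings suggest, and that tweet.text is a unicode string) is to write the decoded text to a utf-8 encoded file instead of relying on the console:

# -*- coding: utf-8 -*-
import io

from twitterscraper import query_tweets

# io.open with an explicit encoding writes the characters themselves,
# avoiding the escaped "\xe7\x8e..." representation of byte strings.
with io.open('tweets.txt', 'w', encoding='utf-8') as f:
    for tweet in query_tweets(u'戦う', limit=20):
        f.write(tweet.text + u'\n')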

Missing the Output JSON file ...

This question might sound silly, but I am able to use TwitterScraper successfully (with the command twitterscraper "" --output=tweets.json), yet I am unable to retrieve my JSON file. Logging shows that data is being collected, for example:
INFO: Got 137 tweets (20 new).
INFO: Got 157 tweets (20 new).
INFO: Got 177 tweets (19 new).
INFO: Got 196 tweets (19 new).
INFO: Got 215 tweets (17 new).
INFO: Got 232 tweets (20 new).
INFO: Got 252 tweets (19 new).
Specifying the exact path /Users/blahblah/tweets.JSON did not make a difference.
What am I missing? Thanks for your help in advance,

Source parameter is not passed accurately to the script

When running twitterscraper from command line, the source parameter is not accurately passed to the script if used with apostrophe.
Example:
#news AND source:"Twitter for Android"
twitterscraper %23news%20AND%20source%3A"Twitter%20for%20Android" --output=tweets_new_Android.json

tweets_new_Android.json is empty, but https://twitter.com/search?q=%23news%20AND%20source%3A%22Twitter%20for%20Android%22&src=typd shows results.
it works for sources without apostrophe:
#news AND source:"Tweetdeck"
twitterscraper %23news%20AND%20source%3A"Tweetdeck" --output=tweets_new_Tweetdeck.json

Likes, Retweets, Replies not being parsed if (> 999)

Tweet data for these fields is not being properly parsed if the values exceed 999.

I suspect that it relates to the fact that Twitter displays those values with letters in them, e.g., "1.1k" instead of 1100.

In any case, Twitterscraper returns those values as 0.
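
If you need to post-process such values yourself, here is a sketch of converting Twitter's abbreviated counts (the suffix set is an assumption):

def parse_count(value):
    # Convert strings like '1.1k' or '2M' to integers; plain digits pass through.
    value = value.strip().replace(',', '')
    multipliers = {'k': 1000, 'm': 1000000}
    suffix = value[-1].lower()
    if suffix in multipliers:
        return int(float(value[:-1]) * multipliers[suffix])
    return int(value)

print(parse_count('1.1k'))  # 1100
print(parse_count('327'))   # 327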

How to scrape users' tweets

I'm trying to extract specified users' tweets.
By using this command line: twitterscraper Trump --limit 100 --output=tweets.json

it just extracts all tweets in which the person's name is mentioned, instead of the user's own tweets.

My question is: how can I extract all of a specified user's tweets?
Thank you...

control-C Does not seem to stop Parser execution or save results

If I Ctrl+C out of the command line execution, the program does not seem to save its results anywhere. The program also continues its execution with a second iteration, which is not always desired. I ran a large search last night on separate machines, and neither of them saved their search data when Ctrl+C was used.

AttributeError on module requests

Hello @taspinar

I just found a bug while scraping tweets. When my connection is unstable, I get an error message like the following:

ERROR:root:An unknown error occurred! Returning tweets gathered so far.
Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\twitterscraper\query.py", line 93, in query_tweets_once
    pos is None
  File "C:\Python27\lib\site-packages\twitterscraper\query.py", line 53, in query_single_page
    except requests.exception.ConnectionError as e:
AttributeError: 'module' object has no attribute 'exception'

Solved; I just needed to upgrade.

Add option to query by language

Given that this is just a parameter in the Twitter API, it should be easy to do, and it is frustrating that it isn't already available.

Number of tweets in final JSON file much smaller than reported during run

So I ran the scraper for a tweeting period of around a year, with a limit of 40,000:

twitterscraper "%23bitcoin AND %23bubble since%3A2016-09-01 until%3A2017-10-10&src=typd" -l 40000 -o bitcoinbubble.json

While running, it counted all the way up to 40 thousand:
INFO: Got 39953 tweets (18 new).
INFO: Got 39971 tweets (19 new).
INFO: Got 39990 tweets (17 new).
INFO: Got tweets ranging from 2017-09-08 to 2017-10-09

But when I load the JSON file, it only contains 1528 tweets. What explains this?

Add ability to output to stdout rather than output to file

Reading the stdout of a command is much more efficient when handling a lot of requests than taxing the server's memory by creating many JSON output files. I believe that an option to print results to the console as stdout, rather than writing them to a file, would be a great feature that would expand the ways people can use this project.

More attributes

Is it possible to get more attributes, like the number of retweets, replies, and favorites? This is a feature request, I guess.

UTF-8 self.writer.writerow(post) issue

Hello !

I'm trying to scrape every tweet from an account. My script is quite simple:

#!/usr/bin/env python
# encoding: utf-8

from twitterscraper import TwitterScraper

topic = ""
cible = "username"
filename = 'username_tweets.csv'
scraper = TwitterScraper.Scraper(topic, 21000, authors=cible, filename=filename)
scraper.scrape()

It works for hundreds of tweets, but then I get this error:

Traceback (most recent call last):
  File "myscript.py", line 10, in <module>
    scraper.scrape()
  File "/usr/local/lib/python2.7/dist-packages/twitterscraper/TwitterScraper.py", line 148, in scrape
    self.write(post)
  File "/usr/local/lib/python2.7/dist-packages/twitterscraper/TwitterScraper.py", line 136, in write
    self.writer.writerow(post)

(Yes, I'm using Python 2.7; I don't know if the problem comes from there or not.)

Thanks in advance

returning usernames, not tweets

import twitterscraper as ts
usr = 'kingjames'
for tweet in ts.query_tweets(usr, 10)[:10]:
    print(tweet.user.encode('utf-8'))
#out:
b'Rypuur'
b'Powperezdiez'
b'joey_a_george'
b'mikey_rakkar'
b'yarapgv'
b'V_Nasty10'
b'downtownbrownxx'
b'DeclanJoyce'
b'atnissaa'
b'WestifiedMJ'

Basic Example/Documentation in Python file

Hi there,

It would be great, if there was a basic example that does the following in a Python script:

  1. set a query for a specific time
  2. save the data on the hard drive

Probably this is just me being new to Python, but a general documentation with a brief description for each functionality would also be nice.

Thanks in advance!

JSONify tweets properly

The namedtuple just jsonifies tweets as tuples; it would be better to be more dict-like and have the member names as keys in the outputted JSON.
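
Assuming Tweet really is a namedtuple as described, a sketch of the fix:

import json
from collections import namedtuple

# Stand-in for the scraper's Tweet type; the real class has more fields.
Tweet = namedtuple('Tweet', ['id', 'user', 'text'])
tweet = Tweet('892397793071050752', 'Rupert_Meehl', 'Latest: ...')

# _asdict() turns a namedtuple into a dict, so the member names become
# JSON keys instead of being lost in a plain array.
print(json.dumps(tweet._asdict()))
# {"id": "892397793071050752", "user": "Rupert_Meehl", "text": "Latest: ..."}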

Zero Result

Hello @taspinar

Recently I ran twitterscraper from my command line:

C:\Python27\Scripts\twitterscraper Telkomsel -o tweets.json

Unfortunately, it returns zero results. But if I add another keyword, like Telkomsel mengecewakan, it returns tweets related to the keyword.

On the other hand, if I write

C:\Python27\Scripts\twitterscraper Trump -o tweets.json it runs very well.

Why does this happen?

This is weird; I checked Telkomsel on Twitter, and sometimes it reloads and sometimes it gets stuck entirely. Is it part of a Twitter bug?

Advanced query example

Hello @taspinar

I'm new at programming.

Could you please give an example of an advanced query, in particular scraping by location and a specific time?

Thank you

Advanced query

Docu gives:
"You can use any advanced query twitter supports. Simply compile your query at https://twitter.com/search-advanced."

Let's say I try to get all tweets from user 'username'.
I get the URL https://twitter.com/search?f=tweets&q=from%3Ausername&src=typd
Which part (if not the whole URL) is the query?

Can't get data earlier than 12 days ago

Hi Taspinar and Sils,

I was collecting last year's movie data today, and it seems the date issue is occurring again: I cannot get data earlier than 12 days ago :( and I have tried many times. It's as if some sort of notification enabled Twitter to know I was trying to go back further than 12 days. How can I solve this problem?

Thank you so much!

Unknown Error

while running TwitterScraper "test" --output tweets.json --all for ~10 minutes

ERROR: An unknown error occurred! Returning tweets gathered so far.
Traceback (most recent call last):
  File "/home/lasse/prog/tie/twitterscraper/twitterscraper/query.py", line 96, in query_tweets_once
    pos is None
  File "/home/lasse/prog/tie/twitterscraper/twitterscraper/query.py", line 46, in query_single_page
    tweets = list(Tweet.from_html(html))
  File "/home/lasse/prog/tie/twitterscraper/twitterscraper/tweet.py", line 34, in from_html
    yield cls.from_soup(tweet)
  File "/home/lasse/prog/tie/twitterscraper/twitterscraper/tweet.py", line 19, in from_soup
    user=tweet.find('span', 'username').text[1:],
AttributeError: 'NoneType' object has no attribute 'text'

twitterscraper: Command not found

So whenever I am trying to run this command on my server, it's saying "Command not found". I have installed it in my home directory. Please help. Any help would be appreciated.

except urllib2.HTTPError, e: (invalid syntax)

Hello again !

I've just tried my previous script in Python 3, and immediately got this error:

File "myscript.py", line 4, in
from twitterscraper import TwitterScraper
File "/usr/local/lib/python3.4/dist-packages/twitterscraper/TwitterScraper.py", line 109
except urllib2.HTTPError, e:
^
SyntaxError: invalid syntax

Maybe it's a naive alternative, but I recently discovered requests and found that module more powerful than urllib. Here is a scraping example with requests!

Scrape tweet url

It seems to be nestled in data-permalink-path; should be an easy scrape.

Inconsistent results among multiple runs

I am using twitterscraper to get the replies to some twitter accounts.

I am running the following queries as a test:

to%3Amatteorenzi%20since%3A2017-08-21%20until%3A2017-08-27
to%3Amatteosalvinimi%20since%3A2017-08-21%20until%3A2017-08-27

When performing multiple runs I get a different number of results each time, as shown below: the left number is the result of the first query and the right one of the second. Each line is a different run.

544, 4216
386, 4121
295, 4180

Why does this happen? Is there any way I can prevent it?

Error in example code (Readme.md)

Taspinar and Sils, nice job!

A little issue in Readme.md usage example: "print(tweet.username)" should be changed to "print(tweet.user)"

Geo data?

Is there any way to scrape geo data without using the API? This isn't an issue, it's more of a question. I've been searching for a while and I can't seem to find anything.

install error

It installed correctly via pip or from source, but when trying to use the CLI or the Python shell I get this:

from twitterscraper import query_tweets
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "twitterscraper/__init__.py", line 13, in <module>
    from twitterscraper.query import query_tweets
  File "twitterscraper/query.py", line 10, in <module>
    from twitterscraper.tweet import Tweet
  File "twitterscraper/tweet.py", line 3, in <module>
    from bs4 import BeautifulSoup
  File "/usr/local/lib/python2.7/dist-packages/bs4/__init__.py", line 30, in <module>
    from .builder import builder_registry, ParserRejectedMarkup
  File "/usr/local/lib/python2.7/dist-packages/bs4/builder/__init__.py", line 314, in <module>
    from . import _html5lib
  File "/usr/local/lib/python2.7/dist-packages/bs4/builder/_html5lib.py", line 70, in <module>
    class TreeBuilderForHtml5lib(html5lib.treebuilders._base.TreeBuilder):
AttributeError: 'module' object has no attribute '_base'

i have the twitter api installed also.

Limit is inconsistent with -l flag

I have been running the following command:
twitterscraper trump -l 3 -o tweets.json, which I figured would limit the number of tweets to 3, according to the documentation.

Why is it that -l is not limiting the tweet download to just 3? I'm assuming this is not intended behavior. I have also tested this with -l at higher integers, and when set to -l 30, it always downloads 40 tweets.

I'm thinking that this behavior is caused by new tweets being tweeted as the scraper is running? Twitter briefly explains this in this article: https://developer.twitter.com/en/docs/tweets/timelines/guides/working-with-timelines
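
Since tweets are retrieved in whole batches of 20 (see the --limit documentation in section 2.2), one workaround is to slice the saved results afterwards; a sketch using the JSON output below:

import json

with open('tweets.json') as f:
    tweets = json.load(f)

# Enforce the exact limit after the fact.
tweets = tweets[:3]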

The output of tweets.json is the following when using --limit 3 (contains 20 tweets):

[{"timestamp": "2017-11-02T18:26:36", "text": "trump owns it now since he gutted the subsidies.", "user": "MoOkonski", "retweets": "0", "replies": "0", "fullname": "Maureenski", "id": "926153585397780480", "likes": "0"}, {"timestamp": "2017-11-02T18:26:36", "text": "Congress, impeach Trump or resign \u2026http://makeamericagreatagainreally.blogspot.com/2017/10/the-workings-of-donald-j-trumps-mind.html\u00a0\u2026 #Congress #impeachmentpic.twitter.com/lQz5q6ZW5Z", "user": "THIRDSTONE56", "retweets": "0", "replies": "0", "fullname": "THIRD STONE", "id": "926153585750085632", "likes": "0"}, {"timestamp": "2017-11-02T18:26:36", "text": "#trump ahora es un asesino tambi\u00e9n.", "user": "rikrdotc", "retweets": "0", "replies": "0", "fullname": "Ricardo C", "id": "926153585800482817", "likes": "0"}, {"timestamp": "2017-11-02T18:26:36", "text": "Donna Brazile: I found 'proof' the DNC rigged the nomination for Hillary Clinton #DrainTheSwamp #Trump POTUS http://www.foxnews.com/politics/2017/11/02/donna-brazile-found-proof-dnc-rigged-nomination-for-hillary-clinton.html\u00a0\u2026", "user": "DavidDoright", "retweets": "0", "replies": "0", "fullname": "D.W.Trump\u00a0\ud83c\uddfa\ud83c\uddf8", "id": "926153586098294785", "likes": "0"}, {"timestamp": "2017-11-02T18:26:36", "text": "Trump to press for end to North Korea nuclear program on Asia trip: White House http://ift.tt/2z9xKoh\u00a0", "user": "BreakingNewss3", "retweets": "0", "replies": "0", "fullname": "Breaking News", "id": "926153586958053376", "likes": "0"}, {"timestamp": "2017-11-02T18:26:36", "text": "Nixon used his China trip as distraction to investigations of him. Trump going to Asia; echoes of the same or misdirect to a deeper issue.", "user": "TalkinToU", "retweets": "0", "replies": "0", "fullname": "TalkinToU", "id": "926153587268263936", "likes": "0"}, {"timestamp": "2017-11-02T18:26:38", "text": "George Papadopoulos was much more than what Trump says he was. https://twitter.com/SethAbramson/status/925923595079045120\u00a0\u2026", "user": "Resistacat", "retweets": "0", "replies": "0", "fullname": "Dee Ramee", "id": "926153592427466753", "likes": "0"}, {"timestamp": "2017-11-02T18:26:38", "text": "Mysterious Trump backer Mercer stepping down at fund, selling Breitbart stake. #Trump #Breibarthttps://www.cnbc.com/2017/11/02/billionaire-trump-backer-robert-mercer-to-step-down-from-hedge-fund.html\u00a0\u2026", "user": "PSuiteNetwork", "retweets": "0", "replies": "0", "fullname": "John Cutler", "id": "926153593635459072", "likes": "0"}, {"timestamp": "2017-11-02T18:26:38", "text": "This is far from over. Wait for it. And the collusion won't be over the election it will be over Trump's shady business dealings in Russia", "user": "HarryJoachim", "retweets": "0", "replies": "0", "fullname": "Harry Joachim", "id": "926153594939871234", "likes": "0"}, {"timestamp": "2017-11-02T18:26:38", "text": "Donna Brazil confession: Trump & Bernie were right, the DNC rigged the nomination for Hillary, big league!!\n\nhttps://townhall.com/tipsheet/guybenson/2017/11/02/donna-brazile-trump-and-bernie-were-right-the-dnc-rigged-it-for-hillary-big-league-n2403847\u00a0\u2026", "user": "LovToRideMyTrek", "retweets": "0", "replies": "0", "fullname": "BOYCOTT HOLLYWOOD\u00a0\ud83c\udf83", "id": "926153595346616327", "likes": "0"}, {"timestamp": "2017-11-02T18:26:39", "text": "Why Harry Belafonte's Warning About Trump Is Important Now More Than Ever. 
Read here: http://allthat.tv/posts/why-harry-belafonte-s-warning-about-trump-is-important-now-more-than-ever\u00a0\u2026", "user": "ArmChairPundt", "retweets": "0", "replies": "0", "fullname": "Lachelle", "id": "926153596340649984", "likes": "0"}, {"timestamp": "2017-11-02T18:26:39", "text": "Thank Trump for that", "user": "DennisG_Shea", "retweets": "0", "replies": "0", "fullname": "Dennis Shea", "id": "926153596567203841", "likes": "0"}, {"timestamp": "2017-11-02T18:26:39", "text": "GREAT AGAIN: POTUS Trump Announces $100 Billion Company\u2019s Return To USA (VIDEO)\nhttps://goo.gl/SaF4Us\u00a0\n\nNovember 2, 2017\nby Joshua ...pic.twitter.com/dL0nG1oOT8", "user": "warfarenews", "retweets": "0", "replies": "0", "fullname": "Warfare Web", "id": "926153596629950464", "likes": "0"}, {"timestamp": "2017-11-02T18:26:39", "text": "Time To Turn The Channel. I Can Only Handle So Much In One Day Of Trump & The Counterfeit Assholes Surrounding Him! LIES-LIES-LIES!!", "user": "Brokenknee1Jim", "retweets": "0", "replies": "0", "fullname": "James", "id": "926153597427113984", "likes": "0"}, {"timestamp": "2017-11-02T18:26:39", "text": "Trump doesn\u2019t really want you to know Obamacare enrollment just started -- By @svdate https://www.huffingtonpost.com/entry/trump-obamacare-enrollment_us_59fa3adfe4b01b47404810d0?ncid=engmodushpmg00000004\u00a0\u2026 via @HuffPostPol", "user": "michaellamperd", "retweets": "0", "replies": "0", "fullname": "Mick", "id": "926153597489881088", "likes": "0"}, {"timestamp": "2017-11-02T18:26:39", "text": "The Trump-Russia dossier cost $168,000, not $12 million, like president claimed http://www.newsweek.com/trump-dossier-cost-millions-699816\u00a0\u2026", "user": "XtyMiller", "retweets": "0", "replies": "0", "fullname": "Kilikina", "id": "926153598140014592", "likes": "0"}, {"timestamp": "2017-11-02T18:26:39", "text": "MOMENTS AGO: Pres. Trump: \"Congress must end chain migration so that we can have a system that is security based, not the way it is now.\"...", "user": "The_News_Corner", "retweets": "0", "replies": "0", "fullname": "Ok", "id": "926153598202912769", "likes": "0"}, {"timestamp": "2017-11-02T18:26:39", "text": "Trump Is Quietly Deregulating All the Things | Brittany Hunter https://fee.org/articles/trump-is-quietly-deregulating-all-the-things/\u00a0\u2026 via @feeonline", "user": "badcraigsnews", "retweets": "0", "replies": "0", "fullname": "Badcraigsnews", "id": "926153598433660928", "likes": "0"}, {"timestamp": "2017-11-02T18:26:39", "text": "Trump to press for end to North Korea nuclear program on Asia trip: White House http://ift.tt/2z7GXgZ\u00a0", "user": "_politic_us_", "retweets": "0", "replies": "0", "fullname": "Audrey", "id": "926153598437818368", "likes": "0"}, {"timestamp": "2017-11-02T18:26:39", "text": "House Democrats file lawsuit over access to Trump hotel documents - Politico https://www.politico.com/story/2017/11/02/trump-hotel-documents-lawsuit-244455\u00a0\u2026", "user": "PS641600", "retweets": "0", "replies": "0", "fullname": "PeterS", "id": "926153598446301184", "likes": "0"}]

Syntax error: invalid character in identifier line 9

Successfully installed twitterscraper in Python 3.6, but I get the above message from the CMD prompt, indicating a problem with the filename.
I think it is because there is no data to store in the file, as there is nothing on screen from the "print(tweet)" in line 6.

Please help (python novice)

John
