
twitterscraper's Introduction



Synopsis

A simple script to scrape tweets, using the Python package requests to retrieve the content and BeautifulSoup4 to parse it.

1. Motivation

Twitter provides REST APIs that developers can use to access and read Twitter data. It also provides a Streaming API that can be used to access Twitter data in real time.

Most software written to access Twitter data provides a library that functions as a wrapper around Twitter's Search and Streaming APIs, and is therefore constrained by the limitations of those APIs.

With Twitter's Search API you can only send 180 requests every 15 minutes. With a maximum of 100 tweets per request, you can mine 72,000 tweets per hour (4 x 180 x 100 = 72,000). With TwitterScraper you are not limited by this number, but only by your internet speed/bandwidth and the number of instances of TwitterScraper you are willing to start.

One of the bigger disadvantages of the Search API is that you can only access Tweets written in the past 7 days. This is a major bottleneck for anyone looking for older data. With TwitterScraper there is no such limitation.

Per Tweet it scrapes the following information (a short Python sketch showing how to access these fields follows the two lists below):
  • Tweet-id
  • Tweet-url
  • Tweet text
  • Tweet html
  • Links inside Tweet
  • Hashtags inside Tweet
  • Image URLs inside Tweet
  • Video URL inside Tweet
  • Tweet timestamp
  • Tweet Epoch timestamp
  • Tweet No. of likes
  • Tweet No. of replies
  • Tweet No. of retweets
  • Username
  • User Full Name / Screen Name
  • User ID
  • Tweet is a reply to
  • Tweet is replied to
  • List of users the Tweet is a reply to
  • Tweet ID of parent tweet
In addition, it can scrape the following user information:
  • Date user joined
  • User location (if filled in)
  • User blog (if filled in)
  • User No. of tweets
  • User No. of following
  • User No. of followers
  • User No. of likes
  • User No. of lists
  • User is verified
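
As a minimal sketch of accessing these fields from Python (the attribute names text, timestamp, likes and retweets are assumptions based on the lists above; check the Tweet class of your installed version):

from twitterscraper import query_tweets

# Scrape a handful of tweets and print a few of the fields listed above.
# Attribute names are assumptions based on this README; verify them
# against the Tweet class of the version you have installed.
for tweet in query_tweets("Trump", limit=20):
    print(tweet.timestamp, tweet.user)
    print(tweet.text)
    print("likes: {}, retweets: {}".format(tweet.likes, tweet.retweets))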

2. Installation and Usage

To install twitterscraper:

(sudo) pip install twitterscraper

or you can clone the repository and run the following in the folder containing setup.py:

python setup.py install

If you prefer more isolation, you can build a Docker image:

docker build -t twitterscraper:build .

and run your container with:

docker run --rm -it -v/<PATH_TO_SOME_SHARED_FOLDER_FOR_RESULTS>:/app/data twitterscraper:build <YOUR_QUERY>
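
For example, with a hypothetical results folder ~/twitterscraper_results and the query Trump (both are placeholders; substitute your own):

docker run --rm -it -v ~/twitterscraper_results:/app/data twitterscraper:build Trump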

2.2 The CLI

You can use the command line application to get your tweets stored to JSON right away. Twitterscraper takes several arguments:

  • -h or --help Print the help message and exit.
  • -l or --limit TwitterScraper stops scraping when at least the number of tweets indicated with --limit has been scraped. Since tweets are retrieved in batches of 20, this will always be a multiple of 20. Omit the limit to retrieve all tweets. You can abort the scraping at any time by pressing Ctrl+C; the tweets scraped so far will be stored safely in your JSON file.
  • --lang Retrieves tweets written in a specific language. Currently 30+ languages are supported. For a full list of the languages print out the help message.
  • -bd or --begindate Set the date from which TwitterScraper should start scraping for your query. Format is YYYY-MM-DD. The default value is set to 2006-03-21. This does not work in combination with --user.
  • -ed or --enddate Set the date at which TwitterScraper should stop scraping for your query. Format is YYYY-MM-DD. The default value is set to today. This does not work in combination with --user.
  • -u or --user Scrapes the tweets from that user's profile page. This also includes all retweets by that user. See section 2.2.3 in the examples below for more information.
  • --profiles: In addition to the tweets, TwitterScraper will also scrape the profile information of the users who have written them. The results will be saved in the file userprofiles<filename>.
  • -p or --poolsize Set the number of parallel processes TwitterScraper should initiate while scraping for your query. Default value is set to 20. Depending on the computational power you have, you can increase this number. It is advised to keep this number below the number of days you are scraping. For example, if you are scraping from 2017-01-10 to 2017-01-20, you can set this number to a maximum of 10. If you are scraping from 2016-01-01 to 2016-12-31, you can increase this number to a maximum of 150, if you have the computational resources. Does not work in combination with --user.
  • -o or --output Gives the name of the output file. If no output filename is given, the default filename 'tweets.json' or 'tweets.csv' will be used.
  • -c or --csv Write the result to a CSV file instead of a JSON file.
  • -d or --dump: With this argument, the scraped tweets will be printed to the screen instead of written to an output file. If you use this argument, the --output argument does not need to be used.
  • -ow or --overwrite: With this argument, if the output file already exists it will be overwritten. If this argument is not set (default) twitterscraper will exit with the warning that the output file already exists.
  • -dp or --disableproxy: With this argument, proxy servers are not used when scraping tweets or user profiles from Twitter.

2.2.1 Examples of simple queries

Below is an example of how twitterscraper can be used:

twitterscraper Trump --limit 1000 --output=tweets.json

twitterscraper Trump -l 1000 -o tweets.json

twitterscraper Trump -l 1000 -bd 2017-01-01 -ed 2017-06-01 -o tweets.json

2.2.2 Examples of advanced queries

You can use any advanced query Twitter supports. An advanced query should be placed within quotes, so that twitterscraper can recognize it as one single query.

Here are some examples:

  • search for the occurrence of 'Bitcoin' or 'BTC': twitterscraper "Bitcoin OR BTC" -o bitcoin_tweets.json -l 1000
  • search for the occurrence of 'Bitcoin' and 'BTC': twitterscraper "Bitcoin AND BTC" -o bitcoin_tweets.json -l 1000
  • search for tweets from a specific user: twitterscraper "Blockchain from:VitalikButerin" -o blockchain_tweets.json -l 1000
  • search for tweets to a specific user: twitterscraper "Blockchain to:VitalikButerin" -o blockchain_tweets.json -l 1000
  • search for tweets written from a location: twitterscraper "Blockchain near:Seattle within:15mi" -o blockchain_tweets.json -l 1000

You can construct an advanced query on Twitter Advanced Search, or use one of the operators shown on that page. Also see Twitter's standard search operators.

2.2.3 Examples of scraping user pages

You can also scrape all tweets written or retweeted by a specific user. This is done by adding the boolean -u / --user argument. If this argument is used, the search term should be equal to the username.

Here is an example of scraping a specific user:

twitterscraper realDonaldTrump --user -o tweets_username.json

This does not work in combination with -p, -bd, or -ed.

The main difference from the example "search for tweets from a specific user" in section 2.2.2 is that this method really scrapes all tweets from a profile page (including retweets), while the example in 2.2.2 scrapes the results from the search page (excluding retweets).

2.3 From within Python

You can easily use TwitterScraper from within Python:

from twitterscraper import query_tweets

if __name__ == '__main__':
    list_of_tweets = query_tweets("Trump OR Clinton", 10)

    # Print the retrieved tweets to the screen:
    for tweet in list_of_tweets:
        print(tweet)

    # Or save the retrieved tweets to a file:
    with open('output.txt', 'w', encoding='utf-8') as output:
        for tweet in list_of_tweets:
            output.write(tweet.text + '\n')

2.3.1 Examples of Python Queries

  • Query tweets from a given URL:
    Parameters:
    • query: The search query part of the URL
    • lang: Language of the queried URL
    • pos: Position in the result stream from which to continue scraping
    • retry: Number of times to retry on error
    query_single_page(query, lang, pos, retry=50, from_user=False, timeout=60)
  • Query all tweets that match query (a sketch using this signature follows this list):
    Parameters:
    • query: The query search parameter
    • limit: Number of tweets returned
    • begindate: Start date of query
    • enddate: End date of query
    • poolsize: Number of parallel processes to use
    • lang: Language of query
    query_tweets('query', limit=None, begindate=dt.date.today(), enddate=dt.date.today(), poolsize=20, lang='')
  • Query tweets from a specific user:
    Parameters:
    • user: Twitter username
    • limit: Number of tweets returned
    query_tweets(user, limit=None)
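
Putting the query_tweets signature above to work, here is a minimal sketch that scrapes a date-bounded query and saves a few fields per tweet as JSON (the attribute names user, timestamp and text are assumptions based on the output example in section 3):

import datetime as dt
import json

from twitterscraper import query_tweets

if __name__ == '__main__':
    tweets = query_tweets("Bitcoin", limit=200,
                          begindate=dt.date(2017, 1, 1),
                          enddate=dt.date(2017, 6, 1),
                          poolsize=20, lang='en')

    # Attribute names follow the output example in section 3 and may
    # differ between versions of twitterscraper.
    with open('bitcoin_tweets.json', 'w', encoding='utf-8') as f:
        json.dump([{'user': t.user,
                    'timestamp': str(t.timestamp),
                    'text': t.text} for t in tweets],
                  f, ensure_ascii=False)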

2.4 Scraping for retweets

A regular search within Twitter will not show you any retweets, so the output of TwitterScraper will not contain any retweets either.

To give an example: if user1 has written a tweet containing #trump2020 and user2 has retweeted this tweet, a search for #trump2020 will only show the original tweet.

The only way you can scrape for retweets is if you scrape for all tweets of a specific user with the -u / --user argument.

2.5 Scraping for User Profile information

By adding the argument --profiles, twitterscraper will, in addition to the tweets, also scrape the profile information of the users who have written these tweets. The results will be saved in the file "userprofiles<filename>".

Try not to use this argument too much. If you have already scraped profile information for a set of users, there is no need to do it again :) It is also possible to scrape for profile information without scraping for tweets. Examples of this can be found in the examples folder.
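
For example, to scrape tweets together with the profile information of their authors in one run (query and filename are just placeholders):

twitterscraper Trump --limit 100 --profiles --output=tweets.json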

3. Output

All of the retrieved Tweets are stored in the indicated output file. The contents of the output file will look like:

[{"fullname": "Rupert Meehl", "id": "892397793071050752", "likes": "1", "replies": "0", "retweets": "0", "text": "Latest: Trump now at lowest Approval and highest Disapproval ratings yet. Oh, we're winning bigly here ...\n\nhttps://projects.fivethirtyeight.com/trump-approval-ratings/?ex_cid=rrpromo\u00a0\u2026", "timestamp": "2017-08-01T14:53:08", "user": "Rupert_Meehl"}, {"fullname": "Barry Shapiro", "id": "892397794375327744", "likes": "0", "replies": "0", "retweets": "0", "text": "A former GOP Rep quoted this line, which pretty much sums up Donald Trump. https://twitter.com/davidfrum/status/863017301595107329\u00a0\u2026", "timestamp": "2017-08-01T14:53:08", "user": "barryshap"}, (...)
]

3.1 Opening the output file

In order to correctly handle all possible characters in the tweets (think of Japanese or Arabic characters), the output is saved as utf-8 encoded bytes. That is why you could see text like "\u30b1 \u30f3 \u3055 \u307e ..." in the output file.

What you should do is open the file with the proper encoding:

[Example of output with Japanese characters]
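
A minimal sketch of reading the output back with the correct encoding (assuming the default tweets.json output file):

import codecs
import json

# Open the scraper output with an explicit utf-8 encoding so that
# Japanese, Arabic, etc. characters are decoded correctly.
with codecs.open('tweets.json', 'r', 'utf-8') as f:
    tweets = json.load(f)

print(tweets[0]['text'])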

3.1.2 Opening into a pandas dataframe

After the file has been opened, it can easily be converted into a `pandas` DataFrame:

import pandas as pd
df = pd.read_json('tweets.json', encoding='utf-8')
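
From there the usual pandas operations apply; for example, counting tweets per user (the column names follow the output example in section 3):

import pandas as pd

df = pd.read_json('tweets.json', encoding='utf-8')

# Make sure timestamps are datetimes, then count tweets per user.
df['timestamp'] = pd.to_datetime(df['timestamp'])
print(df.groupby('user').size().sort_values(ascending=False).head())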

twitterscraper's People

Contributors

0xmilly, adtac, adupuis2, attalakheireddine, b3ql, bizso09, calclavia, cenguix, danp1925, dzautner, educatorsrlearners, haidyi, hdnl, im-n1, isaacimholt, kanihal, linqlover, nearlyeveryone, nukopy, patrickdundas, petrbel, rachadabichahine, samirchar, sils, taspinar, twollnik, wildgarden, yitongl, ylijokic, yvelkram


twitterscraper's Issues

FakeUserAgentError: Error occurred during getting browser

Running twitterscraper, I ran into this error using the example given in the readme twitterscraper Trump%20since%3A2017-01-03%20until%3A2017-01-04 -o tweets.json

I was running a version from March and then upgraded to the latest master.zip but I still got the same error... Any ideas on how to resolve this? I'm running Ubuntu 16.04...

Traceback (most recent call last):
  File "/usr/local/bin/twitterscraper", line 9, in <module>
    load_entry_point('twitterscraper==0.3.1', 'console_scripts', 'twitterscraper')()
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 542, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 2569, in load_entry_point
    return ep.load()
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 2229, in load
    return self.resolve()
  File "/usr/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 2235, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "build/bdist.linux-x86_64/egg/twitterscraper/__init__.py", line 13, in <module>
  File "build/bdist.linux-x86_64/egg/twitterscraper/query.py", line 14, in <module>
  File "/usr/local/lib/python2.7/dist-packages/fake_useragent/fake.py", line 139, in __getattr__
    raise FakeUserAgentError('Error occurred during getting browser')  # noqa
fake_useragent.errors.FakeUserAgentError: Error occurred during getting browser

Python 3 support

Would be nice to be able to use this in python 3.

(pythonclock.org :))

Issues with since and until in commandline

twitterscraper "%24PEP"%20since%3A2017-10-05 -o pep.out

this works, but when running it

twitterscraper "%24PEP"%20since%3A2017-10-05%20until%3A2017-10-05 -o pep.out

it doesn't work.

I.e., I want to limit the results to only one single day, but it won't work.

ImportError: No module named 'tweet'

I get the following error when trying to use this.
Installed in a venv via pip

Traceback (most recent call last):
  File "collector.py", line 1, in <module>
    import twitterscraper
  File "/home/m0hawk/Documents/dev/TUHH/testvenv/lib/python3.5/site-packages/twitterscraper/__init__.py", line 13, in <module>
    from twitterscraper.query import query_tweets
  File "/home/m0hawk/Documents/dev/TUHH/testvenv/lib/python3.5/site-packages/twitterscraper/query.py", line 14, in <module>
    from tweet import Tweet
ImportError: No module named 'tweet'

Use proper dates

We're currently extracting human-readable timestamps; however, there is a data-time-ms property in the span (inside the a element) which contains the exact time: <span class="_timestamp js-short-timestamp " data-aria-label-part="last" data-time="1476057559" data-time-ms="1476057559000" data-long-form="true">Oct 9</span>. Parsing the human-readable string into proper date objects is almost impossible: it sometimes contains AM/PM, sometimes not, sometimes dots here and there, and occasionally I get localized months.
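
Converting that attribute once scraped is straightforward; a sketch (data-time-ms is milliseconds since the epoch, per the HTML snippet above):

from datetime import datetime

data_time_ms = 1476057559000  # value of the data-time-ms attribute above
print(datetime.utcfromtimestamp(data_time_ms / 1000.0))  # 2016-10-09 23:59:19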

Encoding issue when applying the script to non-English language

Dear author,
Thanks very much for your kind work! I am a beginner at Python programming and hope this will not trouble you too much.
The problem is that I am applying the script to mine non-English text (via the Twitter advanced search page), such as "戦う", but non-English text in the output file is always displayed as escaped bytes like "\xe7\x8e\xb2\xe9 ...".
Even when typing the command "print(tweet.text.encode('utf-8'))" (or with another encoding), the output is still the same.
I am wondering if there are specific measures to display the non-English text correctly?
Thanks!
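
One hedged workaround (assuming Python 2, which the escaped byte strings suggest, and that tweet.text is a unicode string) is to write the decoded text to a utf-8 encoded file instead of relying on the console:

# -*- coding: utf-8 -*-
import io

from twitterscraper import query_tweets

# io.open with an explicit encoding writes the characters themselves,
# avoiding the escaped "\xe7\x8e..." representation of byte strings.
with io.open('tweets.txt', 'w', encoding='utf-8') as f:
    for tweet in query_tweets(u'戦う', limit=20):
        f.write(tweet.text + u'\n')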

Missing the Output JSON file ...

This question might sound silly, but I am able to use TwitterScraper successfully (with the command twitterscraper "" --output=tweets.json), yet I am unable to retrieve my JSON file. Logging shows that data is being collected, for example:
INFO: Got 137 tweets (20 new).
INFO: Got 157 tweets (20 new).
INFO: Got 177 tweets (19 new).
INFO: Got 196 tweets (19 new).
INFO: Got 215 tweets (17 new).
INFO: Got 232 tweets (20 new).
INFO: Got 252 tweets (19 new).
Specifying the exact path /Users/blahblah/tweets.JSON did not make a difference.
What am I missing? Thanks for your help in advance,

Source parameter is not passed accurately to the script

When running twitterscraper from command line, the source parameter is not accurately passed to the script if used with apostrophe.
Example:
#news AND source:"Twitter for Android"
twitterscraper %23news%20AND%20source%3A"Twitter%20for%20Android" --output=tweets_new_Android.json

tweets_new_Android.json is empty, but https://twitter.com/search?q=%23news%20AND%20source%3A%22Twitter%20for%20Android%22&src=typd shows results.
it works for sources without apostrophe:
#news AND source:"Tweetdeck"
twitterscraper %23news%20AND%20source%3A"Tweetdeck" --output=tweets_new_Tweetdeck.json

Likes, Retweets, Replies not being parsed if (> 999)

Tweet data for these fields is not being properly parsed if the values exceed 999.

I suspect that it relates to the fact that Twitter displays those values with letters in them, e.g., "1.1k" instead of 1100.

In any case, Twitterscraper returns those values as 0.
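
If you need to post-process such values yourself, here is a sketch of converting Twitter's abbreviated counts (the suffix set is an assumption):

def parse_count(value):
    # Convert strings like '1.1k' or '2M' to integers; plain digits pass through.
    value = value.strip().replace(',', '')
    multipliers = {'k': 1000, 'm': 1000000}
    suffix = value[-1].lower()
    if suffix in multipliers:
        return int(float(value[:-1]) * multipliers[suffix])
    return int(value)

print(parse_count('1.1k'))  # 1100
print(parse_count('327'))   # 327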

How to scrape users' tweets

I'm trying to extract specified users' tweets.
By using this command line: twitterscraper Trump --limit 100 --output=tweets.json

it just extracts all tweets in which the person's name is mentioned, instead of the user's own tweets.

My question is: how can I extract all of a specified user's tweets?
Thank you...

control-C Does not seem to stop Parser execution or save results

If I Ctrl+C out of the command line execution, the program does not seem to save its results anywhere. The program also continues its execution with a second iteration, which is not always desired. I ran a large search last night on separate machines, and neither of them saved their search data when Ctrl+C was used.

AttributeError on module requests

Hello @taspinar

I just found a bug while scraping tweets. When my connection is unstable, I get an error message like the following:

ERROR:root:An unknown error occurred! Returning tweets gathered so far.
Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\twitterscraper\query.py", line 93, in query_tweets_once
    pos is None
  File "C:\Python27\lib\site-packages\twitterscraper\query.py", line 53, in query_single_page
    except requests.exception.ConnectionError as e:
AttributeError: 'module' object has no attribute 'exception'

Solved; I just needed to upgrade.

Add option to query by language

Given that this is just a parameter in the Twitter API, it should be easy to do, and it is frustrating that it isn't already available.

Number of tweets in final JSON file much smaller than reported during run

So I ran the scraper for a tweeting period of around a year, with a limit of 40,000:

twitterscraper "%23bitcoin AND %23bubble since%3A2016-09-01 until%3A2017-10-10&src=typd" -l 40000 -o bitcoinbubble.json

While running, it counted all the way up to 40 thousand:
INFO: Got 39953 tweets (18 new).
INFO: Got 39971 tweets (19 new).
INFO: Got 39990 tweets (17 new).
INFO: Got tweets ranging from 2017-09-08 to 2017-10-09

But when I load the JSON file, it only contains 1528 tweets. What explains this?

Add ability to output to stdout rather than output to file

Reading the stdout of a command is much more efficient when handling a lot of requests than taxing the server's memory by creating many JSON output files. I believe that an option to print results to the console as stdout, rather than writing them to a file, would be a great feature that would expand the ways people can use this project.

More attributes

Is it possible to get more attributes, like the number of retweets, replies, and favorites? This is a feature request, I guess.

UTF-8 self.writer.writerow(post) issue

Hello !

I'm trying to scrape every tweet from an account. My script is quite simple:

#!/usr/bin/env python
# encoding: utf-8

from twitterscraper import TwitterScraper

topic = ""
cible = "username"
filename = 'username_tweets.csv'
scraper = TwitterScraper.Scraper(topic, 21000, authors=cible, filename=filename)
scraper.scrape()

It works for hundreds of tweets, but then I get this error:

Traceback (most recent call last):
  File "myscript.py", line 10, in <module>
    scraper.scrape()
  File "/usr/local/lib/python2.7/dist-packages/twitterscraper/TwitterScraper.py", line 148, in scrape
    self.write(post)
  File "/usr/local/lib/python2.7/dist-packages/twitterscraper/TwitterScraper.py", line 136, in write
    self.writer.writerow(post)

(Yes, I'm using Python 2.7; I don't know if the problem comes from there or not.)

Thanks in advance

returning usernames, not tweets

import twitterscraper as ts
usr = 'kingjames'
for tweet in ts.query_tweets(usr, 10)[:10]:
    print(tweet.user.encode('utf-8'))
#out:
b'Rypuur'
b'Powperezdiez'
b'joey_a_george'
b'mikey_rakkar'
b'yarapgv'
b'V_Nasty10'
b'downtownbrownxx'
b'DeclanJoyce'
b'atnissaa'
b'WestifiedMJ'

Basic Example/Documentation in Python file

Hi there,

It would be great, if there was a basic example that does the following in a Python script:

  1. set a query for a specific time
  2. save the data on the hard drive

Probably this is just me being new to Python, but a general documentation with a brief description for each functionality would also be nice.

Thanks in advance!

JSONify tweets properly

The namedtuple just jsonifies tweets as tuples; it would be better to be more dict-like and have the member names as keys in the outputted JSON.
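
Assuming Tweet really is a namedtuple as described, a sketch of the fix:

import json
from collections import namedtuple

# Stand-in for the scraper's Tweet type; the real class has more fields.
Tweet = namedtuple('Tweet', ['id', 'user', 'text'])
tweet = Tweet('892397793071050752', 'Rupert_Meehl', 'Latest: ...')

# _asdict() turns a namedtuple into a dict, so the member names become
# JSON keys instead of being lost in a plain array.
print(json.dumps(tweet._asdict()))
# {"id": "892397793071050752", "user": "Rupert_Meehl", "text": "Latest: ..."}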

Zero Result

Hello @taspinar

Recently I ran twitterscraper from my command line:

C:\Python27\Scripts\twitterscraper Telkomsel -o tweets.json

Unfortunately, it returns zero results. But if I add another keyword, like Telkomsel mengecewakan, it returns tweets related to the keyword.

On the other hand, if I write

C:\Python27\Scripts\twitterscraper Trump -o tweets.json it runs very well.

Why does this happen?

This is weird; I checked Telkomsel on Twitter, and sometimes it reloads and sometimes it gets stuck entirely. Is it part of a Twitter bug?

Advanced query example

Hello @taspinar

I'm new at programming.

Could you please give an example of an advanced query, in particular scraping by location and a specific time?

Thank you

Advanced query

Docu gives:
"You can use any advanced query twitter supports. Simply compile your query at https://twitter.com/search-advanced."

Let's say I try to get all tweets from user 'username'.
I get the URL https://twitter.com/search?f=tweets&q=from%3Ausername&src=typd
Which part (if not the whole URL) is the query?

Can't get data earlier than 12 days ago

Hi Taspinar and Sils,

I was collecting last year's movie data today, and it seems the date issue is occurring again: I cannot get data earlier than 12 days ago :( and I have tried many times. It's as if some sort of notification enabled Twitter to know I was trying to go back further than 12 days. How can I solve this problem?

Thank you so much!

Unknown Error

while running TwitterScraper "test" --output tweets.json --all for ~10 minutes

ERROR: An unknown error occurred! Returning tweets gathered so far.
Traceback (most recent call last):
  File "/home/lasse/prog/tie/twitterscraper/twitterscraper/query.py", line 96, in query_tweets_once
    pos is None
  File "/home/lasse/prog/tie/twitterscraper/twitterscraper/query.py", line 46, in query_single_page
    tweets = list(Tweet.from_html(html))
  File "/home/lasse/prog/tie/twitterscraper/twitterscraper/tweet.py", line 34, in from_html
    yield cls.from_soup(tweet)
  File "/home/lasse/prog/tie/twitterscraper/twitterscraper/tweet.py", line 19, in from_soup
    user=tweet.find('span', 'username').text[1:],
AttributeError: 'NoneType' object has no attribute 'text'

twitterscraper: Command not found

So whenever I am trying to run this command on my server, it's saying "Command not found". I have installed it in my home directory. Please help. Any help would be appreciated.

except urllib2.HTTPError, e: (invalid syntax)

Hello again !

I've just tried my previous script in Python 3, and immediately got this error:

File "myscript.py", line 4, in
from twitterscraper import TwitterScraper
File "/usr/local/lib/python3.4/dist-packages/twitterscraper/TwitterScraper.py", line 109
except urllib2.HTTPError, e:
^
SyntaxError: invalid syntax

Maybe it's a naive alternative, but I recently discovered requests and found that module more powerful than urllib. Here is a scraping example with requests!

Scrape tweet url

It seems to be nestled in data-permalink-path; should be an easy scrape.

Inconsistent results among multiple runs

I am using twitterscraper to get the replies to some twitter accounts.

I am running the following queries as a test:

to%3Amatteorenzi%20since%3A2017-08-21%20until%3A2017-08-27
to%3Amatteosalvinimi%20since%3A2017-08-21%20until%3A2017-08-27

When performing multiple runs I get a different number of results each time, as shown below: the left number is the result of the first query and the right one of the second. Each line is a different run.

544, 4216
386, 4121
295, 4180

Why does this happen? Is there any way I can prevent it?

Error in example code (Readme.md)

Taspinar and Sils, nice job!

A little issue in Readme.md usage example: "print(tweet.username)" should be changed to "print(tweet.user)"

Geo data?

Is there any way to scrape geo data without using the API? This isn't an issue, it's more of a question. I've been searching for a while and I can't seem to find anything.

install error

It installed correctly via pip or from source, but when trying to use the CLI or the Python shell I get this:

from twitterscraper import query_tweets
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "twitterscraper/__init__.py", line 13, in <module>
    from twitterscraper.query import query_tweets
  File "twitterscraper/query.py", line 10, in <module>
    from twitterscraper.tweet import Tweet
  File "twitterscraper/tweet.py", line 3, in <module>
    from bs4 import BeautifulSoup
  File "/usr/local/lib/python2.7/dist-packages/bs4/__init__.py", line 30, in <module>
    from .builder import builder_registry, ParserRejectedMarkup
  File "/usr/local/lib/python2.7/dist-packages/bs4/builder/__init__.py", line 314, in <module>
    from . import _html5lib
  File "/usr/local/lib/python2.7/dist-packages/bs4/builder/_html5lib.py", line 70, in <module>
    class TreeBuilderForHtml5lib(html5lib.treebuilders._base.TreeBuilder):
AttributeError: 'module' object has no attribute '_base'

i have the twitter api installed also.

Limit is inconsistent with -l flag

I have been running the following command:
twitterscraper trump -l 3 -o tweets.json, which I figured would limit the number of tweets to 3, according to the documentation.

Why is it that -l is not limiting the tweet download to just 3? I'm assuming this is not intended behavior. I have also tested this with -l at higher integers, and when set to -l 30, it always downloads 40 tweets.

I'm thinking that this behavior is caused by new tweets being tweeted as the scraper is running? Twitter briefly explains this in this article: https://developer.twitter.com/en/docs/tweets/timelines/guides/working-with-timelines
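
Since tweets are retrieved in whole batches of 20 (see the --limit documentation in section 2.2), one workaround is to slice the saved results afterwards; a sketch using the JSON output below:

import json

with open('tweets.json') as f:
    tweets = json.load(f)

# Enforce the exact limit after the fact.
tweets = tweets[:3]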

The output of tweets.json is the following when using --limit 3 (contains 20 tweets):

[{"timestamp": "2017-11-02T18:26:36", "text": "trump owns it now since he gutted the subsidies.", "user": "MoOkonski", "retweets": "0", "replies": "0", "fullname": "Maureenski", "id": "926153585397780480", "likes": "0"}, {"timestamp": "2017-11-02T18:26:36", "text": "Congress, impeach Trump or resign \u2026http://makeamericagreatagainreally.blogspot.com/2017/10/the-workings-of-donald-j-trumps-mind.html\u00a0\u2026 #Congress #impeachmentpic.twitter.com/lQz5q6ZW5Z", "user": "THIRDSTONE56", "retweets": "0", "replies": "0", "fullname": "THIRD STONE", "id": "926153585750085632", "likes": "0"}, {"timestamp": "2017-11-02T18:26:36", "text": "#trump ahora es un asesino tambi\u00e9n.", "user": "rikrdotc", "retweets": "0", "replies": "0", "fullname": "Ricardo C", "id": "926153585800482817", "likes": "0"}, {"timestamp": "2017-11-02T18:26:36", "text": "Donna Brazile: I found 'proof' the DNC rigged the nomination for Hillary Clinton #DrainTheSwamp #Trump POTUS http://www.foxnews.com/politics/2017/11/02/donna-brazile-found-proof-dnc-rigged-nomination-for-hillary-clinton.html\u00a0\u2026", "user": "DavidDoright", "retweets": "0", "replies": "0", "fullname": "D.W.Trump\u00a0\ud83c\uddfa\ud83c\uddf8", "id": "926153586098294785", "likes": "0"}, {"timestamp": "2017-11-02T18:26:36", "text": "Trump to press for end to North Korea nuclear program on Asia trip: White House http://ift.tt/2z9xKoh\u00a0", "user": "BreakingNewss3", "retweets": "0", "replies": "0", "fullname": "Breaking News", "id": "926153586958053376", "likes": "0"}, {"timestamp": "2017-11-02T18:26:36", "text": "Nixon used his China trip as distraction to investigations of him. Trump going to Asia; echoes of the same or misdirect to a deeper issue.", "user": "TalkinToU", "retweets": "0", "replies": "0", "fullname": "TalkinToU", "id": "926153587268263936", "likes": "0"}, {"timestamp": "2017-11-02T18:26:38", "text": "George Papadopoulos was much more than what Trump says he was. https://twitter.com/SethAbramson/status/925923595079045120\u00a0\u2026", "user": "Resistacat", "retweets": "0", "replies": "0", "fullname": "Dee Ramee", "id": "926153592427466753", "likes": "0"}, {"timestamp": "2017-11-02T18:26:38", "text": "Mysterious Trump backer Mercer stepping down at fund, selling Breitbart stake. #Trump #Breibarthttps://www.cnbc.com/2017/11/02/billionaire-trump-backer-robert-mercer-to-step-down-from-hedge-fund.html\u00a0\u2026", "user": "PSuiteNetwork", "retweets": "0", "replies": "0", "fullname": "John Cutler", "id": "926153593635459072", "likes": "0"}, {"timestamp": "2017-11-02T18:26:38", "text": "This is far from over. Wait for it. And the collusion won't be over the election it will be over Trump's shady business dealings in Russia", "user": "HarryJoachim", "retweets": "0", "replies": "0", "fullname": "Harry Joachim", "id": "926153594939871234", "likes": "0"}, {"timestamp": "2017-11-02T18:26:38", "text": "Donna Brazil confession: Trump & Bernie were right, the DNC rigged the nomination for Hillary, big league!!\n\nhttps://townhall.com/tipsheet/guybenson/2017/11/02/donna-brazile-trump-and-bernie-were-right-the-dnc-rigged-it-for-hillary-big-league-n2403847\u00a0\u2026", "user": "LovToRideMyTrek", "retweets": "0", "replies": "0", "fullname": "BOYCOTT HOLLYWOOD\u00a0\ud83c\udf83", "id": "926153595346616327", "likes": "0"}, {"timestamp": "2017-11-02T18:26:39", "text": "Why Harry Belafonte's Warning About Trump Is Important Now More Than Ever. 
Read here: http://allthat.tv/posts/why-harry-belafonte-s-warning-about-trump-is-important-now-more-than-ever\u00a0\u2026", "user": "ArmChairPundt", "retweets": "0", "replies": "0", "fullname": "Lachelle", "id": "926153596340649984", "likes": "0"}, {"timestamp": "2017-11-02T18:26:39", "text": "Thank Trump for that", "user": "DennisG_Shea", "retweets": "0", "replies": "0", "fullname": "Dennis Shea", "id": "926153596567203841", "likes": "0"}, {"timestamp": "2017-11-02T18:26:39", "text": "GREAT AGAIN: POTUS Trump Announces $100 Billion Company\u2019s Return To USA (VIDEO)\nhttps://goo.gl/SaF4Us\u00a0\n\nNovember 2, 2017\nby Joshua ...pic.twitter.com/dL0nG1oOT8", "user": "warfarenews", "retweets": "0", "replies": "0", "fullname": "Warfare Web", "id": "926153596629950464", "likes": "0"}, {"timestamp": "2017-11-02T18:26:39", "text": "Time To Turn The Channel. I Can Only Handle So Much In One Day Of Trump & The Counterfeit Assholes Surrounding Him! LIES-LIES-LIES!!", "user": "Brokenknee1Jim", "retweets": "0", "replies": "0", "fullname": "James", "id": "926153597427113984", "likes": "0"}, {"timestamp": "2017-11-02T18:26:39", "text": "Trump doesn\u2019t really want you to know Obamacare enrollment just started -- By @svdate https://www.huffingtonpost.com/entry/trump-obamacare-enrollment_us_59fa3adfe4b01b47404810d0?ncid=engmodushpmg00000004\u00a0\u2026 via @HuffPostPol", "user": "michaellamperd", "retweets": "0", "replies": "0", "fullname": "Mick", "id": "926153597489881088", "likes": "0"}, {"timestamp": "2017-11-02T18:26:39", "text": "The Trump-Russia dossier cost $168,000, not $12 million, like president claimed http://www.newsweek.com/trump-dossier-cost-millions-699816\u00a0\u2026", "user": "XtyMiller", "retweets": "0", "replies": "0", "fullname": "Kilikina", "id": "926153598140014592", "likes": "0"}, {"timestamp": "2017-11-02T18:26:39", "text": "MOMENTS AGO: Pres. Trump: \"Congress must end chain migration so that we can have a system that is security based, not the way it is now.\"...", "user": "The_News_Corner", "retweets": "0", "replies": "0", "fullname": "Ok", "id": "926153598202912769", "likes": "0"}, {"timestamp": "2017-11-02T18:26:39", "text": "Trump Is Quietly Deregulating All the Things | Brittany Hunter https://fee.org/articles/trump-is-quietly-deregulating-all-the-things/\u00a0\u2026 via @feeonline", "user": "badcraigsnews", "retweets": "0", "replies": "0", "fullname": "Badcraigsnews", "id": "926153598433660928", "likes": "0"}, {"timestamp": "2017-11-02T18:26:39", "text": "Trump to press for end to North Korea nuclear program on Asia trip: White House http://ift.tt/2z7GXgZ\u00a0", "user": "_politic_us_", "retweets": "0", "replies": "0", "fullname": "Audrey", "id": "926153598437818368", "likes": "0"}, {"timestamp": "2017-11-02T18:26:39", "text": "House Democrats file lawsuit over access to Trump hotel documents - Politico https://www.politico.com/story/2017/11/02/trump-hotel-documents-lawsuit-244455\u00a0\u2026", "user": "PS641600", "retweets": "0", "replies": "0", "fullname": "PeterS", "id": "926153598446301184", "likes": "0"}]

Syntax error: invalid character in identifier line 9

Successfully installed twitterscraper in Python 3.6, but I get the above message from the CMD prompt, indicating a problem with the filename.
I think it is because there is no data to store in the file, as there is nothing on screen from the "print(tweet)" in line 6.

Please help (python novice)

John
