minimaxir / facebook-page-post-scraper

Data scraper for Facebook Pages, and also code accompanying the blog post How to Scrape Data From Facebook Page Posts for Statistical Analysis

Python 100.00%

Facebook Page Post Scraper

UPDATE December 2017: Due to a bug on Facebook's end, using this scraper will only return a very small subset of posts (5-10% of posts) over a limited timeframe. Since Facebook now owns CrowdTangle, the (paid) canonical source of historical Facebook data, Facebook doesn't have an incentive to fix the linked bug.

On December 12th, a Facebook engineer commented that they are developing a new endpoint for scraping posts chronologically. I will refactor this script once that happens. Until then, there likely will not be any PRs accepted.

A tool for gathering all the posts and comments of a Facebook Page (or Open Facebook Group) and related metadata, including post message, post links, and counts of each reaction on the post. All this data is exported as a CSV, able to be imported into any data analysis program like Excel.

The purpose of the script is to gather Facebook data for semantic analysis, which is greatly helped by the presence of high-quality Reaction data. Here's a quick example of a potential Facebook Reaction data visualization using data from CNN's Facebook page:

Usage

Scrape Posts From Public Page

The Page data scraper is implemented as a Python 2/3 script in get_fb_posts_fb_page.py. At the beginning of the file, fill in the App ID and App Secret of a Facebook app you control (I strongly recommend creating an app just for this purpose) and the Page ID of the Facebook Page you want to scrape. Then cd into the directory containing the script and run python get_fb_posts_fb_page.py or python3 get_fb_posts_fb_page.py.

Scrape Posts from Open Group

To get data from an Open Group, use the get_fb_posts_fb_group.py script with the App ID and App Secret filled in the same way. However, the group_id is a numeric ID. For groups without a custom username, the ID will be in the address bar; for groups with custom usernames, to get the ID, do a View Source on the Group Page, search for the phrase "entity_id", and use the number to the right of that field. For example, the group_id of Hackathon Hackers is 759985267390294.

Scrape Comments From Page/Group Posts

To scrape all the user comments from the posts, create a CSV using either of the above scripts, then run the get_fb_comments_from_fb.py script, specifying the Page/Group as the file_id. The output includes the original status_id where the comment is located so you can map the comment to the original Post with a JOIN or VLOOKUP, and also a parent_id if the comment is a reply to another comment.
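The JOIN described above can also be done directly in Python. The following is a minimal sketch, not part of the repository: the helper name join_comments_to_statuses, the sample rows, and every column name other than status_id and parent_id are assumptions made for illustration.

```python
import csv
import io

# Tiny in-memory stand-ins for the two CSV files; the rows are made up.
statuses_csv = io.StringIO(
    "status_id,status_message\n"
    "1_10,First post\n"
    "1_11,Second post\n"
)
comments_csv = io.StringIO(
    "comment_id,status_id,parent_id,comment_message\n"
    "10_1,1_10,,Nice\n"
    "10_2,1_10,10_1,Agreed\n"
    "11_1,1_11,,Hello\n"
)

def join_comments_to_statuses(statuses_file, comments_file):
    """Attach to each comment row the message of the status it belongs to."""
    # Index the statuses by status_id so each lookup is a dictionary access.
    statuses = {row["status_id"]: row for row in csv.DictReader(statuses_file)}
    joined = []
    for comment in csv.DictReader(comments_file):
        status = statuses.get(comment["status_id"], {})
        merged = dict(comment)
        merged["status_message"] = status.get("status_message", "")
        joined.append(merged)
    return joined

rows = join_comments_to_statuses(statuses_csv, comments_csv)
```

With real output, the two io.StringIO objects would be replaced by the opened CSV files produced by the scripts.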

Keep in mind that large pages such as CNN have millions of comments, so be careful! Scraping throughput is approximately 87k comments/hour.

Privacy

This scraper can only scrape public Facebook data which is available to anyone, even those who are not logged into Facebook. No personally-identifiable data is collected in the Page variant; the Group variant does collect the name of the author of the post, but that data is also public to non-logged-in users. Additionally, the script only uses officially-documented Facebook API endpoints without circumventing any rate-limits.

Note that this script, and any variant of this script, cannot be used to scrape data from user profiles, and the Facebook API specifically disallows this use case!

Known Issues

UTF-16 text (CJK) sometimes fails.
GIFs in comments will not appear for an App access_token (it requires a User access_token for no apparent reason).

Maintainer

Max Woolf (@minimaxir)

Max's open-source projects are supported by his Patreon. If you found this project helpful, any monetary contributions to the Patreon are appreciated and will be put to good creative use.

For more information on how the script was originally created, and some tips on how to create similar scrapers yourself, see my blog post How to Scrape Data From Facebook Page Posts for Statistical Analysis.

Credits

Peeter Tintis, whose fork of this repo implements code for finding separate reaction counts per this Stack Overflow answer.

Marco Goldin for the Python 3.5 fork.

License

MIT

If you do find this script useful, a link back to this repository would be appreciated. Thanks!


facebook-page-post-scraper's Issues

Current Py3.x unicode normalizer is flawed

When the script hits emojis and some non-Latin characters, a UnicodeEncodeError is thrown. Using ".encode('UTF-8')" produces a bytes literal, which is written as b'' in the CSV. Currently experimenting with:
import unicodedata

def unicode_normalize(text):
    # Decompose accented characters, then drop anything that is not ASCII
    return unicodedata.normalize('NFKD', text).encode('ASCII', 'ignore').decode('ASCII')
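To make the trade-off of this experimental normalizer concrete, here is a small Python 3 illustration. It is only a sketch of the behavior of the function quoted above: accents on Latin letters are stripped, while characters with no ASCII decomposition (CJK, emoji) are silently dropped. The sample strings are made up.

```python
import unicodedata

def unicode_normalize(text):
    # NFKD splits "e with acute" into "e" plus a combining accent; the ASCII
    # encode with 'ignore' then discards every non-ASCII code point.
    return unicodedata.normalize('NFKD', text).encode('ASCII', 'ignore').decode('ASCII')

accented = unicode_normalize(u'caf\xe9')          # Latin letter with an accent
cjk = unicode_normalize(u'\u4e2d\u6587 ok')       # two CJK characters, then " ok"
emoji = unicode_normalize(u'\U0001F600')          # a single emoji

print(repr(accented), repr(cjk), repr(emoji))
```

Because the non-Latin text is lost rather than preserved, this approach avoids the exception at the cost of the data itself.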

facebook group version ?

Could you also create a group version of this script? There is high demand for scraping all the information in a closed group (by its admins), so using a token is also fine.

Scrap comment of a specific post

In the previous version, there was an option to scrape the comments of a specific post:
#reader = [dict(status_id='167994586682731_804458666369650')]
Is there an equivalent in the latest version? Thank you!
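One way to keep that option is to build the list of statuses either from the CSV or from a single ID. This is a hypothetical sketch: the helper load_statuses is not part of the repository, and it assumes the comment scraper only needs rows that carry a status_id key, as the commented-out line above suggests.

```python
import csv
import io

def load_statuses(csv_file=None, single_status_id=None):
    """Return rows shaped like csv.DictReader output.

    If single_status_id is given, skip the CSV and return one row for it.
    """
    if single_status_id is not None:
        return [dict(status_id=single_status_id)]
    return list(csv.DictReader(csv_file))

# Scrape only one post:
reader = load_statuses(single_status_id='167994586682731_804458666369650')

# Or fall back to every status listed in a CSV (an in-memory example here):
all_rows = load_statuses(csv_file=io.StringIO("status_id\nA_1\nA_2\n"))
```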

Python 3 Support

I am trying to install this on Python 3 and it doesn't seem to be working.
Is this version for Python 3 or 2, or am I installing it incorrectly?

Scraper not getting all posts from a group

Received multiple comments from people hitting this. Investigating.

Can you modify the code for the new Facebook interactions?

Thanks for a great tool... Can you modify the code for the new Facebook interactions (i.e., Love it, Sad, etc.)?

ERROR 404

Scraping cnn Facebook Page: 2017-05-17 20:22:00.719000

HTTP Error 400: Bad Request
Error for URL https://graph.facebook.com/v2.9/cnn/posts/?limit=100&access_token=|&fields=message,link,created_time,type,name,id,comments.limit(0).summary(true),shares,reactions.limit(0).summary(true): 2017-05-17 20:22:06.220000
Retrying.
HTTP Error 400: Bad Request
Error for URL https://graph.facebook.com/v2.9/cnn/posts/?limit=100&access_token=|&fields=message,link,created_time,type,name,id,comments.limit(0).summary(true),shares,reactions.limit(0).summary(true): 2017-05-17 20:22:11.563000
Retrying.
HTTP Error 400: Bad Request

Can anyone please help?
I think the trouble is with the access_token.

I need your help

I got 'https://www.facebook.com/pg/cnn/posts/' from CNN's news page on Facebook, and base_url = "https://graph.facebook.com/v2.9/cnn/posts/?limit=100&access_token=1407412942683680|rKwGpQjFSaW-HVcoBC5gch8XNwU". How do you get it?

I want to do the same for "https://www.facebook.com/search/posts/?q=hong%20kong" (Hong Kong posts on Facebook) and end up with its base_url. Would you help me? Thank you.

use plotly

Consolidate Python 2 and 3 Versions

Python 3 is more ubiquitous than when the repository was first created. Shouldn't be too bad to implement.

Problem with posts in scandic languages

Hello!

I am getting this error when trying to get the posts from a Norwegian FB page. Could you please help me fix it?

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe5' in position 52: ordinal not in range(128)

I tried to put this comment line at the beginning of my code, but it still does not work.

# -*- coding: utf-8 -*-

Thank you in advance!

Facebook reactions only for Posts and not for Comments

Hi,
it seems that the complete set of reactions is collected only for posts and not for comments. Any reason why?

Best

Massimiliano

Python version and is it functional ?

Hi, what version of Python are you using? It may be my mistake, as I use R and am new to Python, but in version 3.5.1 on Windows I had to reformat the prints by adding parentheses, remove urllib2, and change the exception format to "as e:", and then I ran into:
RESTART: C:\Users\Daniel\Documents\R\facebook-page-post-scraper-master\get_fb_posts_fb_page.py
Traceback (most recent call last):
File "C:\Users\Daniel\Documents\R\facebook-page-post-scraper-master\get_fb_posts_fb_page.py", line 111, in <module>
scrapeFacebookPageFeedStatus(page_id, access_token)
File "C:\Users\Daniel\Documents\R\facebook-page-post-scraper-master\get_fb_posts_fb_page.py", line 81, in scrapeFacebookPageFeedStatus
"status_published", "num_likes", "num_comments", "num_shares"])
TypeError: a bytes-like object is required, not 'str'

My version:
Python 3.5.1 (v3.5.1:37a07cee5969, Dec 6 2015, 01:54:25) [MSC v.1900 64 bit (AMD64)] on win32

I am wondering what I am doing wrong, or whether Facebook broke this functionality again.
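In Python 3, "a bytes-like object is required, not 'str'" is the error csv.writer raises when the output file was opened in binary mode ('wb'), which was the Python 2 idiom. The sketch below shows the Python 3 form, assuming that is the cause here; the function name write_statuses, the column names, and the sample row are made up for illustration.

```python
import csv
import os
import tempfile

def write_statuses(path, rows):
    # Python 3: open in text mode, pass newline='' as the csv docs recommend,
    # and name the encoding so the platform default (ascii, gbk, ...) is not used.
    with open(path, 'w', newline='', encoding='utf-8') as f:
        w = csv.writer(f)
        w.writerow(["status_id", "status_message"])
        w.writerows(rows)

path = os.path.join(tempfile.mkdtemp(), 'statuses.csv')
write_statuses(path, [("1_10", u"bl\xe5b\xe6r \u2022 \u4e2d\u6587")])

# Read the file back with the same settings to confirm the text survived.
with open(path, newline='', encoding='utf-8') as f:
    read_back = list(csv.reader(f))
```

Naming the encoding explicitly should also sidestep UnicodeEncodeError messages that mention a platform codec, since the text is no longer encoded with the system default.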

Unable to scrape entire post list

Not sure if it's an indexing problem in the FB API. Changing the API version does not solve the issue. It only happens on some pages, usually very large ones, and opening the URL with the app token in a browser shows the same missing posts.

RiotGames, Redbull page extract will show the problem with incomplete scraping.

ability to handle dates that are before 1900

Some pages backdate a post to the year the brand was created, like the page pnl.ro, which triggers this error:

Traceback (most recent call last):
File "romania_get_fb_posts_fb_page.py", line 195, in <module>
scrapeFacebookPageFeedStatus(page_id, access_token)
File "romania_get_fb_posts_fb_page.py", line 174, in scrapeFacebookPageFeedStatus
access_token))
File "romania_get_fb_posts_fb_page.py", line 107, in processFacebookPageFeedStatus
'%Y-%m-%d %H:%M:%S') # best time format for spreadsheet programs
ValueError: year=1875 is before 1900; the datetime strftime() methods require year >= 1900
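A possible workaround is to format the timestamp with isoformat(), which does not go through strftime() and therefore has no year >= 1900 restriction. This is a sketch only: the function name format_created_time is made up, the input format string is an assumption about what the Graph API returns, and any timezone shift the script applies is left out.

```python
import datetime

def format_created_time(created_time):
    """Convert a Graph-API-style timestamp to 'YYYY-MM-DD HH:MM:SS'."""
    parsed = datetime.datetime.strptime(created_time, '%Y-%m-%dT%H:%M:%S+0000')
    # isoformat(' ') uses a space instead of 'T' between date and time,
    # giving the same spreadsheet-friendly layout as the strftime call.
    return parsed.isoformat(' ')

old_post = format_created_time('1875-01-01T08:00:00+0000')
recent_post = format_created_time('2016-07-04T06:02:16+0000')
```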

Count reactions by type

Can you count reactions in separate columns by type, so that LIKE, WOW, SAD, etc. each have their own count? Thank you!

Unicode normalize in "get_fb_posts_fb_group.py" prepends 'b'

Everything that is passed to unicode_normalize comes out as "b'[input]'" instead of '[input]' as shown in screen shot

scraping error

Hi, I am getting an error when using the code:

Error for URL https://graph.facebook.com/v2.9/BiskraInfo/posts/?limit=100&access_token=<6436979392819205149>|<FILL IN>&fields=message,link,created_time,type,name,id,comments.limit(0).summary(true),shares,reactions.limit(0).summary(true): 2017-06-29 10:03:35.139865 Retrying. HTTP Error 400: Bad Request

Just another question: how do I get the app_id and app_secret for the FB page that I want to scrape?

Getting all posts from a page.

I'm doing a project analysing data and ideally I want to collect all posts from a Facebook page. But I know the FB graph API has a limit of 100.
I was wondering whether using pagination as in the get_fb_posts_fb_page.py file either:
(1) Ensures all posts going back in time get collected.
Or
(2) Navigates through different pages and gets only the first 100 posts from each page.
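For context on the two options: the limit of 100 applies to each request, and pagination means following the "next" link in each response until there is none. The sketch below shows that loop under stated assumptions. The response shape (data plus paging.next) matches what the code quoted elsewhere in these issues reads, while iter_all_posts, fetch_json, and the fake pages are hypothetical stand-ins so the example runs without network access. It collects whatever the API is willing to return, which may still be incomplete.

```python
def iter_all_posts(first_url, fetch_json):
    """Yield every post, following paging.next until it is absent."""
    url = first_url
    while url:
        page = fetch_json(url)
        for post in page.get('data', []):
            yield post
        # No 'next' key means this was the last batch.
        url = page.get('paging', {}).get('next')

# Fake responses keyed by URL, standing in for real Graph API calls.
FAKE_PAGES = {
    'page1': {'data': [{'id': 'a'}, {'id': 'b'}], 'paging': {'next': 'page2'}},
    'page2': {'data': [{'id': 'c'}], 'paging': {}},
}

all_ids = [post['id'] for post in iter_all_posts('page1', FAKE_PAGES.get)]
```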

Your script is great!!!

Hi Minimaxir,

Thank you so much for the great script that you have created. I have some questions and would highly appreciate some guidance on how to modify the script.

The current script does not support Unicode text (Vietnamese).
How can I limit the scrape to new posts made on a fan page within the last 24 hours?
How can I scrape posts from many fan pages at the same time? I have more than 50 pages to scrape daily, so repeating one process 50 times is time-consuming for me.

Thank you so much,
Quy

Script hangs

I've had this happen a couple of times recently: the script is running, but at some point begins to hang between requests.

While grabbing comments from a page can take several seconds, this lag seems to go on indefinitely, and I end up having to keyboard interrupt things:

File "get_fb_comments_from_fb.py", line 175, in <module>
scrapeFacebookPageFeedComments(file_id, access_token)
File "get_fb_comments_from_fb.py", line 122, in scrapeFacebookPageFeedComments
comment['id'], access_token, 100)
File "get_fb_comments_from_fb.py", line 49, in getFacebookCommentFeedData
data = request_until_succeed(url)
File "get_fb_comments_from_fb.py", line 18, in request_until_succeed
response = urllib2.urlopen(req)
File "/usr/lib/python2.7/urllib2.py", line 154, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 429, in open
response = self._open(req, data)
File "/usr/lib/python2.7/urllib2.py", line 447, in _open
'_open', req)
File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 1241, in https_open
context=self._context)
File "/usr/lib/python2.7/urllib2.py", line 1195, in do_open
h.request(req.get_method(), req.get_selector(), req.data, headers)
File "/usr/lib/python2.7/httplib.py", line 1057, in request
self._send_request(method, url, body, headers)
File "/usr/lib/python2.7/httplib.py", line 1097, in _send_request
self.endheaders(body)
File "/usr/lib/python2.7/httplib.py", line 1053, in endheaders
self._send_output(message_body)
File "/usr/lib/python2.7/httplib.py", line 897, in _send_output
self.send(msg)
File "/usr/lib/python2.7/httplib.py", line 859, in send
self.connect()
File "/usr/lib/python2.7/httplib.py", line 1278, in connect
server_hostname=server_hostname)
File "/usr/lib/python2.7/ssl.py", line 353, in wrap_socket
_context=self)
File "/usr/lib/python2.7/ssl.py", line 601, in __init__
self.do_handshake()
File "/usr/lib/python2.7/ssl.py", line 830, in do_handshake
self._sslobj.do_handshake()

Do you know what could be leading to this sort of behavior?
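The traceback ends inside the SSL handshake of a urlopen call made without a timeout, so one plausible explanation is a stalled connection that never returns. The sketch below passes a timeout and retries a bounded number of times. It is written for Python 3 (urllib2.urlopen in Python 2 accepts a timeout argument as well); request_with_timeout and its parameters are hypothetical and not the repository's request_until_succeed.

```python
import socket
import time
import urllib.error
import urllib.request

def request_with_timeout(url, timeout=30, max_retries=3, pause=5,
                         opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch url, giving up on any single attempt after `timeout` seconds."""
    last_error = None
    for _ in range(max_retries):
        try:
            response = opener(url, timeout=timeout)
            return response.read()
        except (urllib.error.URLError, socket.timeout) as e:
            # A stalled handshake now surfaces as an exception instead of
            # blocking forever, so the request can be retried.
            last_error = e
            sleep(pause)
    raise last_error
```

The opener and sleep parameters exist only so the function can be exercised without real network calls or delays.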

HTTP error 400

Hi, I found your repository through a blogger.

He modified your code in his post (linked below).

But it is not working, and I get HTTP error 400:
(HTTP Error 400: Bad Request
^CTraceback (most recent call last):
File "get_fb_test.py", line 152, in
wan_data, num = fetch_feed()
File "get_fb_test.py", line 104, in fetch_feed
one_json = getFacebookPageFeedData(page_id, access_token, since, until)
File "get_fb_test.py", line 32, in getFacebookPageFeedData
data = json.loads(request_until_suceed(url))
File "get_fb_test.py", line 47, in request_until_suceed
time.sleep(5)
)

I also filled in my app_id, app_secret, and page ID, as well as since and until.

Can you help me?

Below is that blogger's code:

http://dizwe.tistory.com/8

Cutting off scrape after a certain date?

Is it possible to alter the code so that, for example, the scraper stops pulling after a certain period in time? Given that some pages have been around for years, I might only want 2017 and 2016 posts on a page. Is there a way to say "if year is 2015" stop pulling posts?
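One way to do this is to stop consuming statuses as soon as one is older than a cutoff. The sketch below assumes that posts arrive newest first and that created_time starts with a 'YYYY-MM-DDTHH:MM:SS' prefix; take_until_cutoff and the sample statuses are made up for illustration.

```python
import datetime

def take_until_cutoff(statuses, cutoff):
    """Yield statuses until the first one created before `cutoff`."""
    for status in statuses:
        # Keep only the first 19 characters so the timezone suffix is ignored.
        created = datetime.datetime.strptime(
            status['created_time'][:19], '%Y-%m-%dT%H:%M:%S')
        if created < cutoff:
            break  # everything after this point is older, so stop pulling
        yield status

sample = [
    {'id': '1', 'created_time': '2017-03-01T10:00:00+0000'},
    {'id': '2', 'created_time': '2016-06-01T10:00:00+0000'},
    {'id': '3', 'created_time': '2015-12-31T23:59:59+0000'},
    {'id': '4', 'created_time': '2015-01-01T00:00:00+0000'},
]
kept = list(take_until_cutoff(sample, datetime.datetime(2016, 1, 1)))
```

Here only the 2017 and 2016 posts are kept, matching the example in the question.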

TypeError: list indices must be integers, not str

test_results is a list below one:
test_status = getFacebookPageFeedData(page_id, access_token, 1)["data"][0]

Your next function, below, expects this to be a dictionary and throws an error: TypeError: list indices must be integers, not str

def processFacebookPageFeedStatus(status)

limit end date for scraping

Hi,

is there a way to stop scraping at a certain date ?

Comment Scraper NameError: name 'file_id' is not defined

Getting this and not sure why:

Traceback (most recent call last):
File "get_fb_comments_from_fb.py", line 236, in <module>
scrapeFacebookPageFeedComments(file_id, access_token)
NameError: name 'file_id' is not defined

Not sure what I need to define. I ran the group scraper and the .csv is in the same folder

Get Reactions from Facebook Comments

Not seeing anything in the API docs about it yet. Will add functionality once available.

Comment scraper not working?

Scraping the posts using get_fb_posts_fb_page.py went on just fine but when I used get_fb_comments_from_fb.py to scrape the comments, it returned a CSV file with only the headers. Has anyone experienced this as well?

ascii error

How can I solve this error, please?
UnicodeEncodeError: 'ascii' codec can't encode characters in position 40-43: ordinal not in range(128)

It appears when I try to scrape pages that post in Arabic.
I know the solution is to use UTF-8, but I don't know how to implement it in the code.
Could you help me, please?

Broken character with some language(vi)

I tried this amazing script with some Facebook fan pages and it works very well.
But there is a problem when writing out the data. For example, all Vietnamese characters come out like this:

CÃ³ khi cÃ²n khÃ´ng biáº¿t luÃ´n áº¥y chá»© :v

Has anyone else had this issue with languages that use characters outside the basic Latin alphabet? And is there a way to fix it?

carriage return should be escaped if in the content

I found that a "carriage return" should be escaped in the content; otherwise a single row of data is split into several rows.

I can use '\r' to identify the newline, though, so I'm not sure if this is an issue.

Resulting CSV format error

When writing to the CSV file, a row loses its format. For example, the following should be on one line, if I'm not wrong, but it is not:
156915824368931_1104005622993275,"Example :

06:00 bla
06:30 bla
09:30 bla
10:00 bla
10:30 bla
11:40 bla
12:00 bla
13:40 bla
15:30 bla
19:00 bla
20:30 bla
22:30 bla
23:40 bla
00:30 bla
02:00 bla,,status,,2016-07-04 06:02:16,34,27,1,30,0,0,0,0,4

I can write a script that solves this problem (putting everything on one line per entry), but is there any way to solve this when writing to the CSV file?
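For background, a quoted CSV field is allowed to contain line breaks, so a file like this is split across physical lines but is still one logical row for a CSV-aware reader. The sketch below demonstrates this with Python's csv module using a shortened version of the message above; the three-column row is an illustrative subset of the real output.

```python
import csv
import io

message = "Example :\n\n06:00 bla\n06:30 bla"

# Write one row whose second field contains embedded newlines.
buffer = io.StringIO()
csv.writer(buffer).writerow(
    ["156915824368931_1104005622993275", message, "status"])

# Splitting the raw text on line breaks shows several physical lines...
raw_lines = buffer.getvalue().splitlines()

# ...but a CSV parser reassembles them into a single row.
parsed_rows = list(csv.reader(io.StringIO(buffer.getvalue(), newline='')))
```

If one physical line per entry is required, replacing the newlines inside the message before writing is an alternative, at the cost of changing the text.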

Database question

Hello again :)

Do you have any suggestions about how to modify the code from get_fb_posts_fb_page (2.7 version) in order to save the CSV files into a database?

Thank you in advance!

TypeError: the JSON object must be str, not 'bytes'

I have this issue using comment scraper for public pages. I've filled all variables correctly (app_id, app_secret and page id), have run the post scraper before and it finished successfully.

Following you can see the full error log:

$ python3 get_fb_comments_from_fb.py
Scraping <OMMITED> Comments From Posts: 2017-05-21 15:51:37.768667

Traceback (most recent call last):
  File "get_fb_comments_from_fb.py", line 220, in <module>
    scrapeFacebookPageFeedComments(file_id, access_token)
  File "get_fb_comments_from_fb.py", line 147, in scrapeFacebookPageFeedComments
    comments = json.loads(request_until_succeed(url))
  File "/usr/lib/python3.5/json/__init__.py", line 312, in loads
    s.__class__.__name__))
TypeError: the JSON object must be str, not 'bytes'
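On Python 3.5, json.loads() accepts only str, while an HTTP response body is bytes, which matches the error above. A minimal sketch of a fix is to decode before parsing; parse_response is a hypothetical helper, and UTF-8 is assumed as the response encoding.

```python
import json

def parse_response(raw):
    """Parse a JSON body that may arrive as bytes (Python 3) or str."""
    if isinstance(raw, bytes):
        raw = raw.decode('utf-8')
    return json.loads(raw)

# Example with a Portuguese message encoded as UTF-8 bytes.
body = u'{"data": [{"message": "ol\xe1"}]}'.encode('utf-8')
parsed = parse_response(body)
```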

The page I'm scraping has posts and comments written in Brazilian Portuguese (PT-BR).

Differentiate Images and GIFs in comments (if possible)

GIF support in comments is now rolling out, so the script should differentiate the alias (currently, it will output [[IMAGE]] for any image).

Single post scrape

Is it possible to limit the comments scraper to just one post? I tried playing around with the code, but didn't have any luck. I'm trying to avoid having to scrape pages like CNN for hours just for the one post!

Thanks!

Rearchitect Script for 14.4x Speedup in Reactions Scraping

Scraping reactions is relatively slow for large pages (15 minutes for CNN's FB page) and will get worse as time goes by.

For example, when scraping 100 Statuses:

Current Architecture

Return Post Metadata for All 100 Posts [1 HTTP Request]
For each post, Query All Reactions (6) [100 HTTP Requests]

The query occurs during processing of the post so no extra data manipulation is necessary.

Better Architecture

Return Post Metadata for All 100 Posts [1 HTTP Request]
For all queried posts, retrieve a single reaction count. Repeat for each reaction [6 HTTP Requests]

The Reaction output from each of the 6 vectors must be mapped to the corresponding post.

101/7 = 14.4x speedup in HTTP, which is the bottleneck.

The challenge is implementing the mapping in a way that is easy to read. Tracking progress with this issue.
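The mapping step and the request arithmetic can be sketched as follows. This is an illustration of the idea rather than the eventual implementation: merge_reaction_counts, speedup, and the data shapes (one dictionary of post ID to count per reaction) are assumptions.

```python
def merge_reaction_counts(post_ids, counts_by_reaction):
    """Combine per-reaction {post_id: count} results into one dict per post."""
    merged = {post_id: {} for post_id in post_ids}
    for reaction, counts in counts_by_reaction.items():
        for post_id in post_ids:
            # A post missing from a reaction's result has zero of that reaction.
            merged[post_id][reaction] = counts.get(post_id, 0)
    return merged

def speedup(num_posts, num_reactions):
    """Ratio of HTTP requests: (1 + one per post) vs (1 + one per reaction)."""
    current = 1 + num_posts
    better = 1 + num_reactions
    return float(current) / better

merged = merge_reaction_counts(
    ['p1', 'p2'],
    {'like': {'p1': 30, 'p2': 5}, 'angry': {'p1': 4}},
)
ratio = speedup(100, 6)
```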

Grab comments from unpublished page posts

I have access to an unpublished page post, and I would like to grab the names of everyone who liked, shared, or commented on the post. How could I do that?

Add thankful/pride "special" reactions field

Facebook seems addicted to these so should add them. Use Algebra to subtract sum of normal reactions from total reactions.
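The subtraction can be sketched as below. The reaction names in NORMAL_REACTIONS and the function name count_special_reactions are assumptions for illustration, and the counts are made up.

```python
# The six standard reactions; names here are illustrative.
NORMAL_REACTIONS = ('like', 'love', 'wow', 'haha', 'sad', 'angry')

def count_special_reactions(total_reactions, normal_counts):
    """Special reactions = total reactions minus the sum of the normal ones."""
    normal_total = sum(normal_counts.get(r, 0) for r in NORMAL_REACTIONS)
    return total_reactions - normal_total

special = count_special_reactions(
    40, {'like': 30, 'love': 3, 'wow': 0, 'haha': 1, 'sad': 0, 'angry': 4})
```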

Add CLI

Be able to pass parameters from CLI (in addition to current behavior)
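A possible shape for such a CLI, using argparse from the standard library. The flag names are hypothetical, and the access token is built as App ID and App Secret joined by "|", as in the URLs quoted in other issues.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(
        description="Scrape posts from a Facebook Page.")
    parser.add_argument("--page-id", required=True, help="Page to scrape")
    parser.add_argument("--app-id", required=True, help="Facebook App ID")
    parser.add_argument("--app-secret", required=True, help="Facebook App Secret")
    parser.add_argument("--since", default=None, help="Optional start date")
    parser.add_argument("--until", default=None, help="Optional end date")
    return parser

# Parse an explicit argument list here; a script would call parse_args()
# with no arguments to read sys.argv instead.
args = build_parser().parse_args(
    ["--page-id", "cnn", "--app-id", "APP_ID", "--app-secret", "APP_SECRET"])
access_token = args.app_id + "|" + args.app_secret
```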

Comment scraper HTTP Error 400 for statuses with many comments

When scraping comments from statuses with a large number of comments, the Graph API will begin returning a 400 error after a few thousand comments. I've seen this post, but the error often occurs substantially before the 25,000th comment. Does anyone have any suggestions?

To reproduce:

https://graph.facebook.com/v2.9/95475020353_10159243063820354/comments/?limit=100&access_token=[YOUR APP ID]|[YOUR APP SECRET]&after=&after=&after=&after=&after=&after=&after=&after=&after=&after=&after=&after=&after=&after=&after=&after=&after=&after=&after=&after=&after=&after=&after=&after=&after=&after=&after=&after=&after=&after=&after=&after=&after=&after=&after=&after=&after=NTIZD&fields=id,message,reactions.limit(0).summary(true),created_time,comments,from,attachment

How to get message from my page and reply messages and comments ?

Hi minimaxir,

Do you have any idea how to get all the messages from my page, and how I can reply to messages and comments (how to make it work with an FB bot)?

Thanks so much minimaxir,
Have a nice day !

until date issue

Hello Minimaxir,

Line 187 of the Python code raises an error when run with since and until dates.

Code:
if 'paging' in statuses:
    next_url = statuses['paging']['next']
    until = re.search('until=([0-9]*?)(&|$)', next_url).group(1)
    paging = re.search(
        '__paging_token=(.*?)(&|$)', next_url).group(1)

error :

Traceback (most recent call last):
File "Fbpost.py", line 199, in <module>
scrapeFacebookPageFeedStatus(group_id, access_token, since_date, until_date)
File "Fbpost.py", line 187, in scrapeFacebookPageFeedStatus
until = re.search('until=([0-9]*?)(&|$)', next_url).group(1)
AttributeError: 'NoneType' object has no attribute 'group'

Can anybody help me figure out what the issue is when running this new Facebook page Python code?

Thanks
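The AttributeError means re.search() returned None, that is, the next URL did not contain the pattern being searched for. Why the URL lacks it is not clear from the report, but the lookup can at least be guarded. This is a sketch; extract_paging is a hypothetical helper and the example URLs are made up.

```python
import re

def extract_paging(statuses):
    """Return (until, paging_token) from the next URL, or None if unavailable."""
    next_url = statuses.get('paging', {}).get('next')
    if not next_url:
        return None
    until = re.search(r'until=([0-9]*?)(&|$)', next_url)
    token = re.search(r'__paging_token=(.*?)(&|$)', next_url)
    if until is None or token is None:
        # The URL uses a different pagination scheme; let the caller decide.
        return None
    return until.group(1), token.group(1)

found = extract_paging(
    {'paging': {'next': 'https://example.invalid/?until=1495000000&__paging_token=abc'}})
missing = extract_paging(
    {'paging': {'next': 'https://example.invalid/?after=NTIZD'}})
```

A caller can then stop paginating, or fall back to following the next URL directly, when None is returned.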

IndexError: tuple index out of range

17200 Statuses Processed: 2017-05-19 21:59:50.230000

Traceback (most recent call last):
  File "C:/xxxxx/facebook-page-post-scraper/get_fb_posts_fb_group.py", line 187, in <module>
    scrapeFacebookPageFeedStatus(group_id, access_token)
  File "C:/xxxxx/facebook-page-post-scraper/get_fb_posts_fb_group.py", line 183, in scrapeFacebookPageFeedStatus
    (num_processed, datetime.datetime.now() - scrape_starttime)))
IndexError: tuple index out of range

HTTP error 400

I keep getting this error - tried several different pages but no luck. any idea why?

Instead of Page ID make it Post ID

How can I pull the users who interacted with a specific post ID? It's a dark post, so it's not on their Facebook page.

I get an error

Traceback (most recent call last):

File "", line 173, in
scrapeFacebookPageFeedStatus(page_id, access_token)

File "", line 154, in scrapeFacebookPageFeedStatus
w.writerow(status_data + reactions_data)

UnicodeEncodeError: 'gbk' codec can't encode character '\u2022' in position 69: illegal multibyte sequence

GET request to 'nytimes' leads to SSL: CERTIFICATE_VERIFY_FAILED

I am a beginner and this might be a dumb question.
When I make a GET request to nytimes, I get this error:

urllib2.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:590)>

I have tried running my query on Facebook Graph API explorer, but it works perfectly well there.
Is there something I am missing?

How to get real UserId ?

Hi minimaxir,

I ran your scripts and got the userId of people who comment on or like a page, but that userId is specific to that page, not their real userId.
How can I get the real userId of people who like or comment on a public page?

Thanks so much !

Facebook search support?

Hey! Do you think this could be adapted to run through all public posts on some keyword? FB currently at least allows scrolling through them.

E.g. https://www.facebook.com/search/str/search%2Bterm/keywords_search?filters_rp_author=public-feed

Thanks for the app again! :)

Edit: Or maybe with something like this just to decrease the number of posts scraped if looking for something particular.: https://graph.facebook.com/nytimes/search?q=keyword_here&type=post

Link example here: https://www.facebook.com/search/str/politics/keywords_search?filters_rp_author=134486075205