syyunn / google-news-scraper

This project forked from philipperemy/google-news-scraper

Google News Scraper for languages like Japanese, Chinese... [VPN Support]

License: MIT License

Python 100.00%

Google News Scraper - Japanese and Chinese supported

For English articles, Google provides an RSS feed that you can use directly.

Each scraped article has the following fields:

  • title: Title of the article
  • datetime: Publication date
  • content: Full content (text format) - best effort
  • link: URL where the article was published
  • keyword: Google News keyword used to find this article
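
The five fields above can be modelled as a small record type. The sketch below is only an illustration: the `Article` class and the `article_to_json` helper are hypothetical names, not part of the scraper, and the types are assumed from the field descriptions.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Article:
    """Hypothetical container mirroring the five fields listed above."""
    title: str
    datetime: str  # publication date kept as a string, e.g. "2015/11/03"
    content: str   # best-effort full text; may be partial or empty
    link: str
    keyword: str


def article_to_json(article: Article) -> str:
    # ensure_ascii=False keeps Japanese and Chinese text readable in the output.
    return json.dumps(asdict(article), ensure_ascii=False, indent=4, sort_keys=True)
```

Sorting the keys gives the same field order as the output example in this README (content, datetime, keyword, link, title).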

How many articles can I fetch with this scraper?

There is no hard upper bound, but expect on the order of 100,000 articles per day when scraping 24/7 with the VPN enabled.
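
As a rough sanity check, the 100,000 articles per day figure can be converted into a sustained request rate. This is plain arithmetic on the number quoted above, not a measurement of the scraper.

```python
ARTICLES_PER_DAY = 100_000
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

articles_per_second = ARTICLES_PER_DAY / SECONDS_PER_DAY
seconds_per_article = SECONDS_PER_DAY / ARTICLES_PER_DAY

print(f"{articles_per_second:.2f} articles/second")  # 1.16 articles/second
print(f"{seconds_per_article:.3f} seconds/article")  # 0.864 seconds/article
```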

How to get started?

git clone git@github.com:philipperemy/google-news-scraper.git && cd google-news-scraper
virtualenv -p python3 venv && source venv/bin/activate # optional but recommended!
pip install -r requirements.txt
python main_no_vpn.py --keywords hello,toto --language ja  # for VPN support, scroll down!
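
The last command passes a comma-separated keyword list and a language code. A minimal sketch of how such arguments could be parsed with `argparse` is shown below. It is an assumption for illustration only: the `parse_args` function, the help strings, and the `ja` default are hypothetical, and the actual parsing in `main_no_vpn.py` may differ.

```python
import argparse
from typing import List, Optional


def split_keywords(value: str) -> List[str]:
    # "hello,toto" -> ["hello", "toto"]; empty items are dropped.
    return [k.strip() for k in value.split(",") if k.strip()]


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Hypothetical CLI sketch")
    parser.add_argument("--keywords", required=True, type=split_keywords,
                        help="comma-separated Google News keywords")
    parser.add_argument("--language", default="ja",  # default is an assumption
                        help="language code, e.g. ja")
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    print(args.keywords, args.language)
```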

Output example

Article 1

{
    "content": "(本文中の野村証券 [...] 生命経済研の熊野英生氏は指摘。  記事の全文 \n保護主義を根拠とする円高説を信じ込むのは禁物であり、実際は米貿易赤字縮小と円安が進むかもしれないとBBHの村田雅志氏は指摘。  記事の全文 \n",
    "datetime": "2015/11/03",
    "keyword": "米国の銀行業務",
    "link": "http://jp.reuters.com/article/idJPL3N12Y5QX20151104",
    "title": "再送-インタビュー:運用高度化、PEやハイイールド債増やす=長門・ゆうちょ銀社長"
}

Article 2

{
    "content": "記事保存 有料会員の方のみご利用になれます。[...] 詳しくは、こちら 電子版トップ速報トップ アルゼンチン、ドル、通貨ペソ、外貨取引 来春の新入社員を募集 記者など4職種 【週末新紙面】宅配+電子版お試し実施中! 天気 プレスリリース検索 アカウント一覧 訂正・おわび",
    "datetime": "2015/12/17",
    "keyword": "アルゼンチン",
    "link": "http://www.nikkei.com/article/DGXLASGM18H1B_Y5A211C1EAF000/",
    "title": "アルゼンチンの通貨ペソ、大幅下落 対ドルで36%安"
}

NOTE: The content field is truncated in this README.
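
A record like the ones above can be loaded with the standard `json` module. The sketch below assumes each article is available as a JSON string and that `datetime` always uses the `YYYY/MM/DD` form seen in the two examples. The `parse_article` helper and the added `published` key are hypothetical.

```python
import json
from datetime import datetime


def parse_article(raw: str) -> dict:
    """Decode one article and add a parsed `published` date (assumed format)."""
    record = json.loads(raw)
    # "%Y/%m/%d" matches values such as "2015/12/17" from the examples above.
    record["published"] = datetime.strptime(record["datetime"], "%Y/%m/%d").date()
    return record


if __name__ == "__main__":
    sample = '{"datetime": "2015/12/17", "keyword": "アルゼンチン"}'
    print(parse_article(sample)["published"])  # 2015-12-17
```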

VPN

Scraping Google News usually results in a ban for a few hours. Using a VPN with dynamic IP fetching is a way to overcome this problem.

In my case, I subscribed to this VPN: https://www.expressvpn.com/.

I provide a python binding for this VPN here: https://github.com/philipperemy/expressvpn-python.

Also make sure that:

Every time the script detects that Google has banned you, it will request the VPN to get a fresh new IP and will resume.
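
The detect, rotate, and resume behaviour described above can be sketched as a retry loop. This is not the scraper's actual code: `fetch` and `rotate_ip` are hypothetical callables supplied by the caller (for example, a function wrapping the VPN binding), and returning `None` to signal a ban is an assumption made for this example.

```python
import time
from typing import Callable, Iterable, List, Optional


def scrape_with_rotation(
    keywords: Iterable[str],
    fetch: Callable[[str], Optional[list]],
    rotate_ip: Callable[[], None],
    max_retries: int = 3,
    pause_seconds: float = 0.0,
) -> List:
    """Fetch every keyword, requesting a new IP whenever a ban is detected.

    fetch(keyword) must return a list of articles, or None if banned.
    """
    results: List = []
    for keyword in keywords:
        for attempt in range(max_retries + 1):
            articles = fetch(keyword)
            if articles is not None:
                results.extend(articles)
                break
            if attempt < max_retries:
                # Ban detected: get a fresh IP, then retry the same keyword.
                rotate_ip()
                time.sleep(pause_seconds)
        else:
            raise RuntimeError(
                f"still banned after {max_retries} IP rotations: {keyword!r}"
            )
    return results
```

Passing the two callables in keeps the loop independent of any particular VPN provider and makes it easy to test with stubs.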

Questions/Answers

  • Why didn't you use the RSS feed provided by Google News? It does not exist for Japanese!
  • What is the best way to use this scraper? If you want to scrape a lot of data, I highly recommend subscribing to a VPN, preferably ExpressVPN (I implemented the VPN wrapper and its integration with this scraper).

Contributors

philipperemy
