
google-scholar-scrapy-spider

Python Scrapy spider that searches Google Scholar for a given keyword and extracts every result from the search results pages. The spider iterates through all pages returned by the keyword query. It scrapes the following fields from each Google Scholar search result:

  • Title
  • Link
  • Citations
  • Related Links
  • Number of Versions
  • Author
  • Publisher
  • Snippet

This Google Scholar spider uses Scraper API as the proxy solution. Scraper API has a free plan that allows you to make up to 1,000 requests per month, which makes it ideal for the development phase, and it can easily be scaled up to millions of pages per month if need be.

Using the Google Scholar Spider

Make sure Scrapy is installed:

pip install scrapy

Set the keywords you want to search in Google Scholar.

queries = ['airbnb', 'covid-19']

Sign up to Scraper API to get your free API key, which allows you to scrape 1,000 pages per month for free. Enter your API key into the API_KEY variable:

API_KEY = '<YOUR_API_KEY>'

from urllib.parse import urlencode

def get_url(url):
    payload = {'api_key': API_KEY, 'url': url, 'country_code': 'us'}
    proxy_url = 'http://api.scraperapi.com/?' + urlencode(payload)
    return proxy_url
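For illustration, here is what a call to get_url produces: the target URL is percent-encoded into the query string of the Scraper API endpoint (the key below is a placeholder, not a real key):

```python
from urllib.parse import urlencode

API_KEY = 'PLACEHOLDER_KEY'  # substitute your real Scraper API key

def get_url(url):
    # Wrap the target URL in the Scraper API proxy endpoint
    payload = {'api_key': API_KEY, 'url': url, 'country_code': 'us'}
    return 'http://api.scraperapi.com/?' + urlencode(payload)

proxy_url = get_url('https://scholar.google.com/scholar?q=airbnb')
```

Scrapy then fetches proxy_url, and Scraper API fetches the original Google Scholar URL on your behalf.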

By default, the spider is set to a maximum of 5 concurrent requests, as this is the maximum concurrency allowed on Scraper API's free plan. If you have a plan with higher concurrency, make sure to increase CONCURRENT_REQUESTS in settings.py.

## settings.py

CONCURRENT_REQUESTS = 5
RETRY_TIMES = 5

# DOWNLOAD_DELAY
# RANDOMIZE_DOWNLOAD_DELAY

We should also set RETRY_TIMES to tell Scrapy to retry any failed requests (to 5, for example) and make sure that DOWNLOAD_DELAY and RANDOMIZE_DOWNLOAD_DELAY are not enabled, as these lower your concurrency and are not needed with Scraper API.

To run the spider, use:

scrapy crawl scholar -o test.csv

Editing the Google Scholar Spider

The spider has 3 parts:

  1. start_requests - will construct the Google Scholar URL for the search queries and send the request to Google.
  2. parse - will extract all the search results from the Google Scholar search results.
  3. get_url - to scrape Google Scholar at scale without getting blocked we need a proxy solution. This project uses Scraper API, so we need a function that sends each request via their API endpoint.
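As a rough sketch of what start_requests does with the queries list (the helper name scholar_search_url is mine, not the spider's, and the real spider additionally routes each URL through get_url):

```python
from urllib.parse import quote

queries = ['airbnb', 'covid-19']

def scholar_search_url(query):
    # Build the Google Scholar search URL for one keyword
    return 'https://scholar.google.com/scholar?q=' + quote(query)

# One search URL per keyword; each becomes the start of a crawl
urls = [scholar_search_url(q) for q in queries]
```

Each of these URLs is wrapped in a scrapy.Request whose callback is parse, which then handles pagination.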

If you want to scrape more or fewer fields from the search results page, edit the XPath selectors in the parse function:

def parse(self, response):
    print(response.url)
    position = response.meta['position']
    for res in response.xpath('//*[@data-rp]'):
        link = res.xpath('.//h3/a/@href').extract_first()
        temp = res.xpath('.//h3/a//text()').extract()
        if not temp:
            # Citation-only results have no link; mark them with a [C] prefix
            title = "[C] " + "".join(res.xpath('.//h3/span[@id]//text()').extract())
        else:
            title = "".join(temp)
        snippet = "".join(res.xpath('.//*[@class="gs_rs"]//text()').extract())
        cited = res.xpath('.//a[starts-with(text(),"Cited")]/text()').extract_first()
        temp = res.xpath('.//a[starts-with(text(),"Related")]/@href').extract_first()
        related = "https://scholar.google.com" + temp if temp else ""
        num_versions = res.xpath('.//a[contains(text(),"version")]/text()').extract_first()
        published_data = "".join(res.xpath('.//div[@class="gs_a"]//text()').extract())
        position += 1
        item = {'title': title, 'link': link, 'cited': cited, 'relatedLink': related,
                'position': position, 'numOfVersions': num_versions,
                'publishedData': published_data, 'snippet': snippet}
        yield item
    next_page = response.xpath('//td[@align="left"]/a/@href').extract_first()
    if next_page:
        url = "https://scholar.google.com" + next_page
        yield scrapy.Request(get_url(url), callback=self.parse, meta={'position': position})

If you don't want to scrape every page returned for a keyword, comment out the next_page section at the end of the parse function.

