citation_scraper's Introduction

Intro

This is software used to scrape Google Scholar for citations by a particular author. It makes use of ckreibich's scholar.py, with a couple of modifications.

How to use

Setup

  1. You will need Python 3 installed on your computer. Ideally, you will also want virtualenv installed (note: if you have to install virtualenv, make sure you use pip3 instead of pip).

  2. Next, you will need to clone this repo and/or download the zip.

  3. Finally, make and launch a new virtualenv.

    $ virtualenv myvenv  # this will make a directory called myvenv
    $ source myvenv/bin/activate

  4. Install the dependency.

    $ pip3 install beautifulsoup4

Running

Your first line of defence is the help menu. Run

$ python3 citation_scraper.py --help

for details.

The program takes as input a file of author names, one per line. For example, a file named zeppelin.txt might look like this:

Jimmy Page
John Bonham
Robert Plant
John Paul Jones

You must also specify where you want the output to go. Using the example file from above we could run the program as

$ python3 citation_scraper.py zeppelin.txt output.txt
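
For illustration, an input file like zeppelin.txt could be read as follows. This is only a sketch: the function name read_authors is hypothetical, and the actual parsing in citation_scraper.py may differ (for example, in how it treats blank lines).

```python
from pathlib import Path


def read_authors(path):
    """Return the author names in the file at `path`, one per non-empty line."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    # Strip surrounding whitespace and skip blank lines
    return [line.strip() for line in lines if line.strip()]
```

Calling read_authors("zeppelin.txt") on the example file above would return the four names as a list of strings.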

Features

Caching

Google blocking the program mid-run used to be a showstopper: all of the citations already scraped would be lost and the program would crash. Until... CACHING!

Every time all of the citations for a particular author have been scraped, they are added to a cache file called .pickle_cache.dat, which is created in the directory where the program is run. If the program stops because of a KeyboardInterrupt (^C) or a 503 response from Google's servers, the progress so far is saved to this file so that on the next run the scraping can resume from where it left off.
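
The caching idea can be sketched roughly as below. This is not the program's actual code: the function names and the scrape_author callback are hypothetical, and only the file name .pickle_cache.dat comes from the description above. The sketch assumes the cache is a dict mapping each author to their citations.

```python
import pickle
from pathlib import Path

CACHE_PATH = Path(".pickle_cache.dat")  # file name taken from the description above


def load_cache(path=CACHE_PATH):
    """Return the cached {author: citations} dict, or an empty dict if none exists."""
    if path.exists():
        with path.open("rb") as f:
            return pickle.load(f)
    return {}


def save_cache(cache, path=CACHE_PATH):
    """Write the cache dict to disk with pickle."""
    with path.open("wb") as f:
        pickle.dump(cache, f)


def scrape_all(authors, scrape_author, path=CACHE_PATH):
    """Scrape every author not already in the cache, saving progress on exit."""
    cache = load_cache(path)
    try:
        for author in authors:
            if author not in cache:
                # scrape_author is a hypothetical callback that does the real work
                cache[author] = scrape_author(author)
    finally:
        # Runs on normal completion, on KeyboardInterrupt, and on any other error
        save_cache(cache, path)
    return cache
```

Because the save happens in a finally block, an interrupted run still writes out every author that was fully scraped, and the next run skips those authors.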

Refined Search

Sometimes you want to limit your search to authors who are part of a particular institute or university. The --words option lets you specify this so that it is reflected in the results. For example, --words "UC Santa Cruz Genomics Institute" will give only results from authors within that institute.

Waiting

The --wait option can be used to wait for a specified number of seconds between queries, in the hope that this won't upset Google. The effectiveness of this solution has not been verified.
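
A delay between queries could look something like the sketch below. The names run_queries and send_query are hypothetical stand-ins, not functions from citation_scraper.py, and the sleep parameter exists only so the pause can be replaced in tests.

```python
import time


def run_queries(queries, send_query, wait_seconds=0.0, sleep=time.sleep):
    """Call send_query on each query, pausing wait_seconds between calls."""
    results = []
    for i, query in enumerate(queries):
        if i > 0 and wait_seconds > 0:
            sleep(wait_seconds)  # pause before every query except the first
        results.append(send_query(query))
    return results
```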

Troubleshooting

The most likely problem you will encounter is getting blocked by Google Scholar. There is a workaround!

You need:

  1. Mozilla Firefox

  2. A Firefox extension that allows you to export cookies in the Netscape cookie file format, such as Cookie Exporter.

Then:

  1. Navigate to one of the URLs that failed when requested (using Firefox)

  2. Fill out the captcha

  3. Export the cookies from the page (as cookies.txt)

  4. Save the file and run again, this time specifying the -c option. For example:

    $ python3 citation_scraper.py zeppelin.txt output.txt -c cookies.txt
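
To show how a Netscape-format cookies.txt can be used from Python, here is a minimal sketch built on the standard library's http.cookiejar.MozillaCookieJar. The helper name opener_with_cookies is hypothetical, and this is an assumption about the general approach rather than a copy of what the program does with the -c option.

```python
import http.cookiejar
import urllib.request


def opener_with_cookies(cookie_file):
    """Build a urllib opener that sends the cookies stored in cookie_file."""
    jar = http.cookiejar.MozillaCookieJar(cookie_file)
    # Also load session cookies and cookies that are past their expiry date
    jar.load(ignore_discard=True, ignore_expires=True)
    handler = urllib.request.HTTPCookieProcessor(jar)
    return urllib.request.build_opener(handler), jar
```

Requests made through the returned opener's open() method include any cookies from the file that match the target domain.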

If problems persist, contact Jesse: [email protected]

citation_scraper's People

Contributors

jessebrennan

Forkers

mickymocombe
