Intro

This is software used to scrape Google Scholar for citations by a particular author. It makes use of ckreibich's scholar.py, with a couple of modifications.

How to use

Setup

You will need Python 3 installed on your computer. Ideally you will also want virtualenv installed (note: if you have to install virtualenv, make sure you use pip3 instead of pip).
Next, you will need to clone this repo and/or download the zip.

Finally, make and launch a new virtualenv.

$ virtualenv myvenv  # this will make a directory called myvenv
$ source myvenv/bin/activate

Install the dependency:
```
$ pip3 install beautifulsoup4
```

Running

Your first line of defence is the help menu. Run

$ python3 citation_scraper.py --help

for details.

In general you need input. The program takes a file of authors' names, one per line, which would look something like this file, zeppelin.txt:

Jimmy Page
John Bonham
Robert Plant
John Paul Jones
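
How citation_scraper.py itself parses this file is not shown here. As a minimal sketch, assuming one name per line and that blank lines are ignored, reading such a file could look like the following (the function name `read_authors` is hypothetical):

```python
from pathlib import Path


def read_authors(path):
    """Return the author names in the file, one per non-empty line.

    Hypothetical helper for illustration; not taken from citation_scraper.py.
    """
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    # Strip surrounding whitespace and skip lines that are empty afterwards.
    return [line.strip() for line in lines if line.strip()]
```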

You must also specify where you want the output to go. Using the example file from above we could run the program as

$ python3 citation_scraper.py zeppelin.txt output.txt

Features

Caching

Google blocking the program mid-run used to be a showstopper. All of the citations already scraped would be lost and the program would crash. Until... CACHING!

Every time all of the citations for a particular author have been scraped, they are added to a cache file called .pickle_cache.dat, which is created in the directory where the program is run. If the program stops because of a KeyboardInterrupt (^C) or a 503 from Google's servers, the progress so far is saved to this file so that on the next run the scraping can resume from where it left off.
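
The idea can be sketched roughly as follows. This is not the program's actual code: the function names, the `fetch` callback, and the layout of the cache (a dict mapping author name to a list of citations) are assumptions made for illustration. Only the file name .pickle_cache.dat comes from the description above.

```python
import pickle
from pathlib import Path

CACHE_PATH = Path(".pickle_cache.dat")  # file name from the README


def load_cache(path=CACHE_PATH):
    """Return the cached results, or an empty dict if no cache exists yet."""
    if path.exists():
        with path.open("rb") as f:
            return pickle.load(f)
    return {}


def save_cache(cache, path=CACHE_PATH):
    """Write the cache dict to disk, replacing any previous cache file."""
    with path.open("wb") as f:
        pickle.dump(cache, f)


def scrape_all(authors, fetch, path=CACHE_PATH):
    """Scrape every author not already cached (hypothetical driver loop).

    `fetch` stands in for whatever function queries Google Scholar.
    The `finally` clause saves progress even if `fetch` raises or the
    user presses ^C, so the next run skips the authors already done.
    """
    cache = load_cache(path)
    try:
        for author in authors:
            if author in cache:
                continue  # finished on an earlier run
            cache[author] = fetch(author)
    finally:
        save_cache(cache, path)
    return cache
```

An author is only added to the cache once `fetch` has returned, so an author interrupted halfway through is scraped again from scratch on the next run.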

Refined Search

Sometimes you want to limit your search to authors who are part of a particular institute or university. The --words option lets you specify that affiliation so that it is reflected in the results. For example, --words "UC Santa Cruz Genomics Institute" will give only results from authors within that institute.

Waiting

The --wait option can be used to wait for a specified number of seconds between each query, in the hope that this won't upset Google. The effectiveness of this solution has not been verified.
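
A minimal sketch of what waiting between queries amounts to, assuming a fixed pause before every query except the first. The names `query_with_wait` and `fetch` are hypothetical and not taken from the program:

```python
import time


def query_with_wait(authors, fetch, wait_seconds=0.0, sleep=time.sleep):
    """Call fetch(author) for each author, pausing between queries.

    `sleep` is a parameter only so the pause can be replaced in tests;
    by default it is time.sleep.
    """
    results = {}
    for i, author in enumerate(authors):
        # No pause before the first query, wait_seconds before each later one.
        if i > 0 and wait_seconds > 0:
            sleep(wait_seconds)
        results[author] = fetch(author)
    return results
```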

Troubleshooting

Probably the only problem you will encounter is getting blocked by Google Scholar. There is a workaround!

You need:

Mozilla Firefox
A Firefox extension that allows you to export cookies in the Netscape cookie file format such as Cookie Exporter.

Then:

Navigate to one of the URLs that failed when requested (using Firefox)
Fill out the captcha
Export the cookies from the page (as cookies.txt)

Save the file and run again but specify the -c option. For example

$ python3 citation_scraper.py zeppelin.txt output.txt -c cookies.txt
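
For reference, a Netscape-format cookies.txt can be read with Python's standard library. The sketch below shows one way to do that and attach the cookies to outgoing requests; it is not necessarily how this program or scholar.py handles the -c option, and the function name is hypothetical.

```python
import urllib.request
from http.cookiejar import MozillaCookieJar


def build_opener_with_cookies(cookie_path):
    """Load a Netscape-format cookie file and return (opener, jar).

    Requests made through the returned opener send the loaded cookies.
    """
    jar = MozillaCookieJar(cookie_path)
    # Keep session cookies and expired cookies too, since an exported
    # file often contains both.
    jar.load(ignore_discard=True, ignore_expires=True)
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar
```

If loading fails with a LoadError, check that the exported file starts with the "# Netscape HTTP Cookie File" header line, which MozillaCookieJar expects.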

If problems persist, contact Jesse: [email protected]
