This is software used to scrape Google Scholar for citations by a particular author. It makes use of ckreibich's scholar.py, with a couple of modifications.
- You will need Python 3 installed on your computer. Ideally you will also want virtualenv installed (note: if you have to install virtualenv, make sure you use pip3 instead of pip).
- Next, you will need to clone this repo and/or download the zip.
- Finally, make and launch a new virtualenv:

$ virtualenv myvenv  # this will make a directory called myvenv
$ source myvenv/bin/activate
-
Install the dependency:
$ pip3 install beautifulsoup4
Your first line of defence is the help menu. Run
$ python3 citation_scraper.py --help
for details.
In general you need input. The program takes a file of authors' names, which would look something like this file, zeppelin.txt:
Jimmy Page
John Bonham
Robert Plant
John Paul Jones
You must also specify where you want the output to go. Using the example file from above, we could run the program as
$ python3 citation_scraper.py zeppelin.txt output.txt
Google blocking the program mid-run used to be a showstopper: all of the citations already scraped would be lost and the program would crash. Until... CACHING!
Every time all of the citations for a particular author are scraped, they are added to a cache file called .pickle_cache.dat, which is created in the directory where the program is run. If the program crashes due to a KeyboardInterrupt (^C) or a 503 from Google's servers, the progress so far is saved to this file so that on the next run the scraping can resume from where it left off.
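The resume behavior amounts to a pickle round-trip keyed by author. A minimal sketch of the idea — the function names and structure here are illustrative, not the actual internals of citation_scraper.py:

```python
import os
import pickle

CACHE_FILE = ".pickle_cache.dat"  # same file name the program uses

def load_cache(path=CACHE_FILE):
    """Return previously scraped results, or an empty dict on a first run."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {}

def save_cache(cache, path=CACHE_FILE):
    with open(path, "wb") as f:
        pickle.dump(cache, f)

def scrape_all(authors, scrape_one):
    """Scrape each author, skipping any already in the cache.

    `scrape_one` is a stand-in for the real per-author scraper.
    """
    cache = load_cache()
    try:
        for author in authors:
            if author in cache:
                continue  # already scraped on a previous run
            cache[author] = scrape_one(author)
    finally:
        # Runs even on KeyboardInterrupt or a 503, so progress survives
        save_cache(cache)
    return cache
```

Because the save happens in a `finally` block, a crash mid-run still leaves every completed author on disk, and the next run picks up from there.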
Sometimes you want to limit your search to authors who are part of a particular institute or university. The --words option lets you specify this so that it's reflected in the results. For example, --words "UC Santa Cruz Genomics Institute" will give only results from authors within that institute.
The --wait option can be used to pause for a specified number of seconds between queries, in the hope that this won't upset Google. The effectiveness of this approach has not been verified.
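Such a delay amounts to a sleep between requests in the query loop. A rough sketch, assuming the scraper does something along these lines (the names here are hypothetical, not the real ones):

```python
import time

def polite_queries(queries, wait_seconds, send):
    """Issue queries with a fixed pause between them.

    `send` stands in for the real HTTP request; `wait_seconds`
    mirrors the value passed via --wait.
    """
    results = []
    for i, query in enumerate(queries):
        if i > 0:
            time.sleep(wait_seconds)  # pause before every query after the first
        results.append(send(query))
    return results
```

Sleeping only between queries (not before the first) means a single-author run is unaffected by the flag.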
Probably the only problem you will encounter is getting blocked by Google Scholar's API. There is a workaround!
You need:

- Mozilla Firefox
- A Firefox extension that allows you to export cookies in the Netscape cookie file format, such as Cookie Exporter.
Then:

- Navigate to one of the URLs that failed when requested (using Firefox)
- Fill out the captcha
- Export the cookies from the page (as cookies.txt)
- Save the file and run again, but specify the -c option. For example:

$ python3 citation_scraper.py zeppelin.txt output.txt -c cookies.txt
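For reference, Python's standard library can read a Netscape-format cookies.txt directly via http.cookiejar.MozillaCookieJar. A sketch of what the -c option plausibly does under the hood (this is illustrative, not the actual code in citation_scraper.py):

```python
import http.cookiejar
import urllib.request

def opener_with_cookies(cookie_file):
    """Build a urllib opener that sends the exported Scholar cookies.

    `cookie_file` is a Netscape-format cookies.txt, as produced by
    a Firefox cookie-export extension.
    """
    jar = http.cookiejar.MozillaCookieJar(cookie_file)
    # ignore_discard/ignore_expires keep session cookies that the
    # exporter wrote without expiry metadata
    jar.load(ignore_discard=True, ignore_expires=True)
    return urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
```

Requests made through the returned opener carry the captcha-solved session cookies, which is what lets the scraper get past the block.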
If problems persist, contact Jesse: [email protected]