awicklund / async-pubmed-scraper

Forked from iliazenkov/async-pubmed-scraper

PubMed scraper for async search on a list of keywords and concurrent extraction of all found URLs, returning a DataFrame/CSV containing all article data (title, abstract, authors, affiliations, etc.)

License: MIT License

Python 100.00%

Asynchronous PubMed Scraper

Quick Start

Instructions below are for Windows. Make sure you have Python installed. Linux users: run python async_pubmed_scraper.py -h

  1. Open command prompt and change directory to the folder containing async_pubmed_scraper.py and keywords.txt
  2. Create and activate a virtual environment: python -m pip install --user virtualenv, then python -m venv scraper_env, then .\scraper_env\Scripts\activate
  3. Install dependencies: pip install -r requirements.txt
  4. Enter list of keywords to scrape, one per line, in keywords.txt
  5. Enter python async_pubmed_scraper.py -h for usage instructions and you are good to go

    Example: To scrape the first 10 pages of search results for your keywords from 2018 to 2020 and save the data to the file article_data.csv: python async_pubmed_scraper.py --pages 10 --start 2018 --stop 2020 --output article_data
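The flags in the example above map naturally onto Python's argparse module; a minimal sketch of such a command-line interface (flag names taken from the example, defaults here are assumptions, not the script's actual values):

```python
import argparse

def parse_args(argv=None):
    """Parse scraper options -- a sketch, not the actual script's parser."""
    parser = argparse.ArgumentParser(description="Asynchronous PubMed scraper")
    parser.add_argument("--pages", type=int, default=5,
                        help="number of results pages to scrape per keyword")
    parser.add_argument("--start", type=int, default=2018,
                        help="first publication year to include")
    parser.add_argument("--stop", type=int, default=2020,
                        help="last publication year to include")
    parser.add_argument("--output", type=str, default="article_data",
                        help="output CSV filename (without extension)")
    return parser.parse_args(argv)

args = parse_args(["--pages", "10", "--start", "2018",
                   "--stop", "2020", "--output", "article_data"])
print(args.pages, args.start, args.stop, args.output)
```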

Example Usage and Data

Collects the following data at 13 articles/second: url, title, abstract, authors, affiliations, journal, keywords, date
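Each scraped article becomes one row with the fields listed above; a minimal sketch of the resulting DataFrame and CSV output (the record values here are invented placeholders, not real PubMed data):

```python
import pandas as pd

# Column order matches the fields listed above.
COLUMNS = ["url", "title", "abstract", "authors", "affiliations",
           "journal", "keywords", "date"]

# One placeholder record standing in for a scraped article.
records = [{
    "url": "https://pubmed.ncbi.nlm.nih.gov/00000000/",
    "title": "Example article title",
    "abstract": "Example abstract text...",
    "authors": "A. Author; B. Author",
    "affiliations": "Example University",
    "journal": "Example Journal",
    "keywords": "example; placeholder",
    "date": "2020",
}]

df = pd.DataFrame(records, columns=COLUMNS)
df.to_csv("article_data.csv", index=False)  # same shape the scraper writes
```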

What it does

This script asynchronously scrapes PubMed, an open-access database of scholarly research articles, for a user-specified list of keywords across all results pages, and saves the data to a pandas DataFrame which is then written to a CSV for further processing.

Why scrape when there's an API? Why asynchronous?

PubMed provides an API, the NCBI Entrez API (also known as Entrez Programming Utilities or E-Utilities), which can be used to build datasets with their search engine. However, PubMed allows only 3 URL requests per second through E-Utilities (10/second with an API key). We're making potentially thousands of URL requests asynchronously, depending on the number of articles returned for the command-line options the user specifies. It's much faster to download articles from all URLs on all results pages in parallel than to download them page by page, article by article, waiting for each request to complete before starting the next. It's not unusual to see a 10x speedup with async scraping compared to regular scraping.
Simply put, the client sends requests for all search queries and all resulting article URLs at the same time. Otherwise, it would wait for the server to answer before sending the next request, so most of the script's execution time would be spent waiting for responses from the PubMed server.
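The concurrency pattern described above can be sketched with asyncio. Here a simulated network call (asyncio.sleep standing in for a real HTTP request, e.g. via aiohttp) is issued for every URL at once with asyncio.gather instead of one after another:

```python
import asyncio
import time

async def fetch_article(url: str) -> str:
    """Stand-in for an HTTP request; each 'request' takes 0.1 s."""
    await asyncio.sleep(0.1)
    return f"data from {url}"

async def scrape_all(urls):
    # Fire every request concurrently: total time is roughly one
    # request's latency, not len(urls) requests back to back.
    return await asyncio.gather(*(fetch_article(u) for u in urls))

urls = [f"https://pubmed.ncbi.nlm.nih.gov/{i}/" for i in range(50)]
start = time.perf_counter()
results = asyncio.run(scrape_all(urls))
elapsed = time.perf_counter() - start
print(f"fetched {len(results)} articles in {elapsed:.2f}s")
```

Run sequentially, the same 50 simulated requests would take about 5 seconds; gathered concurrently they finish in roughly the latency of a single request.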

License

License: MIT

Contributors

  • iliazenkov
Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.