GithubHelp home page GithubHelp logo

ref2pdf's Introduction

ref2pdf

Description

  • Now with multiprocessing, it can download 1 paper per second approximately.

This Python package aims to automate the download of scientific papers from various sources including Unpaywall, Sci-Hub, and directly from publisher or other websites. It provides a flexible and extensible solution for obtaining papers in PDF format by using different strategies and methods.

The default is to use multiprocessing for the downloads, if that is not working, try the main_singleprocessing.py file. If sci-hub domains are not working, go to the config.py file and change the sci-hub domains to those that are working right now.

Turn on and off deleting the output folder in the config.py file.

Also set the input and output files and folders there.

What is not so good

  • Scraping. It is a last resort, but for sure it could be improved to catch the papers that the other methods don't find.

Features

  • Unpaywall Integration: Downloads open-access papers from Unpaywall using their API. (unfortunately, not so good, maybe it can be improved with an API key, but they say that it is not necessary)

  • Sci-Hub Integration: Attempts to get papers via Sci-Hub using multiple domain options.

  • Direct Scraping: Downloads PDFs directly from webpages.

  • File Checks: Validates if the downloaded PDFs are genuine and non-empty.

  • You can turn on and off the different methods in the config.py file.

Dependencies

  • requests
  • pdfplumber
  • beautifulsoup4
  • selenium
  • prettytable
  • wget
  • pandas
  • bibtexparser

Example notebook

You can find an example notebook in the examples folder.

Modules

Imports

Responsible for loading a DataFrame from various input formats like .bib, .csv, .json, .txt, lists, and dictionaries.

File Checks

Provides utility functions to validate PDF files and manage file duplicates.

Direct Download

Contains a function for directly downloading PDF files from URLs.

Sci-Hub

Manages the download process from Sci-Hub by using different domain options.

Scraper

Scrapes web pages to find possible PDF URLs and then downloads them.

Unpaywall

Downloads papers by communicating with the Unpaywall API.

Installation

Just download the repository and run the main.py file.

Usage

You can drop an input file in the input folder and run the main.py file. Also if you drop it in the root, it will work. The input file can be a .bib, .csv, .json, .txt, or a list of DOIs. The output will be in the output folder.

Contributing

Feel free to open issues or PRs if you have suggestions for improving this package.

License

Unlicence

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.