billgreenwald / pubmed-batch-download

Batch download articles based on PMID (Pubmed ID)

License: MIT License

Ruby 27.38% Shell 0.35% Jupyter Notebook 44.87% Python 27.41%

pubmed-batch-download's Introduction

Pubmed-Batch-Download

Batch download articles based on PMID (Pubmed ID). This project is no longer being updated; I no longer have access to paywalled journals. If someone would like to take over support in full, go ahead and fork. Otherwise, contributions will have to be made by others, and I can merge PRs.

Version 3.0.0 Last update: 9/15/2020.

Required Packages

As of version 3.0.0, the program is written for python 3.7. It uses the following non-default packages:

requests
requests3
beautifulsoup4
lxml

Optionally, instead of installing these yourself, the included "pubmed-batch-downloader-py3.yml" file can be used with Anaconda to create an environment that has versions of Python and the packages known to work with this program. It can be installed on Linux via

conda env create -f pubmed-batch-downloader-py3.yml

or on windows via

conda env create -f pubmed-batch-downloader-py3-windows.yml

Then, activate the environment with

conda activate pubmed-batch-downloader-py3

If you use the windows environment, you will then need to run the following commands in order to install the other packages, as I cannot get the yml to work when they are included.

conda install requests beautifulsoup4 lxml
conda install requests3

Program Usage

Each run downloads the listed articles to a folder, by default named "fetched_pdfs" inside the application directory, with each PDF named after the PMID corresponding to the article. Articles already in the PDF folder are not downloaded again.
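The skip-if-present behavior described above can be sketched as follows. This is a minimal illustration, not the program's actual code: the function name `pmids_still_needed` is hypothetical, and it assumes each article is saved as `<PMID>.pdf` directly inside the output folder.

```python
from pathlib import Path


def pmids_still_needed(pmids, out_dir="fetched_pdfs"):
    """Return the PMIDs that have no '<pmid>.pdf' file in out_dir yet.

    Hypothetical helper: assumes files are named after the PMID, as the
    README describes for the default (one-column) case.
    """
    out_path = Path(out_dir)
    return [p for p in pmids if not (out_path / f"{p}.pdf").exists()]
```

Calling `pmids_still_needed(["123", "124"])` before a run would return only the PMIDs whose PDF is not already in the folder.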

Use the program via

python fetch_pdfs.py [-pmids or -pmf] [optional arguments]

Arguments: The program has the following arguments. It must be run with either -pmids or -pmf, not both. The help page can be displayed by running the program with -h, or with no arguments.

-pmids: A comma-separated list of PMIDs to download.
-pmf: A file with 1 or 2 columns of PMIDs and file names to download. See below for an example.
-out: The output folder to store the downloaded PDFs. By default, this is ./fetched_pdfs
-errors: File path to write all un-downloaded PMIDs during the program run. By default, this is ./unfetched_pmids.tsv. This file is overwritten each run.
-maxRetries: Maximum number of times to retry downloading a PDF after a connection error (specifically, an ECONNRESET, code 104).
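For readers who want to see how such an interface fits together, here is a rough sketch using `argparse`. It is an assumption about the structure, not a copy of fetch_pdfs.py: the function name `build_parser` is hypothetical, and the default of 3 for -maxRetries is a placeholder because the README does not state one.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented arguments (illustrative only)."""
    parser = argparse.ArgumentParser(prog="fetch_pdfs.py")

    # Exactly one of -pmids / -pmf must be given, never both.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("-pmids", help="comma-separated list of PMIDs")
    source.add_argument("-pmf", help="file with 1 or 2 tab-separated columns")

    parser.add_argument("-out", default="./fetched_pdfs")
    parser.add_argument("-errors", default="./unfetched_pmids.tsv")
    # Placeholder default: the README does not document the real value.
    parser.add_argument("-maxRetries", type=int, default=3)
    return parser
```

With this sketch, `build_parser().parse_args(["-pmids", "123,124"])` yields a namespace whose `pmids` attribute can be split on commas, while passing both -pmids and -pmf makes `argparse` exit with a usage error.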

PMF File Format: The -pmf file allows the user to input a file with a list of pmids, one per line, to download, instead of listing them in the command line with a comma separated list. This structure would be as follows

PMID1
PMID2
PMID3
...

Optionally, this file can have a second column, which is what to name the files when you download them. For example, if I wanted to download the article with pmid 123 and name it "Article_1.pdf" and pmid 4456 with name "Some_Other_Article.pdf", I would use the following pmf file (note, the columns are tab separated)

123	Article_1
4456	Some_Other_Article

When the program cannot download files, the non-downloaded PMIDs are stored in a PMF format file. This can then be directly used at a later date with the program. PMIDs and names are both stored within this file.
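The PMF format described above is simple enough to parse in a few lines. The following is a minimal sketch under the stated rules (one PMID per line, optional tab-separated name); the function name `parse_pmf_lines` is hypothetical, and falling back to the PMID when no name is given is an assumption based on the default file naming.

```python
def parse_pmf_lines(lines):
    """Yield (pmid, name) pairs from PMF lines.

    Hypothetical helper: blank lines are skipped, and the name falls back
    to the PMID when the second column is missing or empty.
    """
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        columns = line.split("\t")
        pmid = columns[0].strip()
        has_name = len(columns) > 1 and columns[1].strip()
        name = columns[1].strip() if has_name else pmid
        yield pmid, name
```

Because it accepts any iterable of strings, the same function works on an open file handle (`parse_pmf_lines(open(path))`) or on a list of lines in a test.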

Example script usage:

python fetch_pdfs.py -pmids 123,124,125,23923,111

will place the files 123.pdf, 124.pdf, 125.pdf, 23923.pdf, and 111.pdf inside the PDF folder, assuming all were found.

Known download issues

The requests package cannot execute JavaScript, so pages that require JavaScript to load the link to the PDF or to the journal cannot be fetched with this program. As of now, this affects Wolters Kluwer journals.

pubmed-batch-download's Issues

License

Hey, this is awesome!

Thanks for writing it :)

Have you considered adding a license?

Cheers

use pmf with Ruby version?

Hi Bill,
Is it possible to use a pmf with the Ruby version of the script?
Thanks!

Invalid URL, no scheme supplied.

I have a list of PMIDs, and most of them return the error below.
failed from error Invalid URL 'DYO5YSKQsvZXXy6uuDK4U4OqcUzpL1eBPhVPgvooI9ZjD1OcNxvES35gEbcFgwaa': No scheme supplied. Perhaps you meant http://DYO5YSKQsvZXXy6uuDK4U4OqcUzpL1eBPhVPgvooI9ZjD1OcNxvES35gEbcFgwaa?
Any idea?
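One way to surface this kind of failure earlier is to check that an extracted link is actually an absolute HTTP(S) URL before requesting it. The sketch below uses only the standard library; the function name `looks_like_http_url` is hypothetical, and it does not explain why the scraped value is a token rather than a link.

```python
from urllib.parse import urlparse


def looks_like_http_url(candidate):
    """Return True if candidate has an http/https scheme and a host part."""
    parsed = urlparse(candidate)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```

A value such as the 64-character string in the error above has no scheme, so the check returns False and the caller can try the next finder instead of passing it to requests.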

Damaged PDF & fetching stops

Hi, I am trying to use the code with a couple of PMIDs. The PDFs download, but they are damaged, and after 14 entries the program gives the following error message:

Traceback (most recent call last):
File "fetch_pdfs.py", line 252, in
if type(e)==requests.ConnectionError and '104' in e[0][1][0]:
TypeError: argument of type 'int' is not iterable

After this message, the fetching is interrupted. Below are the PMIDs I am trying.

python fetch_pdfs.py -pmids 26633170,23682673,25040501,24628937,27174497,27547345,22610656,23858657,24998529,27859194,26991916,26742956,22268844,27547334,16299005,26658101,24458119,24850527,25859332,17522077,22739706,24628897,24232381,23127184,27329944,25480711,25253712,20574680,19333624,24131615,14761053,25704464,26507115,25754608,26655157,28308115,27551374,21777248,24372301,28568420,28309130,22711559,19874617,27777723,26199373,22680336,16004288,26949084,23624924,23339242,22074778,19763848,22666114,27680661,19324745,24138122,23603953,21833640,25002701,24933810,18724731,26070638,28312167,17750894,18707428,16670987,25664897,4066794,21546431,19663992,12803910,24800839,20636902,27038018,25948688,25165527,27648239,24266037,26482059,18593688,27146894,11222244,21636492,23002269,10860912,26987770,25002705,24743567,28311501,23294438,28310242,21237765,23134452,27870050,24372761,21653461,19704675,28565336,19367315,15271088,19910534,23963860,12858276,20576739,28564966,28565464,24287813,25272164,21484398,25347541,28313987,25130655,26817765,22151952,15255098,22652419,21134082,17652341,26573095,24766107,20408751,17711841,28313163,26578721,18289396,28547066,19131378,19121112,19324662,24317664,11080108,27767040,10205070,28310724,22805583,24193000,19412706,21642227,26878831,21632396,26421845,28309726,20592812,25903102,19218583,19001427,21789530,20345818,20047872,28310543,24464206,10568781,20676914,22438504,10431223,20954889,28547089,22519776,11607153,12659040,22156401,19429671,15596454,16371444,19398446,27851814,27714795,28307360,28308328,12437082,19654608,19050951,19516075,28593665,19153768,21636399,22476079,21170748,19126635,28312388,11539321,19218577,16615203,9299797,28565680,14652688,16133196,18637960,16866959,16593140,28564904,28568165,21669711,29673012,18761503,21669696,16866958,14551828,20961923,17879195,17416914,28312462,19443460,18707369,21755150,21636368,17427121,17300430,21665640,28698790,28309456,27864223,28312030,15696741,11222245,28311108,21642173,29880773,17203434,28877178,18426489,20952615,19739370,18031491,29134400,28568788,19158031,29280577,28313078,28428861,21653420,15696748,15280895,11353709,10860920,12207039,28626040,15212378,29532921,28204486,29765587,28960844,29658115,29346506,29468326,28904775,28428199,27915467,28798863,28135774,28647753,28861252,28822496,29947735,29917223,28079938,28504871,29464694,29893413,29878057,29878055,29882762,29445017 -maxRetries 3

Any thoughts are much appreciated
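The TypeError in the traceback comes from indexing into the exception (`e[0][1][0]`) and assuming a fixed nesting of arguments. A more defensive approach, sketched below, is to walk whatever the exception wraps and look for an `OSError` whose `errno` is ECONNRESET. This is a hedged illustration, not a patch taken from the repository: the function name `is_connection_reset` is hypothetical, and it assumes the connection error keeps the underlying socket error somewhere in its `args` or exception chain.

```python
import errno


def is_connection_reset(exc):
    """Return True if exc, or anything it wraps, carries errno ECONNRESET."""
    seen = set()
    stack = [exc]
    while stack:
        current = stack.pop()
        if id(current) in seen:
            continue
        seen.add(id(current))
        if isinstance(current, OSError) and current.errno == errno.ECONNRESET:
            return True
        if isinstance(current, BaseException):
            # Wrapped errors may live in args or in the exception chain.
            stack.extend(current.args)
            for linked in (current.__cause__, current.__context__):
                if linked is not None:
                    stack.append(linked)
    return False
```

Comparing against `errno.ECONNRESET` instead of the literal 104 also keeps the check portable, since the numeric value differs between operating systems.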

Update to avoid known mechanize error

Mechanize, after Ruby version 1.9, throws the error

too many connection resets (due to end of file reached - EOFError) after 0 requests on 26040640

for some websites. A workaround is needed to be able to grab documents from particular websites.

Error: Invalid URL 'DirectEmailBox-inPage'

I'm getting the following error. Has anyone else experienced this as well? Or is this likely a user error on my part?

Trying to fetch pmid 31619796
Trying genericCitationLabelled
Trying pubmed_central_v2
Trying acsPublications
Trying uchicagoPress
Trying nejm
Trying futureMedicine
Trying science_direct
** fetching of reprint 31619796 failed from error Invalid URL 'DirectEmailBox-inPage': No schema supplied. Perhaps you meant http://DirectEmailBox-inPage?

I'm using
$ ~/anaconda3/bin/conda --version
conda 4.8.3

$ git log
commit 75220d9 (HEAD -> master, origin/master, origin/HEAD)
Author: Bill Greenwald [email protected]
Date: Sat Oct 12 14:50:03 2019 -0700

    Update README.md

fetching error

I got the same error for all PMIDs I tried so far.
Eg, ** fetching of reprint 123 failed from error list index out of range
I use version pubmed-batch-download 3.0.0, python 3.7.4.

PMID extraction in bulk!

I have a list of Article title for which I wanted to extract PMID from NCBI, can I do it in one go?

Trying to fetch pmid 30374447 ** fetching of reprint 30374447 failed from error ('Connection aborted.', BadStatusLine("''",))

Trouble with Elsevier articles

Hello!

I am having trouble downloading Elsevier papers, even though I can access them through my academic network. Here are the PMIDs:

30898248
29934065
28325353
28256256

I have many more. Any help you can give is greatly appreciated!

Same error message

I'm getting the same error regardless of what I do.

python fetch_pdfs.py -pmf example_pmf.tsv
Trying to fetch pmid 28514316
** Reprint 28514316 cannot be fetched as ovid is not supported by the requests package.
python fetch_pdfs.py -pmid 30374447
Trying to fetch pmid 28514316
** Reprint 28514316 cannot be fetched as ovid is not supported by the requests package.

The PubMed ID it requests is not even the one I passed in.

failed to fetch

Hello
I just installed the 2 required packages and tried to fetch a couple of refs (using either my PMID or the example_pmf.tsv) but I get the following errors:
Any suggestions?
thanks!

$ python fetch_pdfs.py -pmf example_pmf.tsv
~/anaconda2/lib/python2.7/site-packages/cryptography/hazmat/primitives/constant_time.py:26: CryptographyDeprecationWarning: Support for your Python version is deprecated. The next version of cryptography will remove support. Please upgrade to a 2.7.x release that supports hmac.compare_digest as soon as possible.
  utils.DeprecatedIn23,
Trying to fetch pmid 27547345
** fetching of reprint 27547345 failed from error 'NoneType' object has no attribute 'readline'

Error:

Hi,

I'm trying to do a test of the program and am using your test file.

$ python fetch_pdfs.py -pmf example_pmf.tsv -out test1

However I'm getting a connection error - it seems that eutils.ncbi.nlm.nih.gov is no longer available...

Trying to fetch pmid 27547345
** fetching of reprint 27547345 failed from error HTTPConnectionPool(host='eutils.ncbi.nlm.nih.gov', port=80): Max retries exceeded with url: /entrez/eutils/elink.fcgi?dbfrom=pubmed&id=27547345&retmode=ref&cmd=prlinks (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x2aaab511fb38>: Failed to establish a new connection: [Errno -2] Name or service not known'))

Thanks.

Add interface for Zotero translators

Hi Bill,
I get two other types of error messages for papers I can access if I click through from pubmed.
I suspect that the "badstatusline" error may relate to the fact that I am running the queries from within WSL.

Some example papers are
25176136 - an open NEJM paper
26030325 - a PubMedCentral paper
17074775 - a European heart journal paper

I have given an example of each type of error message

Messages follow

Trying to fetch pmid 25176136
Trying genericCitationLabelled
Trying pubmed_central
Trying acsPublications
Trying uchicagoPress
Trying science_direct
** fetching of reprint 25176136 failed from error Invalid URL '': No schema supplied. Perhaps you meant http://?
Trying to fetch pmid 26030325
** fetching of reprint 26030325 failed from error ('Connection aborted.', BadStatusLine("''",))
Trying to fetch pmid 17074775
** fetching of reprint 17074775 failed from error ('Connection aborted.', BadStatusLine("''",))

Change require to require_relative

To enable using the script from another directory, it'd be good to change
require './pdfetch.rb'
to
require_relative './pdfetch.rb'

Thank you,

David

index out of range error

Hello,
I am getting an index out of range error for certain PMIDs.
I am able to download about 1 in 3 PMIDs I am seeking and the most common error is "index out of range"
Could the loop indexing be longer than the list of possible errors?
Many thanks

Download fails: NoneType object has no attribute..

I am far from python savvy so if this is a simple error on my part, I apologize.

I have installed the necessary software and get the following error.

$ python3.7 ~/Repositories/Pubmed-Batch-Download/fetch_pdfs.py -pmids 31336898
Output directory of fetched_pdfs did not exist. Created the directory.
Trying to fetch pmid 31336898
** fetching of reprint 31336898 failed from error 'NoneType' object has no attribute 'readline'

Files are downloaded successfully, but they seem corrupt.

Hey Bill,
The code is working fine and whenever possible, the files are getting downloaded. However, all of these pdfs seem corrupt.

To make sure I am not doing anything wrong, I created a virtual environment and downgraded all packages to what you were using when you developed this, still the issue persists,

Some PMIDs to replicate the issue would be the sample ones in your Readme.

Thanks in advance.
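A quick way to tell a real PDF from an HTML error page saved with a .pdf extension is to check the file header, since PDF files begin with the bytes `%PDF-`. The helper below is a small sketch; the name `looks_like_pdf` is hypothetical and it only checks the header, not whether the rest of the file is intact.

```python
def looks_like_pdf(path):
    """Return True if the file at path begins with the PDF header bytes."""
    with open(path, "rb") as handle:
        return handle.read(5) == b"%PDF-"
```

Running it over the output folder would show whether the "corrupt" files are truncated PDFs or something else entirely, such as a login or paywall page.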

"failed from error Invalid URL"

Hi, thank you for your program! With all of my PMIDs I get one of the following errors:

** fetching of reprint 33191945 failed from error Invalid URL 'voSN1zD2LAqLbgiL7dZrDuKtt2DeC6Ln3TW51UJm5FtsTdsf5zb1XYxjdjTAq5zn': No schema supplied. Perhaps you meant http://voSN1zD2LAqLbgiL7dZrDuKtt2DeC6Ln3TW51UJm5FtsTdsf5zb1XYxjdjTAq5zn?
Trying to fetch pmid 33013186
*

** fetching of reprint 30793269 failed from error Failed to parse: ✓
Trying to fetch pmid 32388849
Trying genericCitationLabelled
Trying pubmed_central_v2
Trying acsPublications
Trying uchicagoPress
Trying nejm
Trying futureMedicine

Do you know how I can fix this?

Error with pubmedid2pdf.rb

Hi, I'm new to Bash (working on Linux Mint 17.2) and not a Ruby user. I installed Ruby 2.1.2 via RVM and the installation went fine. I ran bash setup.sh and the pubmedid2pdf.rb script, but obtained the error below:

[12:33] ~/.../Pubmed-Batch-Download$ ruby pubmedid2pdf.rb 26830047,26728431
/home/joanna/.rvm/rubies/ruby-2.1.2/lib/ruby/site_ruby/2.1.0/rubygems/core_ext/kernel_require.rb:54:in `require': cannot load such file -- camping (LoadError)
    from /home/joanna/.rvm/rubies/ruby-2.1.2/lib/ruby/site_ruby/2.1.0/rubygems/core_ext/kernel_require.rb:54:in `require'
    from /home/joanna/Dropbox/Sketchbook/ruby/Pubmed-Batch-Download/pdfetch.rb:27:in `<top (required)>'
    from /home/joanna/.rvm/rubies/ruby-2.1.2/lib/ruby/site_ruby/2.1.0/rubygems/core_ext/kernel_require.rb:54:in `require'
    from /home/joanna/.rvm/rubies/ruby-2.1.2/lib/ruby/site_ruby/2.1.0/rubygems/core_ext/kernel_require.rb:54:in `require'
    from pubmedid2pdf.rb:37:in `<main>'

Would appreciate if you could suggest how this could be fixed or if I made a mistake somewhere. This is a great tool, and many thanks in advance.

Errors downloading articles

** fetching of reprint 28341702 failed from error Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?

Problem when trying to download (PMIDs): 9029852, 19092482,11382209

This happened with all three articles I've tried to download. What could be happening?

Thanks
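The message above is what BeautifulSoup reports when the lxml parser is requested but not installed; installing lxml (listed under Required Packages) is the direct fix. As a fallback sketch, a script could choose the parser name at runtime. The function name `pick_bs4_parser` is hypothetical, and using the built-in `html.parser` instead of lxml is an assumption that may change how some pages are parsed.

```python
import importlib.util


def pick_bs4_parser():
    """Return 'lxml' when it is importable, else the built-in 'html.parser'."""
    if importlib.util.find_spec("lxml") is not None:
        return "lxml"
    return "html.parser"
```

The returned string would then be passed as the second argument to `BeautifulSoup(html, ...)`.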

Error with Physiology Free articles

When fetching the physiology articles, I get:

python fetch_pdfs.py -pmid 11045978
Trying to fetch pmid 11045978
Trying genericCitationLabelled
Trying pubmed_central_v2
Trying acsPublications
Trying uchicagoPress
Trying nejm
Trying futureMedicine
** fetching reprint using the 'future medicine' finder...
** fetching of reprint 11045978 failed from error HTTPSConnectionPool(host='www.physiology.orghttps', port=443): Max retries exceeded with url: //www.physiology.org/doi/pdf/10.1152/ajpheart.2000.279.5.H2405 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x10e25a588>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known',))

This happens for virtually all of their papers. Can you help?

Thanks!
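The host 'www.physiology.orghttps' in the error suggests that a site prefix was concatenated onto a link that was already absolute. That is a guess from the message, not something confirmed in the code. If so, resolving links with `urljoin` instead of string concatenation would handle relative, protocol-relative, and absolute links alike. In the sketch below, the function name `resolve_pdf_link` and the article page URL used in the usage note are illustrative assumptions.

```python
from urllib.parse import urljoin


def resolve_pdf_link(page_url, href):
    """Resolve href against page_url; absolute hrefs are returned unchanged."""
    return urljoin(page_url, href)
```

For example, with a page URL of `https://www.physiology.org/doi/full/10.1152/ajpheart.2000.279.5.H2405`, both `/doi/pdf/10.1152/ajpheart.2000.279.5.H2405` and the full `https://www.physiology.org/doi/pdf/10.1152/ajpheart.2000.279.5.H2405` resolve to the same absolute PDF URL.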