billgreenwald / pubmed-batch-download

Batch download articles based on PMID (Pubmed ID)

License: MIT License

Ruby 27.38% Shell 0.35% Jupyter Notebook 44.87% Python 27.41%

pubmed-batch-download's People

Contributors

aguynamedryan, billgreenwald

pubmed-batch-download's Issues

Update to avoid known mechanize error

Mechanize, after Ruby version 1.9, throws the error

too many connection resets (due to end of file reached - EOFError) after 0 requests on 26040640

for some websites. A workaround is needed to be able to grab documents from particular websites.

Errors downloading articles

** fetching of reprint 28341702 failed from error Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?
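
This message is raised by Beautiful Soup when the `lxml` parser is requested but not installed. Below is a minimal sketch of one way to degrade gracefully, assuming the script passes a parser name to Beautiful Soup. `pick_parser` is a hypothetical helper, not part of fetch_pdfs.py.

```python
import importlib.util


def pick_parser(preferred="lxml", fallback="html.parser"):
    """Return `preferred` if that module can be imported, else the fallback.

    Hypothetical helper: "html.parser" is the parser bundled with Python,
    so it is always available as a last resort.
    """
    if importlib.util.find_spec(preferred) is not None:
        return preferred
    return fallback


# The returned name would be passed as the second argument to BeautifulSoup(...).
parser_name = pick_parser()
print(parser_name)
```

Installing the parser (`pip install lxml`) is the more direct fix; the fallback only keeps the script running when that is not possible.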

Change require to require_relative

To enable using the script from another directory, it'd be good to change
require './pdfetch.rb'
to
require_relative './pdfetch.rb'

Thank you,

David

Trouble with Elsevier articles

Hello!

I am having trouble downloading Elsevier papers, even though I can access them through my academic network. Here are the PMIDs:

30898248
29934065
28325353
28256256

I have many more. Any help you can give is greatly appreciated!

Error:

Hi,

I'm trying to do a test of the program and am using your test file.

$ python fetch_pdfs.py -pmf example_pmf.tsv -out test1

However I'm getting a connection error - it seems that eutils.ncbi.nlm.nih.gov is no longer available...

Trying to fetch pmid 27547345
** fetching of reprint 27547345 failed from error HTTPConnectionPool(host='eutils.ncbi.nlm.nih.gov', port=80): Max retries exceeded with url: /entrez/eutils/elink.fcgi?dbfrom=pubmed&id=27547345&retmode=ref&cmd=prlinks (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x2aaab511fb38>: Failed to establish a new connection: [Errno -2] Name or service not known'))

Thanks.
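
The error above is a name-resolution failure ("Name or service not known") on a plain-HTTP request to port 80, which tends to point at the local network or DNS rather than at the service itself. Separately, NCBI serves E-utilities over HTTPS. The sketch below builds the same elink request over HTTPS. `build_prlinks_url` is a hypothetical helper, and the query parameters are copied from the error message.

```python
from urllib.parse import urlencode

# HTTPS base for NCBI E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_prlinks_url(pmid):
    """Build the elink URL that redirects to the publisher's full-text page."""
    params = {
        "dbfrom": "pubmed",
        "id": str(pmid),
        "retmode": "ref",
        "cmd": "prlinks",
    }
    return EUTILS_BASE + "elink.fcgi?" + urlencode(params)


print(build_prlinks_url(27547345))
```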

Error: Invalid URL 'DirectEmailBox-inPage'

I'm getting the following error. Has anyone else experienced this as well? Or is this likely a user error on my part?

Trying to fetch pmid 31619796
Trying genericCitationLabelled
Trying pubmed_central_v2
Trying acsPublications
Trying uchicagoPress
Trying nejm
Trying futureMedicine
Trying science_direct
** fetching of reprint 31619796 failed from error Invalid URL 'DirectEmailBox-inPage': No schema supplied. Perhaps you meant http://DirectEmailBox-inPage?

I'm using
`
$ ~/anaconda3/bin/conda --version
conda 4.8.3

$ git log
commit 75220d9 (HEAD -> master, origin/master, origin/HEAD)
Author: Bill Greenwald [email protected]
Date: Sat Oct 12 14:50:03 2019 -0700

    Update README.md
`
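
The "No schema supplied" message indicates that a string which is not a URL (here `DirectEmailBox-inPage`, which looks like an element id or class name scraped from the page) was handed to the HTTP client. A minimal sketch of a guard that a finder could apply before fetching. `is_fetchable_url` is a hypothetical name, and where such a check would go in fetch_pdfs.py is an assumption.

```python
from urllib.parse import urlparse


def is_fetchable_url(candidate):
    """Return True only for absolute http(s) URLs with a host part."""
    parts = urlparse(candidate or "")
    return parts.scheme in ("http", "https") and bool(parts.netloc)


# Values taken from the error reports, plus one well-formed URL.
for value in ("DirectEmailBox-inPage", "", "https://example.org/paper.pdf"):
    print(repr(value), is_fetchable_url(value))
```

With a guard like this, a finder that extracts a junk value can be skipped so the next finder is tried, instead of the whole fetch failing.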

License

Hey, this is awesome!

Thanks for writing it :)

Have you considered adding a license?

Cheers

"failed from error Invalid URL"

Hi, thank you for your program! With all of my PMIDs I get one of the following errors:

** fetching of reprint 33191945 failed from error Invalid URL 'voSN1zD2LAqLbgiL7dZrDuKtt2DeC6Ln3TW51UJm5FtsTdsf5zb1XYxjdjTAq5zn': No schema supplied. Perhaps you meant http://voSN1zD2LAqLbgiL7dZrDuKtt2DeC6Ln3TW51UJm5FtsTdsf5zb1XYxjdjTAq5zn?
Trying to fetch pmid 33013186
*
** fetching of reprint 30793269 failed from error Failed to parse: ✓
Trying to fetch pmid 32388849
Trying genericCitationLabelled
Trying pubmed_central_v2
Trying acsPublications
Trying uchicagoPress
Trying nejm
Trying futureMedicine

Do you know how I can fix this?

Same error message

I'm getting the same error regardless of what I do.

python fetch_pdfs.py -pmf example_pmf.tsv
Trying to fetch pmid 28514316
** Reprint 28514316 cannot be fetched as ovid is not supported by the requests package.
python fetch_pdfs.py -pmid 30374447
Trying to fetch pmid 28514316
** Reprint 28514316 cannot be fetched as ovid is not supported by the requests package.

Even the PubMed ID it is requesting is incorrect.

Error with pubmedid2pdf.rb

Hi, I'm new to Bash (working on Linux Mint 17.2) and not a Ruby user. I installed Ruby 2.1.2 via RVM and the installation went fine. I ran bash setup.sh and the pubmedid2pdf.rb script, but obtained the error below:

[12:33] ~/.../Pubmed-Batch-Download$ ruby pubmedid2pdf.rb 26830047,26728431
/home/joanna/.rvm/rubies/ruby-2.1.2/lib/ruby/site_ruby/2.1.0/rubygems/core_ext/kernel_require.rb:54:in `require': cannot load such file -- camping (LoadError)
    from /home/joanna/.rvm/rubies/ruby-2.1.2/lib/ruby/site_ruby/2.1.0/rubygems/core_ext/kernel_require.rb:54:in `require'
    from /home/joanna/Dropbox/Sketchbook/ruby/Pubmed-Batch-Download/pdfetch.rb:27:in `<top (required)>'
    from /home/joanna/.rvm/rubies/ruby-2.1.2/lib/ruby/site_ruby/2.1.0/rubygems/core_ext/kernel_require.rb:54:in `require'
    from /home/joanna/.rvm/rubies/ruby-2.1.2/lib/ruby/site_ruby/2.1.0/rubygems/core_ext/kernel_require.rb:54:in `require'
    from pubmedid2pdf.rb:37:in `<main>'

Would appreciate if you could suggest how this could be fixed or if I made a mistake somewhere. This is a great tool, and many thanks in advance.

Damaged PDF & fetching stops

Hi, I am trying to use the code with a number of PMIDs. The PDFs do download, but they arrive damaged, and after 14 entries I get the following error message:

Traceback (most recent call last):
  File "fetch_pdfs.py", line 252, in <module>
    if type(e)==requests.ConnectionError and '104' in e[0][1][0]:
TypeError: argument of type 'int' is not iterable

After this message, the fetching is interrupted. Below are the PMIDs I am trying.

python fetch_pdfs.py -pmids 26633170,23682673,25040501,24628937,27174497,27547345,22610656,23858657,24998529,27859194,26991916,26742956,22268844,27547334,16299005,26658101,24458119,24850527,25859332,17522077,22739706,24628897,24232381,23127184,27329944,25480711,25253712,20574680,19333624,24131615,14761053,25704464,26507115,25754608,26655157,28308115,27551374,21777248,24372301,28568420,28309130,22711559,19874617,27777723,26199373,22680336,16004288,26949084,23624924,23339242,22074778,19763848,22666114,27680661,19324745,24138122,23603953,21833640,25002701,24933810,18724731,26070638,28312167,17750894,18707428,16670987,25664897,4066794,21546431,19663992,12803910,24800839,20636902,27038018,25948688,25165527,27648239,24266037,26482059,18593688,27146894,11222244,21636492,23002269,10860912,26987770,25002705,24743567,28311501,23294438,28310242,21237765,23134452,27870050,24372761,21653461,19704675,28565336,19367315,15271088,19910534,23963860,12858276,20576739,28564966,28565464,24287813,25272164,21484398,25347541,28313987,25130655,26817765,22151952,15255098,22652419,21134082,17652341,26573095,24766107,20408751,17711841,28313163,26578721,18289396,28547066,19131378,19121112,19324662,24317664,11080108,27767040,10205070,28310724,22805583,24193000,19412706,21642227,26878831,21632396,26421845,28309726,20592812,25903102,19218583,19001427,21789530,20345818,20047872,28310543,24464206,10568781,20676914,22438504,10431223,20954889,28547089,22519776,11607153,12659040,22156401,19429671,15596454,16371444,19398446,27851814,27714795,28307360,28308328,12437082,19654608,19050951,19516075,28593665,19153768,21636399,22476079,21170748,19126635,28312388,11539321,19218577,16615203,9299797,28565680,14652688,16133196,18637960,16866959,16593140,28564904,28568165,21669711,29673012,18761503,21669696,16866958,14551828,20961923,17879195,17416914,28312462,19443460,18707369,21755150,21636368,17427121,17300430,21665640,28698790,28309456,27864223,28312030,15696741,11222245,28311108,21642173,29880773,17203434,28877178,18426489,20952615,19739370,18031491,29134400,28568788,19158031,29280577,28313078,28428861,21653420,15696748,15280895,11353709,10860920,12207039,28626040,15212378,29532921,28204486,29765587,28960844,29658115,29346506,29468326,28904775,28428199,27915467,28798863,28135774,28647753,28861252,28822496,29947735,29917223,28079938,28504871,29464694,29893413,29878057,29878055,29882762,29445017 -maxRetries 3

Any thoughts are much appreciated
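
The TypeError in the traceback shows that `e[0][1][0]` evaluated to an integer, so the substring test `'104' in ...` could not work there (indexing an exception object this way is also only possible on Python 2). Below is a minimal sketch of a check that does not depend on the exact nesting of the exception's arguments. `is_connection_reset` is a hypothetical helper, and the assumption is that the reset surfaces as an `OSError` somewhere inside the wrapped exception's `args`.

```python
import errno


def is_connection_reset(exc, _depth=0):
    """Return True if `exc`, or an exception nested in its args, is ECONNRESET."""
    if _depth > 5:  # guard against unexpectedly deep or cyclic nesting
        return False
    if isinstance(exc, OSError) and exc.errno == errno.ECONNRESET:
        return True
    for arg in getattr(exc, "args", ()):
        if isinstance(arg, BaseException) and is_connection_reset(arg, _depth + 1):
            return True
    return False


# Simulate a wrapped "connection reset by peer" error.
inner = ConnectionResetError(errno.ECONNRESET, "Connection reset by peer")
wrapped = Exception(Exception("Connection aborted.", inner))
print(is_connection_reset(wrapped))
print(is_connection_reset(ValueError("104")))
```

Comparing against `errno.ECONNRESET` instead of the literal 104 also avoids hard-coding a platform-specific error number.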

Files are downloaded successfully, but they seem corrupt.

Hey Bill,
The code is working fine and whenever possible, the files are getting downloaded. However, all of these pdfs seem corrupt.

To make sure I am not doing anything wrong, I created a virtual environment and downgraded all packages to the versions you were using when you developed this, but the issue persists.

Some PMIDs to replicate the issue would be the sample ones in your Readme.

Thanks in advance.
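
One common cause of "corrupt" PDFs in scrapers is that the server returned an HTML page (a login wall or an error page) which was then saved with a .pdf extension. Whether that is what happens here is an assumption. A minimal sketch of a check on the downloaded bytes before writing them to disk, where `looks_like_pdf` is a hypothetical helper:

```python
def looks_like_pdf(data):
    """Return True if the bytes carry a PDF header near the start.

    PDF files begin with "%PDF-"; many readers also tolerate a little
    leading junk, so the first 1024 bytes are searched.
    """
    return b"%PDF-" in data[:1024]


print(looks_like_pdf(b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"))
print(looks_like_pdf(b"<!DOCTYPE html><html><body>Please log in</body></html>"))
```

Opening one of the damaged files in a text editor and finding HTML at the top would confirm this diagnosis.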

fetching error

I have gotten the same error for every PMID I have tried so far.
E.g., ** fetching of reprint 123 failed from error list index out of range
I am using pubmed-batch-download 3.0.0 with Python 3.7.4.

Download fails: NoneType object has no attribute..

I am far from python savvy so if this is a simple error on my part, I apologize.

I have installed the necessary software and get the following error.

$ python3.7 ~/Repositories/Pubmed-Batch-Download/fetch_pdfs.py -pmids 31336898
Output directory of fetched_pdfs did not exist. Created the directory.
Trying to fetch pmid 31336898
** fetching of reprint 31336898 failed from error 'NoneType' object has no attribute 'readline'

failed to fetch

Hello
I just installed the 2 required packages and tried to fetch a couple of refs (using either my PMID or the example_pmf.tsv) but I get the following errors:
Any suggestions?
thanks!

$ python fetch_pdfs.py -pmf example_pmf.tsv
~/anaconda2/lib/python2.7/site-packages/cryptography/hazmat/primitives/constant_time.py:26: CryptographyDeprecationWarning: Support for your Python version is deprecated. The next version of cryptography will remove support. Please upgrade to a 2.7.x release that supports hmac.compare_digest as soon as possible.
  utils.DeprecatedIn23,
Trying to fetch pmid 27547345
** fetching of reprint 27547345 failed from error 'NoneType' object has no attribute 'readline'

PMID extraction in bulk!

I have a list of article titles for which I want to extract the PMIDs from NCBI. Can I do this in one go?
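
This is outside what the downloader itself does, but NCBI's ESearch E-utility can look up PMIDs by title, one request per title. The sketch below only builds the request URLs and sends nothing over the network. `build_title_search_url` is a hypothetical helper, and `retmax=5` is an arbitrary choice for illustration.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_title_search_url(title, retmax=5):
    """Build an ESearch URL that searches PubMed's Title field."""
    params = {
        "db": "pubmed",
        "term": f"{title}[Title]",
        "retmode": "json",
        "retmax": str(retmax),
    }
    return ESEARCH_URL + "?" + urlencode(params)


# Placeholder titles for illustration only.
titles = ["Deep learning", "Genome assembly"]
for title in titles:
    print(build_title_search_url(title))
```

Each response lists the matching PMIDs; titles that match several articles still need to be disambiguated by hand.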

Add interface for Zotero translators

Hi Bill,
I get two other types of error messages for papers I can access if I click through from pubmed.
I suspect that the "badstatusline" error may relate to the fact that I am running the queries from within WSL.

Some example papers are
25176136 - an open NEJM paper
26030325 - a PubMedCentral paper
17074775 - a European heart journal paper

I have given an example of each type of error message

Messages follow

Trying to fetch pmid 25176136
Trying genericCitationLabelled
Trying pubmed_central
Trying acsPublications
Trying uchicagoPress
Trying science_direct
** fetching of reprint 25176136 failed from error Invalid URL '': No schema supplied. Perhaps you meant http://?
Trying to fetch pmid 26030325
** fetching of reprint 26030325 failed from error ('Connection aborted.', BadStatusLine("''",))
Trying to fetch pmid 17074775
** fetching of reprint 17074775 failed from error ('Connection aborted.', BadStatusLine("''",))

index out of range error

Hello,
I am getting an index out of range error for certain PMIDs.
I am able to download about 1 in 3 PMIDs I am seeking and the most common error is "index out of range"
Could the loop indexing be longer than the list of possible errors?
Many thanks
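
An "index out of range" error in a scraper typically means a lookup such as `matches[0]` ran on an empty result list, for example when a publisher page does not contain the expected link. Whether this is the cause in fetch_pdfs.py is an assumption. A minimal sketch of the guarded pattern, with a hypothetical regex and helper name:

```python
import re

# Hypothetical pattern: any href ending in ".pdf".
PDF_LINK = re.compile(r'href="([^"]+\.pdf)"')


def first_pdf_link(html):
    """Return the first PDF link in `html`, or None if there is none."""
    matches = PDF_LINK.findall(html)
    return matches[0] if matches else None


print(first_pdf_link('<a href="/content/paper.pdf">PDF</a>'))
print(first_pdf_link("<p>No links on this page</p>"))
```

Returning None lets the caller move on to the next finder instead of aborting the PMID with an IndexError.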

Error with Physiology Free articles

When fetching the physiology articles, I get:

python fetch_pdfs.py -pmid 11045978
Trying to fetch pmid 11045978
Trying genericCitationLabelled
Trying pubmed_central_v2
Trying acsPublications
Trying uchicagoPress
Trying nejm
Trying futureMedicine
** fetching reprint using the 'future medicine' finder...
** fetching of reprint 11045978 failed from error HTTPSConnectionPool(host='www.physiology.orghttps', port=443): Max retries exceeded with url: //www.physiology.org/doi/pdf/10.1152/ajpheart.2000.279.5.H2405 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x10e25a588>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known',))

This happens for virtually all of their papers. Can you help?

Thanks!
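
The host `www.physiology.orghttps` in the error suggests that a site prefix was concatenated with a link that was already absolute, giving `https://www.physiology.orghttps://www.physiology.org/doi/pdf/...`. A minimal sketch of how `urllib.parse.urljoin` handles both relative and absolute links. `absolute_link` is a hypothetical helper, and the page URL is an assumed example.

```python
from urllib.parse import urljoin


def absolute_link(page_url, href):
    """Resolve `href` against the page it was found on.

    urljoin leaves absolute links untouched and resolves relative ones,
    so the prefix is never glued onto an already complete URL.
    """
    return urljoin(page_url, href)


page = "https://www.physiology.org/doi/full/10.1152/ajpheart.2000.279.5.H2405"
pdf = "https://www.physiology.org/doi/pdf/10.1152/ajpheart.2000.279.5.H2405"

print(absolute_link(page, pdf))  # absolute link: returned unchanged
print(absolute_link(page, "/doi/pdf/10.1152/ajpheart.2000.279.5.H2405"))  # relative link
```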
