bibcure / scihub2pdf

Downloads PDFs via a DOI number, article title, or a bibtex file, using the databases of LibGen (Sci-Hub) and arXiv.

License: GNU Affero General Public License v3.0

Python 100.00%
sci-hub doi bibtex bibtexparser science scientific-journals latex arxiv

scihub2pdf's Introduction

bibcure (Beta Version)

Bibcure helps with boring tasks by keeping your bibtex file up to date and normalized.


Requirements/Install

Bibcure uses the wonderful python-bibtexparser. At the moment we are waiting for a new release of python-bibtexparser to fix some bugs.

Install it using pip:

$ sudo pip install bibcure
# or, for Python 3
$ sudo pip3 install bibcure

You can also install from source: clone the repository with git, then install it with the setup.py script.

scihub2pdf (beta)

If you want to download articles via a DOI number, article title, or a bibtex file, using the databases of arXiv, LibGen, or Sci-Hub, see bibcure/scihub2pdf.


Features and how to use

bibcure

Given a bib file...

$ bibcure -i input.bib -o output.bib
  • check whether the arXiv items have been published, then update them (requires internet connection),

  • complete all fields (url, journal, etc.) of all bib items using the DOI number (requires internet connection),

  • find and add the DOI associated with each bib item that has no DOI field (requires internet connection),

  • abbreviate journal names.
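As an illustration of the DOI-completion step above, here is a minimal sketch. It is not bibcure's code: bib entries are modeled as plain dicts, and `lookup_doi` is a hypothetical offline stand-in for the online title search the real tool performs.

```python
# Sketch of the "find items without a DOI" step that bibcure automates.
# 'lookup_doi' is a hypothetical stub; the real tool queries a service
# such as Crossref over the network here.

def lookup_doi(title):
    # Hypothetical offline stub mapping known titles to DOIs.
    known = {"Some Published Paper": "10.1000/example.doi"}
    return known.get(title)

def complete_dois(entries):
    """Return entries with a 'doi' field filled in where a lookup succeeds."""
    completed = []
    for entry in entries:
        entry = dict(entry)  # don't mutate the caller's data
        if "doi" not in entry:
            doi = lookup_doi(entry.get("title", ""))
            if doi is not None:
                entry["doi"] = doi
        completed.append(entry)
    return completed

bib = [
    {"ID": "a2016", "title": "Some Published Paper"},
    {"ID": "b2017", "title": "Untraceable Preprint", "doi": "10.2000/kept"},
]
print(complete_dois(bib))
```

The real tool parses the bib file with python-bibtexparser instead of hand-built dicts, but the control flow is the same: only entries missing a DOI trigger a lookup.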

arxivcheck

Given an arXiv id...

$ arxivcheck 1601.02785
  • check whether it has been published, and then return the updated bib entry (requires internet connection).

Given a title...

$ arxivcheck --title "A useful paper with hopefully unique title published on arxiv"
  • search for related papers and return a bibtex entry for the first result.

You can easily append a bib entry to a bib file:

$ arxivcheck --title "A useful paper with hopefully unique title published on arxiv" >> file.bib

You can also interact with the results; just pass the --ask parameter:

$ arxivcheck --ask --title "A useful paper with hopefully unique title published on arxiv"

scihub2pdf

Given a bibtex file...

$ scihub2pdf -i input.bib

Given a DOI number...

$ scihub2pdf 10.1038/s41524-017-0032-0

Given an arXiv id...

$ scihub2pdf arxiv:1708.06891

Given a title...

$ scihub2pdf --title "A useful paper with hopefully unique title"

or, searching on arXiv...

$ scihub2pdf --title arxiv:"A useful paper with hopefully unique title"

Pass the destination folder as an argument:

$ scihub2pdf -i input.bib -l somefolder/

Use LibGen instead of Sci-Hub:

$ scihub2pdf --uselibgen -i input.bib

doi2bib

Given a DOI number...

$ doi2bib 10.1038/s41524-017-0032-0
  • get the bib entry for a given DOI (requires internet connection).

You can easily append a bib entry to a bib file:

$ doi2bib 10.1038/s41524-017-0032-0 >> file.bib

You can also generate a bibtex file from a txt file containing a list of DOIs:

$ doi2bib --input file_with_dois.txt --output refs.bib

title2bib

Given a title...

$ title2bib "A useful paper with hopefully unique title"
  • search for related papers and return a bib entry for the selected paper (requires internet connection).

You can easily append a bib entry to a bib file:

$ title2bib "A useful paper with hopefully unique title" --first >> file.bib

You can also generate a bibtex file from a txt file containing a list of titles:

$ title2bib --input file_with_titles.txt --output refs.bib --first

Comparison: Sci-Hub vs LibGen

Sci-Hub

  • Stable
  • Annoying CAPTCHA
  • Fast

Libgen

  • Unstable
  • No CAPTCHA
  • Slow

License

GNU Affero General Public License v3.0. For more details, see the LICENSE file.

scihub2pdf's People

Contributors: devmessias

scihub2pdf's Issues

Problems with the domain

It seems that my school has banned the "sci-hub.cc" domain, but "sci-hub.bz" is still working. Is there any way to change the domain, perhaps by passing it as an argument or using some kind of config file? Searching through a list of possible domains could also be useful.
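For what it's worth, the fallback this issue asks for can be sketched in a few lines. This is not scihub2pdf code: the mirror list is an example, and the fetch callable is injected so the logic can be shown without network access.

```python
# Hypothetical sketch: try a list of candidate Sci-Hub mirrors in order
# and return the first one that answers. 'fetch' stands in for a real
# HTTP probe such as requests.get(base, timeout=10).

MIRRORS = ["https://sci-hub.cc", "https://sci-hub.bz"]  # example domains

def first_working_mirror(mirrors, fetch):
    """Return the first mirror for which fetch(url) succeeds, else None."""
    for base in mirrors:
        try:
            fetch(base)
            return base
        except OSError:  # banned / unreachable mirror
            continue
    return None

def fake_fetch(url):
    # Simulate the issue: the first domain is banned, the second works.
    if "sci-hub.cc" in url:
        raise OSError("domain banned")
    return "ok"

print(first_working_mirror(MIRRORS, fake_fetch))
```

A config file or `--domain` flag could then feed the mirror list instead of hard-coding it.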

Exceptions with example sci2pdf 10.1038/s41524-017-0032-0

Debian 8

jaap@jaap:/$ sci2pdf 10.1038/s41524-017-0032-0
10.1038/s41524-017-0032-0
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 595, in urlopen
    chunked=chunked)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 393, in _make_request
    six.raise_from(e, None)
  File "<string>", line 2, in raise_from
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 389, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.4/http/client.py", line 1172, in getresponse
    response.begin()
  File "/usr/lib/python3.4/http/client.py", line 351, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.4/http/client.py", line 321, in _read_status
    raise BadStatusLine(line)
http.client.BadStatusLine: ''

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/requests/adapters.py", line 423, in send
    timeout=timeout
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 640, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 261, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/lib/python3/dist-packages/six.py", line 685, in reraise
    raise value.with_traceback(tb)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 595, in urlopen
    chunked=chunked)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 393, in _make_request
    six.raise_from(e, None)
  File "<string>", line 2, in raise_from
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 389, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.4/http/client.py", line 1172, in getresponse
    response.begin()
  File "/usr/lib/python3.4/http/client.py", line 351, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.4/http/client.py", line 321, in _read_status
    raise BadStatusLine(line)
requests.packages.urllib3.exceptions.ProtocolError: ('Connection aborted.', BadStatusLine("''",))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/sci2pdf", line 80, in <module>
    main()
  File "/usr/local/bin/sci2pdf", line 68, in main
    download_from_doi(value, location)
  File "/usr/local/lib/python3.4/dist-packages/sci2pdf/libgen.py", line 91, in download_from_doi
    bib_libgen = get_libgen_url(bib)
  File "/usr/local/lib/python3.4/dist-packages/sci2pdf/libgen.py", line 22, in get_libgen_url
    r = requests.get(url, params=params, headers=headers)
  File "/usr/lib/python3/dist-packages/requests/api.py", line 70, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/api.py", line 56, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 475, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 596, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/adapters.py", line 473, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', BadStatusLine("''",))
jaap@jaap:/$

Scihub2pdf does not work now! I guess the base Sci-Hub root URL has changed.


root@ecs-6e13:~# scihub2pdf  doi:10.1016/j.patcog.2016.10.023  --uselibgen

	 Using Libgen.

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 144, in _new_conn
    (self.host, self.port), self.timeout, **extra_kw)
  File "/usr/lib/python3/dist-packages/urllib3/util/connection.py", line 60, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/usr/lib/python3.6/socket.py", line 745, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -5] No address associated with hostname

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 601, in urlopen
    chunked=chunked)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 357, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/usr/lib/python3.6/http/client.py", line 1281, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1327, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1276, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/lib/python3.6/http/client.py", line 1042, in _send_output
    self.send(msg)
  File "/usr/lib/python3.6/http/client.py", line 980, in send
    self.connect()
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 169, in connect
    conn = self._new_conn()
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 153, in _new_conn
    self, "Failed to establish a new connection: %s" % e)
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7fbf0d0f29e8>: Failed to establish a new connection: [Errno -5] No address associated with hostname

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/requests/adapters.py", line 440, in send
    timeout=timeout
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 639, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 398, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='libgen.io', port=80): Max retries exceeded with url: /scimag/ads.php?doi=doi%3A10.1016%2Fj.patcog.2016.10.023&downloadname= (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fbf0d0f29e8>: Failed to establish a new connection: [Errno -5] No address associated with hostname',))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/scihub2pdf", line 191, in <module>
    main()
  File "/usr/local/bin/scihub2pdf", line 148, in main
    download_from_doi(value, location, use_libgen)
  File "/usr/local/lib/python3.6/dist-packages/scihub2pdf/download.py", line 161, in download_from_doi
    download_from_libgen(doi, pdf_file)
  File "/usr/local/lib/python3.6/dist-packages/scihub2pdf/download.py", line 68, in download_from_libgen
    found, r = ScrapLib.navigate_to(doi, pdf_file)
  File "/usr/local/lib/python3.6/dist-packages/scihub2pdf/libgen.py", line 44, in navigate_to
    headers=self.headers
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 533, in get
    return self.request('GET', url, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 520, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 630, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/adapters.py", line 508, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='libgen.io', port=80): Max retries exceeded with url: /scimag/ads.php?doi=doi%3A10.1016%2Fj.patcog.2016.10.023&downloadname= (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fbf0d0f29e8>: Failed to establish a new connection: [Errno -5] No address associated with hostname',))

Download has stopped because of the captcha?

I have tried to download PDFs using the list of DOIs that I have stored in a .txt file. Then I got an issue after 3-4 PDFs were successfully downloaded:

DOI:  10.1016/j.telpol.2009.08.001
	Sci-Hub Link:  http://sci-hub.tw/10.1016/j.telpol.2009.08.001
	checking if has captcha...
	Download: ok

	DOI:  10.1080/0268396032000150816
	Sci-Hub Link:  http://sci-hub.tw/10.1080/0268396032000150816
	checking if has captcha...
Traceback (most recent call last):
  File "/Users/mustikarizkifitriyanti/anaconda/envs/thesis/bin/scihub2pdf", line 191, in <module>
    main()
  File "/Users/mustikarizkifitriyanti/anaconda/envs/thesis/bin/scihub2pdf", line 163, in main
    download_from_doi(value, location, use_libgen)
  File "/Users/mustikarizkifitriyanti/anaconda/envs/thesis/lib/python2.7/site-packages/scihub2pdf/download.py", line 163, in download_from_doi
    download_from_scihub(doi, pdf_file)
  File "/Users/mustikarizkifitriyanti/anaconda/envs/thesis/lib/python2.7/site-packages/scihub2pdf/download.py", line 105, in download_from_scihub
    captcha_img = ScrapSci.get_captcha_img()
  File "/Users/mustikarizkifitriyanti/anaconda/envs/thesis/lib/python2.7/site-packages/scihub2pdf/scihub.py", line 98, in get_captcha_img
    self.driver.execute_script("document.getElementById('content').style.zIndex = 9999;")
  File "/Users/mustikarizkifitriyanti/anaconda/envs/thesis/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 635, in execute_script
    'args': converted_args})['value']
  File "/Users/mustikarizkifitriyanti/anaconda/envs/thesis/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 320, in execute
    self.error_handler.check_response(response)
  File "/Users/mustikarizkifitriyanti/anaconda/envs/thesis/lib/python2.7/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: {"errorMessage":"null is not an object (evaluating 'document.getElementById('content').style')","request":{"headers":{"Accept":"application/json","Accept-Encoding":"identity","Connection":"close","Content-Length":"134","Content-Type":"application/json;charset=UTF-8","Host":"127.0.0.1:49931","User-Agent":"selenium/3.13.0 (python mac)"},"httpVersion":"1.1","method":"POST","post":"{\"sessionId\": \"927e3730-9652-11e8-ae2e-f99d263e318f\", \"args\": [], \"script\": \"document.getElementById('content').style.zIndex = 9999;\"}","url":"/execute","urlParsed":{"anchor":"","query":"","file":"execute","directory":"/","path":"/execute","relative":"/execute","port":"","host":"","password":"","user":"","userInfo":"","authority":"","protocol":"","source":"/execute","queryKey":{},"chunks":["execute"]},"urlOriginal":"/session/927e3730-9652-11e8-ae2e-f99d263e318f/execute"}}
Screenshot: available via screen

I wonder whether this happens because that specific DOI has a captcha. Can anyone help me solve this issue?

Direct output info to files

Nice job. Most articles are downloaded automatically, but a few cannot be found, like the one below:
DOI: 10.1080/16742834.2014.11447220
Sci-Hub Link: https://sci-hub.se/10.1080/16742834.2014.11447220
checking if has captcha...
No pdf found. Maybe, the sci-hub dosen't have the file
Try to open the link in your browser.

Then I'd like to grep only the Sci-Hub Link and save it to a.txt.
scihub2pdf --title ${article_title} > a.txt
doesn't work; it treats '${article_title} > a.txt' as a whole title.
I would appreciate it if you could solve this, for example by adding an option -d to direct the scihub2pdf info to files. Thanks.
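As a side note, the redirection failure described in this issue is shell quoting rather than scihub2pdf itself: quote the variable and leave the redirection outside the quotes, and the shell performs it normally. A minimal sketch (with `echo` standing in for `scihub2pdf --title`, and a made-up title):

```shell
# Hypothetical title; echo stands in for scihub2pdf --title here.
article_title="A useful paper with hopefully unique title"

# Quote the expansion so spaces in the title stay one argument, and keep
# '> a.txt' outside the quotes so the shell parses it as a redirection.
# 2>&1 also captures messages printed to stderr.
echo "Searching for: $article_title" > a.txt 2>&1

# Keep only the line of interest:
grep "Searching for" a.txt
```

With the real tool the same pattern would be `scihub2pdf --title "$article_title" > a.txt 2>&1`, then grep the saved file for the Sci-Hub link.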

Multiple requests at the same time

The LibGen site says the limit is 40 connections per user, but for some reason I can only make ~3 requests at the same time. I think this issue is related to my code (I have not yet studied the documentation of the requests library)... LibGen may also limit requests per user within a given interval of time; I didn't find any information about that.
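A bounded worker pool is the usual way to hold parallel requests under a server's limit. Here is a minimal standard-library sketch, not scihub2pdf code: `download` is a hypothetical stand-in for one LibGen request, and `max_workers` caps how many run at once.

```python
# Sketch: cap in-flight requests with a thread pool. 'download' is a
# hypothetical stand-in for a single LibGen HTTP request; max_workers
# bounds concurrency well under the stated 40-connection limit.
from concurrent.futures import ThreadPoolExecutor

def download(doi):
    # Stand-in for requests.get(...); returns a fake result.
    return "pdf-bytes-for-" + doi

def download_all(dois, max_workers=3):
    # pool.map preserves input order and never runs more than
    # max_workers downloads at the same time.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(download, dois))

dois = ["10.1/a", "10.1/b", "10.1/c", "10.1/d"]
print(download_all(dois))
```

Raising `max_workers` toward the server's limit is then a one-line change, and any per-interval rate limit can be handled by sleeping inside `download`.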
