inspirehep / refextract
Extract bibliographic references from (High-Energy Physics) articles.
License: GNU General Public License v2.0
Trying to call
refextract.extract_references_from_file()
raises the following error:
TypeError Traceback (most recent call last)
<ipython-input-39-4d7d70a5a8a2> in <module>()
1 print(fnames[0])
----> 2 data = refextract.extract_references_from_file(fnames[0])
/usr/local/lib/python3.6/dist-packages/refextract/references/api.py in extract_references_from_file(path, recid, reference_format, linker_callback, override_kbs_files)
126 raise FullTextNotAvailableError(u"File not found: '{0}'".format(path))
127
--> 128 docbody = get_plaintext_document_body(path)
129 reflines, dummy, dummy = extract_references_from_fulltext(docbody)
130 if not reflines:
/usr/local/lib/python3.6/dist-packages/refextract/references/engine.py in get_plaintext_document_body(fpath, keep_layout)
1399
1400 elif mime_type == "application/pdf":
-> 1401 textbody = convert_PDF_to_plaintext(fpath, keep_layout)
1402
1403 else:
/usr/local/lib/python3.6/dist-packages/refextract/documents/pdf.py in convert_PDF_to_plaintext(fpath, keep_layout)
455 into plaintext; each string is a line in the document.)
456 """
--> 457 if not os.path.isfile(CFG_PATH_PDFTOTEXT):
458 raise IOError('Missing pdftotext executable')
459
/usr/lib/python3.6/genericpath.py in isfile(path)
28 """Test whether a path is a regular file"""
29 try:
---> 30 st = os.stat(path)
31 except OSError:
32 return False
TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType
How to reproduce:
>>> from refextract import extract_references_from_string
>>> extract_references_from_string('G. W. and L. B. and M. M. G. and T. A. and E. L. I. and E. P. and X. M. and B. Urbaszek, Magneto-optics in transition metal diselenide monolayers. 2D Mater. 2, 34002 (2015).')
This hangs refextract for days, at least.
The reason appears to be catastrophic backtracking in this regex:
refextract/refextract/authors/regexs.py
Lines 491 to 494 in 27588da
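For illustration, here is a generic instance of catastrophic backtracking (the actual refextract author pattern is more elaborate, but the failure mode is the same): a nested quantifier such as `(a+)+` gives the engine exponentially many ways to carve up an input that almost matches.

```python
import re

# Nested quantifier: every partition of a run of 'a's is a candidate
# match, so a near-miss input triggers exponential backtracking.
dangerous = re.compile(r'^(a+)+b$')
assert dangerous.match('aaab') is not None   # matches instantly
# dangerous.match('a' * 40 + 'c') would effectively never return:
# all partitions of the 40 'a's are tried before the match fails.

# An equivalent pattern without nesting fails in linear time.
safe = re.compile(r'^a+b$')
assert safe.match('aaab') is not None
assert safe.match('a' * 40 + 'c') is None
```

Rewriting the author regex so that no quantified group contains another unbounded quantifier over the same characters removes the blow-up.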
It should be possible to run refextract in micro-service mode: a POST request containing a reference would return the parsed JSON.
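A minimal sketch of such a service using only the standard library. The `/extract` route and the JSON payload shape are my assumptions, and `parse_reference` is a placeholder for a real call such as `refextract.extract_journal_reference`:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def parse_reference(text):
    # Placeholder for a real refextract call,
    # e.g. extract_journal_reference(text).
    return {'raw_ref': text}

class RefextractHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON body, parse the reference, return JSON.
        length = int(self.headers['Content-Length'])
        payload = json.loads(self.rfile.read(length))
        body = json.dumps(parse_reference(payload['reference'])).encode()
        self.send_response(200)
        self.send_header('Content-Type', 'application/json')
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet

# To serve:
# HTTPServer(('127.0.0.1', 5000), RefextractHandler).serve_forever()
```

A production service would likely use a proper framework and add timeouts, given the hangs reported elsewhere in this tracker.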
Got a Unicode error when parsing arXiv 1701.04322:
[2017-01-18 14:56:30,362: ERROR/MainProcess] Task invenio_workflows.tasks.start[c031ad62-e178-43d7-a6a5-c288ca3a1da0] raised unexpected: UnicodeEncodeError('ascii', u'* Unknown citation found. Searching for book title in: , General topology. Mathematical Monographs, Vol. 60, PWN\u2014 Polish Scientific Publishers, Warsaw, (1977).', 112, 113, 'ordinal not in range(128)')
Traceback (most recent call last):
File "/opt/inspire/lib/python2.7/site-packages/celery/app/trace.py", line 240, in trace_task
R = retval = fun(*args, **kwargs)
File "/opt/inspire/src/flask-celeryext/flask_celeryext/app.py", line 52, in __call__
return Task.__call__(self, *args, **kwargs)
File "/opt/inspire/lib/python2.7/site-packages/celery/app/trace.py", line 438, in __protected_call__
return self.run(*args, **kwargs)
File "/opt/inspire/lib/python2.7/site-packages/invenio_workflows/tasks.py", line 77, in start
return text_type(run_worker(workflow_name, data, **kwargs).uuid)
File "/opt/inspire/lib/python2.7/site-packages/invenio_workflows/worker_engine.py", line 52, in run_worker
engine.process(objects, **kwargs)
File "/opt/inspire/src/workflow/workflow/engine.py", line 390, in process
self._process(objects)
File "/opt/inspire/src/workflow/workflow/engine.py", line 547, in _process
obj, self, callbacks, exc_info
File "/opt/inspire/lib/python2.7/site-packages/invenio_workflows/engine.py", line 364, in Exception
obj, eng, callbacks, exc_info
File "/opt/inspire/src/workflow/workflow/engine.py", line 970, in Exception
reraise(*exc_info)
File "/opt/inspire/src/workflow/workflow/engine.py", line 529, in _process
self.run_callbacks(callbacks, objects, obj)
File "/opt/inspire/src/workflow/workflow/engine.py", line 465, in run_callbacks
indent + 1)
File "/opt/inspire/src/workflow/workflow/engine.py", line 465, in run_callbacks
indent + 1)
File "/opt/inspire/src/workflow/workflow/engine.py", line 481, in run_callbacks
self.execute_callback(callback_func, obj)
File "/opt/inspire/src/workflow/workflow/engine.py", line 564, in execute_callback
callback(obj, self)
File "/opt/inspire/src/inspire/inspirehep/modules/workflows/tasks/arxiv.py", line 134, in arxiv_refextract
mapped_references = extract_references(pdf.file.uri)
File "/opt/inspire/src/inspire/inspirehep/modules/refextract/tasks.py", line 90, in extract_references
reference_format="{title},{volume},{page}",
File "/opt/inspire/lib/python2.7/site-packages/refextract/references/api.py", line 140, in extract_references_from_file
override_kbs_files=override_kbs_files,
File "/opt/inspire/lib/python2.7/site-packages/refextract/references/engine.py", line 1358, in parse_references
parse_references_elements(reference_lines, kbs, linker_callback)
File "/opt/inspire/lib/python2.7/site-packages/refextract/references/engine.py", line 819, in parse_references_elements
ref_line, kbs, bad_titles_count, linker_callback)
File "/opt/inspire/lib/python2.7/site-packages/refextract/references/engine.py", line 630, in parse_reference_line
look_for_undetected_books(splitted_citations, kbs)
File "/opt/inspire/lib/python2.7/site-packages/refextract/references/engine.py", line 659, in look_for_undetected_books
search_for_book_in_misc(citation, kbs)
File "/opt/inspire/lib/python2.7/site-packages/refextract/references/engine.py", line 668, in search_for_book_in_misc
citation_element['misc_txt'])
File "/usr/lib64/python2.7/socket.py", line 316, in write
data = str(data) # XXX Should really reject non-string non-buffers
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2014' in position 112: ordinal not in range(128)
Using refextract==0.1.0.
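A minimal reproduction of the failure mode, runnable on Python 3 as well (the crash itself comes from Python 2's implicit ASCII conversion when the message is written to a stream):

```python
# The em dash U+2014 from the reference cannot be encoded as ASCII,
# which is what Python 2's implicit str() conversion attempts.
text = u'PWN\u2014 Polish Scientific Publishers'

try:
    text.encode('ascii')
except UnicodeEncodeError:
    pass  # the same failure the workflow log above shows

# Encoding explicitly as UTF-8 (or writing through an encoding-aware
# stream) avoids the crash.
assert text.encode('utf-8').decode('utf-8') == text
```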
We noticed with @kaplun that a DOI is sometimes broken up by TeX and appears on multiple lines. This is the default behavior of the url LaTeX package (which is often used to typeset URLs, and is used internally by the hyperref package), contrary to what @tsgit said earlier. In that case, only the part of the DOI on the first line is taken as the DOI, which is of course wrong.
The default behavior (the \UrlBreaks and \UrlBigBreaks macros in url.sty) is to allow line breaks after:
. @ \ / ! _ | ; > ] ) , ? & ' + = # :
and also after - if the [hyphens] option is passed to the package.
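This suggests a repair on the extraction side. Here is a sketch under my own heuristic (not refextract's actual logic): if a candidate DOI line ends with one of the break characters listed above, the DOI continues on the next line.

```python
# Characters url.sty allows a break after (plus '-' for [hyphens]).
BREAK_CHARS = set(". @ \\ / ! _ | ; > ] ) , ? & ' + = # : -".split())

def rejoin_doi(lines):
    """Concatenate continuation lines of a DOI that url.sty was
    allowed to break after one of BREAK_CHARS."""
    doi = lines[0]
    for line in lines[1:]:
        if doi[-1] not in BREAK_CHARS:
            break  # the DOI ended on the previous line
        doi += line
    return doi
```

For example, a DOI broken after a slash is rejoined, while a line that does not end in a break character is left alone.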
Hi,
I tried refextract with Python 3.5, but it appears not to be compatible: Python 3.5 does not support raw unicode literals (ur'...'), since strings are unicode by default (see http://stackoverflow.com/a/27482285/108301). A suggested fix would be to use the u function from six.
Traceback (most recent call last):
File "cli.py", line 1, in <module>
from refextract import extract_references_from_file
File "C:\Python35\lib\site-packages\refextract\__init__.py", line 28, in <module>
from .references.api import (
File "C:\Python35\lib\site-packages\refextract\references\api.py", line 37, in <module>
from .engine import (get_kbs,
File "C:\Python35\lib\site-packages\refextract\references\engine.py", line 166
re_report = re.compile(ur'^(?P<name>[A-Z-]+)(?P<nums>[\d-]+)$', re.UNICODE)
SyntaxError: invalid syntax
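For this particular line, a plain raw string appears sufficient on both Python 2 and 3, since the pattern itself is ASCII and re.UNICODE already controls the matching semantics (six.u() would work too):

```python
import re

# ur'...' is a SyntaxError on Python 3; r'...' compiles on both.
re_report = re.compile(r'^(?P<name>[A-Z-]+)(?P<nums>[\d-]+)$', re.UNICODE)

match = re_report.match('DESY-17-036')
assert match is not None
assert match.group('nums') == '17-036'
```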
At the moment
"[1] Iso S. and Orikasa Y. Prog. Theor. Exp. Phys. 2013, 023B08 (2013)."
is split into
"999C5 $$o1$$hIso S. and Orikasa Y.$$sPTEP,2013,023$$m08$$y2013"
but it should be
"999C5 $$o1$$hIso S. and Orikasa Y.$$sPTEP,2013,023B08$$y2013"
Input PDF has two-columned layout. Refextract outputs empty array of references.
from refextract import extract_references_from_url
references = extract_references_from_url('https://arxiv.org/pdf/1710.11035.pdf')
print(references[0])
Input PDF has one-columned layout. Refextract works fine.
from refextract import extract_references_from_url
references = extract_references_from_url('https://arxiv.org/pdf/1509.03588.pdf')
print(references[0])
How can I get refextract to parse both types of layout?
Thank you.
Given this document Kotti et al. - 2023 - Machine Learning for Software Engineering A Terti.pdf, I would expect that any two references r1 and r2 extracted with the same lineno satisfy r1 == r2, and in particular that r1['title'] == r2['title']. Instead, the extracted references contain multiple contradictory results.
To reproduce: install pytest-subtests, then run the test below against the document attached above.
# with subtests from pytest-subtests
def test_reference_consistency(path, subtests):
    """
    Ensure that, for each line in the file, there are no inconsistent
    duplicate references: any two references ref1 and ref2 with
    ref1.lineno == ref2.lineno must satisfy ref1 == ref2.
    """
    refs = extract_references_from_file(path)
    # Group the references by line number
    lines = {}
    for ref in refs:
        lineno = ref['linemarker'][0]
        lines.setdefault(lineno, []).append(ref)
    # Check for inconsistent duplicate references on each line
    for lineno, line_refs in lines.items():
        if len(line_refs) == 1:
            continue
        with subtests.test('line', lineno=lineno, refs=line_refs):
            # Each adjacent pair of references on the line must be equal
            for ref1, ref2 in zip(line_refs, line_refs[1:]):
                assert ref1 == ref2, f"Found inconsistent references: {ref1} and {ref2}"
refextract apparently doesn't understand dashes in volume numbers.
Nucl Phys A904-905 270c
becomes
Nucl.Phys.,A904,905
The search for
999c5s:Nucl.Phys.,A904,905
finds 192 records.
"http://arxiv.org/pdf/1307.2978.pdf has some dashes in the references. GROBID's output is pretty terrible in this case: <title/> M M Aggarwal <title level="m">Luo [STAR Collaboration], Nucl. Phys. A904-905, 911c (2013); L. Kumar [STAR Collaboration], Nucl. Phys. A904-905</title> (but at least is not mangling the dashes)."
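A hedged sketch of a volume pattern that keeps the dashed range together (illustrative only, not refextract's actual journal grammar):

```python
import re

# Series letter followed by a number and an optional dashed second
# number, kept as a single volume token instead of volume + page.
VOLUME_RE = re.compile(r'\b(?P<series>[A-Z])(?P<vol>\d+(?:-\d+)?)')

match = VOLUME_RE.search('Nucl Phys A904-905 270c')
assert match.group(0) == 'A904-905'
```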
When running extract_references_from_file(path) on this file
Kotti et al. - 2023 - Machine Learning for Software Engineering A Terti.pdf
on macOS Ventura 13.3.1, the following exception gets thrown:
def clean_pdf_file(filename):
    """
    strip leading and/or trailing junk from a PDF file
    """
    with open(filename, 'r+b') as file, mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_WRITE) as mmfile:
        start = mmfile.find(b'%PDF-')
        if start == -1:
            # no PDF marker found
            LOGGER.debug('not a PDF file')
            return
        end = mmfile.rfind(b'%%EOF')
        offset = len(b'%%EOF')
        if start > 0:
            LOGGER.debug('moving and truncating')
            mmfile.move(0, start, end + offset - start)
            # mmfile.resize(end + offset - start)
            mmfile.flush()
        elif end > 0 and end + offset != mmfile.size():
            LOGGER.debug('truncating only')
>           mmfile.resize(end + offset - start)
E           SystemError: mmap: resizing not available--no mremap()

../venv/lib/python3.10/site-packages/refextract/references/engine.py:1412: SystemError
DOIs can be URL-encoded, which has the effect of replacing / by %2F. Refextract fails on this.
@kaplun is working on a patch, but we also need to run it again somehow on existing records after #12 has been fixed (LRR DOIs get mangled)
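The needed normalization can be sketched with the standard library (the real fix belongs in refextract's DOI recognition):

```python
from urllib.parse import unquote

def normalize_doi(doi):
    """Percent-decode a DOI before parsing it, so '%2F' (or '%2f')
    becomes '/' again. DOIs that are not encoded pass through."""
    return unquote(doi)
```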
The year is often taken out of report numbers, resulting in invalid report numbers.
There is a whitelist of report numbers in https://github.com/inspirehep/refextract/blob/master/refextract/references/kbs/report-numbers.kb which needs updating (see invenio-scripts/unrecognized_report_numbers.py), plus a bibcheck rule to correct existing records.

TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType
Hello. My code is:
from refextract import extract_references_from_file
import os
path = r"E:\finance Python\2022 business\1226 pdf\42_56"
name="test.pdf"
file=name
print(file)
st = os.stat(file)
print(st)
references = extract_references_from_file(os.path.join(path, name))
print(references[0])
But unfortunately the path raises an error. I also changed the path to "test.pdf", but it still does not work. Please help!
I get the following error when trying out the example code from the refextract
docs. I will explain my system below.
Error: TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType
I installed with pip install refextract in the terminal on macOS version 10.11.6 (15G22010). The installation succeeded, although I did have to manually install libmagic using brew install libmagic, as I was getting an error initially.
I tried first,
from refextract import extract_references_from_file
references = extract_references_from_file('some-local-filename.pdf')
print(references)
and got the following error:
TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType
Then, similar to the example code from the docs, I changed the code to,
from refextract import extract_references_from_file
references = extract_references_from_file('https://arxiv.org/pdf/1503.07589.pdf')
print(references)
which gives the same error: TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType
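For what it's worth, this TypeError usually means CFG_PATH_PDFTOTEXT is None because refextract could not locate the pdftotext binary, not that the PDF path itself is wrong (note also that extract_references_from_file expects a local path; extract_references_from_url is for URLs). A quick diagnostic sketch, with the poppler hint being an assumption about the setup:

```python
import shutil

def check_pdftotext():
    """Return the path of the pdftotext binary, or None if it is not
    installed (the usual cause of the NoneType error above)."""
    path = shutil.which('pdftotext')
    if path is None:
        print('pdftotext not found; install poppler '
              '(e.g. `brew install poppler` on macOS)')
    return path
```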
Sometimes the code gets the year as the page number. For example, in this case:
Y. Fu, H. Liu, X. Su, Y. Mi, and S. Tian, "Probabilistic direct load flow algorithm for unbalanced distribution networks considering uncertainties of PV and load," IET Renewable Power Generation, vol. 13, no. 11, pp. 1968-1980, 2019.
I'm getting that the publication year is 1968 instead of 2019.
Any help in getting through this issue will be greatly appreciated.
Hi guys,
I installed refextract and bumped into this error. It happens no matter what file I am trying to open, whether from a local file or from a URL. For instance:
from refextract import extract_references_from_url
reference = extract_references_from_url("http://arxiv.org/pdf/1503.07589v1.pdf")
Any ideas how to solve it?
The error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
in ()
1 from refextract import extract_references_from_url
----> 2 reference = extract_references_from_url("http://arxiv.org/pdf/1503.07589v1.pdf")
/Users/pacausd1/anaconda/envs/python2/lib/python2.7/site-packages/refextract/references/api.pyc in extract_references_from_url(url, headers, chunk_size, **kwargs)
92 for chunk in req.iter_content(chunk_size):
93 f.write(chunk)
---> 94 references = extract_references_from_file(filepath, **kwargs)
95 except requests.exceptions.HTTPError as e:
96 raise FullTextNotAvailableError("URL not found: '{0}'".format(url)), None, sys.exc_info()[2]
/Users/pacausd1/anaconda/envs/python2/lib/python2.7/site-packages/refextract/references/api.pyc in extract_references_from_file(path, recid, reference_format, linker_callback, override_kbs_files)
132 raise FullTextNotAvailableError("File not found: '{0}'".format(path))
133
--> 134 docbody = get_plaintext_document_body(path)
135 reflines, dummy, dummy = extract_references_from_fulltext(docbody)
136 if not reflines:
/Users/pacausd1/anaconda/envs/python2/lib/python2.7/site-packages/refextract/references/engine.pyc in get_plaintext_document_body(fpath, keep_layout)
1400
1401 elif mime_type == "application/pdf":
-> 1402 textbody = convert_PDF_to_plaintext(fpath, keep_layout)
1403
1404 else:
/Users/pacausd1/anaconda/envs/python2/lib/python2.7/site-packages/refextract/documents/pdf.pyc in convert_PDF_to_plaintext(fpath, keep_layout)
487 into plaintext; each string is a line in the document.)
488 """
--> 489 if not os.path.isfile(CFG_PATH_PDFTOTEXT):
490 raise FileNotFoundError('Missing pdftotext executable')
491
/Users/pacausd1/anaconda/envs/python2/lib/python2.7/genericpath.pyc in isfile(path)
35 """Test whether a path is a regular file"""
36 try:
---> 37 st = os.stat(path)
38 except os.error:
39 return False
TypeError: coercing to Unicode: need string or buffer, NoneType found
Currently refextract is full of print() statements used to report on the current execution. Best practice would be to use the logging library instead.
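A sketch of the suggested change (the function name and message are illustrative, not actual refextract code):

```python
import logging

# A module-level logger instead of bare print() calls, so callers can
# tune verbosity and routing without touching library code.
LOGGER = logging.getLogger('refextract')

def report_progress(filename, n_refs):
    # Before: print("Processed %s: %d references" % (filename, n_refs))
    LOGGER.info('Processed %s: %d references', filename, n_refs)
```

Applications then configure the handler level once, e.g. `logging.basicConfig(level=logging.INFO)`, instead of suppressing stdout.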
(tensorflow_keras) C:\Users\anupamag\Desktop\PYTHON Examples>python paper_stats.py
Traceback (most recent call last):
  File "paper_stats.py", line 3, in <module>
    import refextract
  File "C:\Users\anupamag\AppData\Local\Continuum\Anaconda3\envs\tensorflow_keras\lib\site-packages\refextract\__init__.py", line 28, in <module>
    from .references.api import (
  File "C:\Users\anupamag\AppData\Local\Continuum\Anaconda3\envs\tensorflow_keras\lib\site-packages\refextract\references\api.py", line 96
    raise FullTextNotAvailableError("URL not found: '{0}'".format(url)), None, sys.exc_info()[2]
                                                                       ^
SyntaxError: invalid syntax
This error turns up simply while importing the module:
import sys
import re
from refextract import extract_journal_reference
from refextract import extract_references_from_file
from refextract import extract_references_from_url
reference = extract_references_from_url("https://arxiv.org/pdf/1704.06040.pdf")
print(reference)
I'm using Python 3.5.3. Is this a known issue?
From @annetteholtkamp on March 23, 2018 9:32
Reportnumbers like
DELPHI-2001-138 PHYS 909
or
DELPHI Note 2001-138/PHYS-909
should be recognized by refextract. Currently these are put into $$m.
Victor has recently uploaded many DELPHI Notes but they are not gathering any citations.
Copied from original issue: inspirehep/inspire-next#3284
Hi,
I am running refextract on macOS Sonoma 14.0. I had a consistent error with the mmap resize in clean_pdf_file in engine.py.
To fix it, I removed the if/elif block in the function and used flush to do all the work, as follows:
mmfile.flush(start, end + offset - start)
I believe that should do the work without invoking the system-dependent resize and move. Please let me know what you think.
Best wishes,
Ahmed
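For comparison, here is a hedged sketch of an alternative that actually shrinks the file without mmap.resize(): move the PDF bytes to the start of the mapping, close the map, then truncate through the file object, which is portable. The function name and structure are mine, not the engine.py code:

```python
import mmap

def clean_pdf_file_portable(filename):
    """Strip leading/trailing junk around the %PDF- ... %%EOF markers.
    file.truncate() replaces mmap.resize(), which needs mremap() and
    therefore raises SystemError on macOS."""
    with open(filename, 'r+b') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE) as mm:
            start = mm.find(b'%PDF-')
            eof = mm.rfind(b'%%EOF')
            if start == -1 or eof == -1:
                return  # no usable PDF markers
            end = eof + len(b'%%EOF')
            if start > 0:
                # Shift the real PDF bytes to the start of the file.
                mm.move(0, start, end - start)
            mm.flush()
        # The mapping is closed; shrinking the file is now safe.
        f.truncate(end - start)
```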
You have a right parenthesis appearing in the wrong place on line 96 of api.py:
File "[myhome]/anaconda/envs/py35/lib/python3.5/site-packages/refextract/references/api.py", line 96
raise FullTextNotAvailableError("URL not found: '{0}'".format(url)), None, sys.exc_info()[2]
^
SyntaxError: invalid syntax
Should this instead read
raise FullTextNotAvailableError("URL not found: '{0}'".format(url), None, sys.exc_info()[2])
?
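For reference, the original line is the Python 2 three-argument raise, which is a syntax error on Python 3; only moving the parenthesis would pass None and the traceback as constructor arguments rather than preserving the traceback. A Python 3 sketch of the intended behavior (the exception name mirrors the traceback above; the failing request is simulated):

```python
import sys

class FullTextNotAvailableError(Exception):
    pass

def fetch_fulltext(url):
    """Re-raise with the original traceback attached, the Python 3
    equivalent of `raise X, None, sys.exc_info()[2]`."""
    try:
        raise IOError('HTTP 404')  # stand-in for the failed request
    except IOError:
        tb = sys.exc_info()[2]
        raise FullTextNotAvailableError(
            "URL not found: '{0}'".format(url)
        ).with_traceback(tb)
```

On a codebase that must support both interpreters, six.reraise offers the same behavior.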
Sometimes a url contains a system identifier, e.g.
https://cds.cern.ch/record/2064383
in https://inspirehep.net/record/1611588
refextract could isolate the system id and put it, with an appropriate prefix, into a designated field. Should we create a new field for external identifiers, or put it into the same field as a DOI?
This system id can then be used to link the reference to a record.
We have more than 3000 records with a CDS url in the refs.
Similarly for ADS (~1000 records):
e.g. http://adsabs.harvard.edu/abs/1990ApJ...360..242S
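A hypothetical helper sketching the extraction (the regexes, prefixes, and return shape are my assumptions, not an INSPIRE convention):

```python
import re

# Recognize CDS and ADS URLs and pull out the system identifier.
CDS_RE = re.compile(r'cds\.cern\.ch/record/(\d+)')
ADS_RE = re.compile(r'adsabs\.harvard\.edu/abs/([0-9A-Za-z.&+]+)')

def extract_system_id(url):
    for prefix, pattern in (('CDS', CDS_RE), ('ADS', ADS_RE)):
        match = pattern.search(url)
        if match:
            return (prefix, match.group(1))
    return None
```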
References like "T. Venumadhav; F.-Y. Cyr-Racine; K. N. Abazajian; and C. M. Hirata: Sterile neutrino dark matter: Weak interactions in the strong coupling epoch, Phys. Rev. D94, 043515 (2016), 1507.06655." are split at every ";", creating additional nonsense references consisting of just an author.
I would propose not to split off strings that contain no number.
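The proposed heuristic could be sketched like this (my own simplification of the splitting logic, not refextract's code):

```python
import re

def split_reference(raw):
    """Only close a ';'-separated chunk once it contains a digit
    (year, volume, page); chunks without any number are merged
    into the following one instead of becoming references."""
    chunks = []
    for part in raw.split(';'):
        part = part.strip()
        if chunks and not re.search(r'\d', chunks[-1]):
            chunks[-1] += '; ' + part
        else:
            chunks.append(part)
    return chunks
```

On the example above this yields a single reference, while a genuine two-reference line still splits in two.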
Good day,
I have an extract.py file that takes a PDF location as a parameter and runs the extraction. This works fine on Arch Linux; however, on Debian 9 (or 10, up to date) it goes into an infinite loop.
The command I am trying to run:
./extract.py /tmp/p_3f4b8d2131dca8b1e1890d1b890ceb26.pdf
extract.py source:
import sys
from refextract import extract_references_from_file

if len(sys.argv) != 2:
    sys.exit()

references = extract_references_from_file(sys.argv[1])
When I Ctrl+C the process, it gives the following output:
^CTraceback (most recent call last):
  File "./extract.py", line 9, in <module>
    references = extract_references_from_file(sys.argv[1])
  File "/home/sspm/.local/lib/python3.7/site-packages/refextract/references/api.py", line 139, in extract_references_from_file
    override_kbs_files=override_kbs_files,
  File "/home/sspm/.local/lib/python3.7/site-packages/refextract/references/engine.py", line 1456, in parse_references
    parse_references_elements(reference_lines, kbs, linker_callback)
  File "/home/sspm/.local/lib/python3.7/site-packages/refextract/references/engine.py", line 878, in parse_references_elements
    clean_line, kbs, bad_titles_count, linker_callback)
  File "/home/sspm/.local/lib/python3.7/site-packages/refextract/references/engine.py", line 635, in parse_reference_line
    bad_titles_count)
  File "/home/sspm/.local/lib/python3.7/site-packages/refextract/references/tag.py", line 174, in tag_reference_line
    kbs=kbs,
  File "/home/sspm/.local/lib/python3.7/site-packages/refextract/references/tag.py", line 326, in process_reference_line
    tagged_line = identify_and_tag_authors(tagged_line, kbs['authors'])
  File "/home/sspm/.local/lib/python3.7/site-packages/refextract/references/tag.py", line 881, in identify_and_tag_authors
    re_auth, re_auth_near_miss = get_author_regexps()
  File "/home/sspm/.local/lib/python3.7/site-packages/refextract/authors/regexs.py", line 470, in get_author_regexps
    re.VERBOSE | re.UNICODE))
  File "/usr/lib/python3.7/re.py", line 234, in compile
    return _compile(pattern, flags)
  File "/usr/lib/python3.7/re.py", line 286, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/lib/python3.7/sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)
  File "/usr/lib/python3.7/sre_parse.py", line 930, in parse
    p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
  File "/usr/lib/python3.7/sre_parse.py", line 426, in _parse_sub
    not nested and not items))
  File "/usr/lib/python3.7/sre_parse.py", line 816, in _parse
    p = _parse_sub(source, state, sub_verbose, nested + 1)
  File "/usr/lib/python3.7/sre_parse.py", line 426, in _parse_sub
    not nested and not items))
  File "/usr/lib/python3.7/sre_parse.py", line 816, in _parse
    p = _parse_sub(source, state, sub_verbose, nested + 1)
  File "/usr/lib/python3.7/sre_parse.py", line 426, in _parse_sub
    not nested and not items))
  File "/usr/lib/python3.7/sre_parse.py", line 816, in _parse
    p = _parse_sub(source, state, sub_verbose, nested + 1)
  File "/usr/lib/python3.7/sre_parse.py", line 426, in _parse_sub
    not nested and not items))
  File "/usr/lib/python3.7/sre_parse.py", line 816, in _parse
    p = _parse_sub(source, state, sub_verbose, nested + 1)
  File "/usr/lib/python3.7/sre_parse.py", line 426, in _parse_sub
    not nested and not items))
  File "/usr/lib/python3.7/sre_parse.py", line 587, in _parse
    set = _uniq(set)
  File "/usr/lib/python3.7/sre_parse.py", line 413, in _uniq
    if item not in newitems:
KeyboardInterrupt
Additional information:
Python version: 3.7
pdftotext version: 0.71.0 (this extracts text from the PDF just fine)
One of the PDF files I used (one of many; it hangs on every one, but every one is successfully processed on Arch Linux):
a.pdf
refextract fails on journal references that include a month, e.g.
$$mIEEE Trans. Appl. Supercond., vol. 22, no. 3, Jun., Art. no. 6600804
IEEE journals are quite commonly cited that way, in particular in the IEEE journals themselves.
It doesn't work well with author names containing characters outside a-zA-Z.
Given the PDF available at: http://arxiv.org/pdf/1710.01077 refextract crashes in PyPDF2 code:
Traceback (most recent call last):
File "/opt/inspire/lib/python2.7/site-packages/workflow/engine.py", line 529, in _process
self.run_callbacks(callbacks, objects, obj)
File "/opt/inspire/lib/python2.7/site-packages/workflow/engine.py", line 465, in run_callbacks
indent + 1)
File "/opt/inspire/lib/python2.7/site-packages/workflow/engine.py", line 465, in run_callbacks
indent + 1)
File "/opt/inspire/lib/python2.7/site-packages/workflow/engine.py", line 481, in run_callbacks
self.execute_callback(callback_func, obj)
File "/opt/inspire/lib/python2.7/site-packages/workflow/engine.py", line 564, in execute_callback
callback(obj, self)
File "/opt/inspire/src/inspire/inspirehep/modules/workflows/utils.py", line 135, in _decorator
res = func(*args, **kwargs)
File "/opt/inspire/src/inspire/inspirehep/modules/workflows/tasks/actions.py", line 238, in refextract
references = extract_references(uri, source)
File "/opt/inspire/lib/python2.7/site-packages/timeout_decorator/timeout_decorator.py", line 81, in new_function
return function(*args, **kwargs)
File "/opt/inspire/src/inspire/inspirehep/modules/workflows/tasks/refextract.py", line 95, in extract_references
reference_format=u'{title},{volume},{page}'
File "/opt/inspire/lib/python2.7/site-packages/refextract/references/api.py", line 149, in extract_references_from_file
texkeys = extract_texkeys_from_pdf(path)
File "/opt/inspire/lib/python2.7/site-packages/refextract/references/pdf.py", line 54, in extract_texkeys_from_pdf
pdf = PdfFileReader(pdf_stream, strict=False)
File "/opt/inspire/lib/python2.7/site-packages/PyPDF2/pdf.py", line 1084, in __init__
self.read(stream)
File "/opt/inspire/lib/python2.7/site-packages/PyPDF2/pdf.py", line 1803, in read
idnum, generation = self.readObjectHeader(stream)
File "/opt/inspire/lib/python2.7/site-packages/PyPDF2/pdf.py", line 1667, in readObjectHeader
return int(idnum), int(generation)
ValueError: invalid literal for int() with base 10: 'f'
It should instead handle the exception and continue without extracting TeXKeys.
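A sketch of that behavior (`extractor` stands in for refextract's extract_texkeys_from_pdf; the wrapper shape is my assumption):

```python
import logging

LOGGER = logging.getLogger('refextract')

def extract_texkeys_safely(path, extractor):
    """Any failure inside the PDF parser degrades to an empty texkey
    list instead of aborting the whole reference extraction."""
    try:
        return extractor(path)
    except Exception:
        LOGGER.warning('Could not extract texkeys from %s', path)
        return []
```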
Hi all,
Has anyone experienced the following issue in Python 3.8 with version 1.1.2 (on macOS), or know what's causing it?
from refextract import extract_references_from_url
references = extract_references_from_url('https://arxiv.org/pdf/1503.07589.pdf')
File "/Users/xxx/anaconda3/lib/python3.8/site-packages/refextract/references/engine.py", line 1412, in clean_pdf_file
mmfile.resize(end + offset - start)
SystemError: mmap: resizing not available--no mremap()
What's surprising is that on line 1412 of engine.py, the mmfile object does appear to have a resize method. Unlike other methods such as flush(), however, its execution cannot be completed. I cannot find the mremap function anywhere either.
Thanks