inspirehep / refextract

Extract bibliographic references from (High-Energy Physics) articles.

License: GNU General Public License v2.0

Python 99.66% Shell 0.26% Dockerfile 0.08%

refextract's Introduction

refextract

About

A small library for extracting references used in scholarly communication.

Install

$ pip install refextract

Usage

To get structured information from a publication reference:

>>> from refextract import extract_journal_reference
>>> reference = extract_journal_reference('J.Phys.,A39,13445')
>>> print(reference)
{
    'extra_ibids': [],
    'is_ibid': False,
    'misc_txt': u'',
    'page': u'13445',
    'title': u'J. Phys.',
    'type': 'JOURNAL',
    'volume': u'A39',
    'year': '',
}

To extract references from a PDF:

>>> from refextract import extract_references_from_file
>>> references = extract_references_from_file('1503.07589.pdf')
>>> print(references[0])
{
    'author': [u'F. Englert and R. Brout'],
    'doi': [u'doi:10.1103/PhysRevLett.13.321'],
    'journal_page': [u'321'],
    'journal_reference': [u'Phys. Rev. Lett. 13 (1964) 321'],
    'journal_title': [u'Phys. Rev. Lett.'],
    'journal_volume': [u'13'],
    'journal_year': [u'1964'],
    'linemarker': [u'1'],
    'raw_ref': [u'[1] F. Englert and R. Brout, \u201cBroken symmetry and the mass of gauge vector mesons\u201d, Phys. Rev. Lett. 13 (1964) 321, doi:10.1103/PhysRevLett.13.321.'],
    'texkey': [u'Englert:1964et'],
    'year': [u'1964'],
}

To extract directly from a URL:

>>> from refextract import extract_references_from_url
>>> references = extract_references_from_url('https://arxiv.org/pdf/1503.07589.pdf')
>>> print(references[0])
{
    'author': [u'F. Englert and R. Brout'],
    'doi': [u'doi:10.1103/PhysRevLett.13.321'],
    'journal_page': [u'321'],
    'journal_reference': [u'Phys. Rev. Lett. 13 (1964) 321'],
    'journal_title': [u'Phys. Rev. Lett.'],
    'journal_volume': [u'13'],
    'journal_year': [u'1964'],
    'linemarker': [u'1'],
    'raw_ref': [u'[1] F. Englert and R. Brout, \u201cBroken symmetry and the mass of gauge vector mesons\u201d, Phys. Rev. Lett. 13 (1964) 321, doi:10.1103/PhysRevLett.13.321.'],
    'texkey': [u'Englert:1964et'],
    'year': [u'1964'],
}

Notes

refextract depends on pdftotext.

Acknowledgments

refextract is based on code and ideas from the following people, who contributed to the docextract module in Invenio:

Alessio Deiana
Federico Poli
Gerrit Rindermann
Graham R. Armstrong
Grzegorz Szpura
Jan Aage Lavik
Javier Martin Montull
Micha Moskovic
Samuele Kaplun
Thorsten Schwander
Tibor Simko

License

GPLv2

refextract's Issues

Microservice

It should be possible to run refextract in microservice mode: a POST request containing a reference returns the parsed JSON.

TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType

I get the following error when trying out the example code from the refextract docs. I will explain my system below.

Error: TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType

Installation Used

I installed with pip install refextract from the terminal on macOS 10.11.6 (15G22010). The installation succeeded, although I had to install libmagic manually with brew install libmagic because I was getting an error initially.

Usage Used

I tried first,

from refextract import extract_references_from_file
references = extract_references_from_file('some-local-filename.pdf')
print(references)

and got the following error:

TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType

Then, similar to the example code from the docs, I changed the code to,

from refextract import extract_references_from_file
references = extract_references_from_file('https://arxiv.org/pdf/1503.07589.pdf')
print(references)

which is the same error - TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType

extract_references_from_file returns inconsistent data

Given this document Kotti et al. - 2023 - Machine Learning for Software Engineering A Terti.pdf

Expectation

I would expect:

only one reference to be found per line number
even if it returns several reference objects for the same line, r1 == r2 should hold for any r1, r2 from the refs with the same lineno, and in particular r1['title'] == r2['title']

Actual

the refs found contain multiple contradictory results.

ie.

Replicate me

install pytest-subtests.
call with the document attached above

# with subtests from pytest-subtests
from refextract import extract_references_from_file


def test_reference_consistency(path, subtests):
    """
    Ensure that for each line in the file, there are no inconsistent duplicate references.

    If two references r1 and r2 share a line number, then r1 == r2 must hold.
    """
    refs = extract_references_from_file(path)

    # Group the references by line number
    lines = {}
    for ref in refs:
        lineno = ref['linemarker'][0]
        lines.setdefault(lineno, []).append(ref)

    # Check for inconsistent duplicate references on each line
    for lineno, line_refs in lines.items():
        if len(line_refs) == 1:
            continue

        with subtests.test('line', lineno=lineno, refs=line_refs):
            # Check that each adjacent pair of references on the line is identical
            for ref1, ref2 in zip(line_refs, line_refs[1:]):
                assert ref1 == ref2, f"Found inconsistent references: {ref1} and {ref2}"

pdf: bug in extract_texkeys_from_pdf

Sentry: https://sentry.cern.ch/inspire-sentry/inspire-labs/group/822854/

CC: @michamos

Issue with non a-zA-Z author names

It doesn't work well with non a-zA-Z author names.

TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType

Hello. My code is:

from refextract import extract_references_from_file
import os

# --- main ---

path = r"E:\finance Python\2022 business\1226 pdf\42_56"
name="test.pdf"
file=name
print(file)
st = os.stat(file)
print(st)
references = extract_references_from_file(os.path.join(path, name))
print(references[0])

Unfortunately, I don't know why the path causes an error. I also changed the path to "test.pdf", but it still does not work. Please help!

year is taken out of report numbers

The year is often taken out of report numbers, resulting in invalid report numbers.
There is a whitelist of report numbers in https://github.com/inspirehep/refextract/blob/master/refextract/references/kbs/report-numbers.kb which needs updating:

write script to find missing report numbers (in invenio-scripts/unrecognized_report_numbers.py)
update kbs/report-numbers.kb with results
write bibcheck rule to correct existing records

mmap resize unavailable

Hi
I am running refextract on MacOS Sonoma 14.0.
I had a consistent error with mmap resize in clean_pdf_file in engine.py.
To fix it, I removed the if elif then block in the function and used flush to do all the work as follows:
mmfile.flush(start, end + offset - start).
I believe that should do the work without invoking the system dependent resize and move.
Please let me know what you think.
Best wishes.
Ahmed

SyntaxError: invalid syntax using python 3.5

Hi,

I tried refextract with Python 3.5, but it appears to be incompatible: Python 3.5 does not accept the ur'' raw-unicode prefix, since strings are Unicode by default (see http://stackoverflow.com/a/27482285/108301). A suggested fix would be to use the u function from six.

Traceback (most recent call last):
  File "cli.py", line 1, in <module>
    from refextract import extract_references_from_file
  File "C:\Python35\lib\site-packages\refextract\__init__.py", line 28, in <module>
    from .references.api import (
  File "C:\Python35\lib\site-packages\refextract\references\api.py", line 37, in <module>
    from .engine import (get_kbs,
  File "C:\Python35\lib\site-packages\refextract\references\engine.py", line 166
    re_report = re.compile(ur'^(?P<name>[A-Z-]+)(?P<nums>[\d-]+)$', re.UNICODE)

SyntaxError: invalid syntax
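
For illustration, the pattern from the traceback compiles on Python 3 once the `ur` prefix is replaced by a plain raw string. This is a minimal sketch rather than the project's actual patch, and the sample report number is made up for the demonstration.

```python
import re

# Same pattern as engine.py line 166 in the traceback, with r'' instead of ur''.
# On Python 3 every str is Unicode, so the u prefix is unnecessary.
re_report = re.compile(r'^(?P<name>[A-Z-]+)(?P<nums>[\d-]+)$', re.UNICODE)

# 'CERN-2013-001' is a made-up sample input, used only to exercise the pattern.
match = re_report.match('CERN-2013-001')
print(match.group('name'), match.group('nums'))  # prints: CERN- 2013-001
```

On Python 2.7 the same literal stays a Unicode string if the module begins with `from __future__ import unicode_literals`, so one spelling can serve both interpreters.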

TypeError: coercing to Unicode

Hi guys,
I installed refextract and bumped into this error. It happens no matter what file I am trying to open, whether from source or from a URL. For instance:
from refextract import extract_references_from_url
reference = extract_references_from_url("http://arxiv.org/pdf/1503.07589v1.pdf")
Any ideas how to solve it?

The error:
`---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
in ()
1 from refextract import extract_references_from_url
----> 2 reference = extract_references_from_url("http://arxiv.org/pdf/1503.07589v1.pdf")

/Users/pacausd1/anaconda/envs/python2/lib/python2.7/site-packages/refextract/references/api.pyc in extract_references_from_url(url, headers, chunk_size, **kwargs)
92 for chunk in req.iter_content(chunk_size):
93 f.write(chunk)
---> 94 references = extract_references_from_file(filepath, **kwargs)
95 except requests.exceptions.HTTPError as e:
96 raise FullTextNotAvailableError("URL not found: '{0}'".format(url)), None, sys.exc_info()[2]

/Users/pacausd1/anaconda/envs/python2/lib/python2.7/site-packages/refextract/references/api.pyc in extract_references_from_file(path, recid, reference_format, linker_callback, override_kbs_files)
132 raise FullTextNotAvailableError("File not found: '{0}'".format(path))
133
--> 134 docbody = get_plaintext_document_body(path)
135 reflines, dummy, dummy = extract_references_from_fulltext(docbody)
136 if not reflines:

/Users/pacausd1/anaconda/envs/python2/lib/python2.7/site-packages/refextract/references/engine.pyc in get_plaintext_document_body(fpath, keep_layout)
1400
1401 elif mime_type == "application/pdf":
-> 1402 textbody = convert_PDF_to_plaintext(fpath, keep_layout)
1403
1404 else:

/Users/pacausd1/anaconda/envs/python2/lib/python2.7/site-packages/refextract/documents/pdf.pyc in convert_PDF_to_plaintext(fpath, keep_layout)
487 into plaintext; each string is a line in the document.)
488 """
--> 489 if not os.path.isfile(CFG_PATH_PDFTOTEXT):
490 raise FileNotFoundError('Missing pdftotext executable')
491

/Users/pacausd1/anaconda/envs/python2/lib/python2.7/genericpath.pyc in isfile(path)
35 """Test whether a path is a regular file"""
36 try:
---> 37 st = os.stat(path)
38 except os.error:
39 return False

TypeError: coercing to Unicode: need string or buffer, NoneType found`

Populate journal DB from INSPIRE Journal collection

See: inspirehep/invenio@df7a34f

authors: catastrophic backtracking in regex

How to reproduce:

>>> from refextract import extract_references_from_string
>>> extract_references_from_string('G. W. and L. B. and M. M. G. and T. A. and E. L. I. and E. P. and X. M. and B. Urbaszek, Magneto-optics in transition metal diselenide monolayers. 2D Mater. 2, 34002 (2015).')

this hangs refextract for, at least, days.

The reason appears to be catastrophic backtracking in this regex:

refextract/refextract/authors/regexs.py

Lines 491 to 494 in 27588da

 re_weaker_author = ur""" 

  ## look closely for initials, and less closely at the last name. 

  (?:([A-Z]((\.\s?)|(\.?\s+)|(\-))){1,5} 

  (?:[^\s_<>0-9]+(?:(?:[,\.]\s*)|(?:[,\.]?\s+)))+)"""
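
The effect can be reproduced on a toy scale. The sketch below does not use the refextract regex; it uses the textbook pattern `(a+)+$`, chosen here as a stand-in because it has the same ingredient: a quantified group whose body is itself quantified, so the engine can split the input among repetitions in exponentially many ways before giving up. The helper name `time_failed_match` is hypothetical.

```python
import re
import time

NESTED = re.compile(r'(a+)+$')  # nested quantifiers: many ways to split a run of 'a'
FLAT = re.compile(r'a+$')       # accepts the same strings, with only one way to split


def time_failed_match(pattern, text):
    """Return (match_or_None, seconds_spent) for pattern.match(text)."""
    started = time.perf_counter()
    result = pattern.match(text)
    return result, time.perf_counter() - started


# The trailing 'b' makes every attempt fail, which forces full backtracking.
# Keep n small: the nested pattern's running time grows exponentially with n.
for n in (12, 14, 16, 18):
    text = 'a' * n + 'b'
    _, nested_seconds = time_failed_match(NESTED, text)
    _, flat_seconds = time_failed_match(FLAT, text)
    print(n, round(nested_seconds, 4), round(flat_seconds, 6))
```

The timings depend on the machine, so none are quoted here; the point is only the growth of the first column of timings relative to the second.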

DOIs on multiple lines are split up

We noticed with @kaplun that a DOI is sometimes broken up by TeX and appears on multiple lines. This is the default behavior of the url LaTeX package (which is often used to typeset URLs and is used internally by the hyperref package), contrary to what @tsgit said earlier. In that case, only the part of the DOI on the first line is taken as a DOI, which is of course wrong.

The default behavior (in the \UrlBreaks and \UrlBigBreaks macros in url.sty) is to allow line breaks after:
. @ \ / ! _ | ; > ] ) , ? & ' + = # : and also after - if the [hyphens] option is passed to the package.

Unicode error thrown when parsing references

Got a Unicode error when parsing arXiv 1701.04322

[2017-01-18 14:56:30,362: ERROR/MainProcess] Task invenio_workflows.tasks.start[c031ad62-e178-43d7-a6a5-c288ca3a1da0] raised unexpected: UnicodeEncodeError('ascii', u'* Unknown citation found. Searching for book title in: , General topology. Mathematical Monographs, Vol. 60, PWN\u2014 Polish Scientific Publishers, Warsaw, (1977).', 112, 113, 'ordinal not in range(128)')
Traceback (most recent call last):
  File "/opt/inspire/lib/python2.7/site-packages/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/opt/inspire/src/flask-celeryext/flask_celeryext/app.py", line 52, in __call__
    return Task.__call__(self, *args, **kwargs)
  File "/opt/inspire/lib/python2.7/site-packages/celery/app/trace.py", line 438, in __protected_call__
    return self.run(*args, **kwargs)
  File "/opt/inspire/lib/python2.7/site-packages/invenio_workflows/tasks.py", line 77, in start
    return text_type(run_worker(workflow_name, data, **kwargs).uuid)
  File "/opt/inspire/lib/python2.7/site-packages/invenio_workflows/worker_engine.py", line 52, in run_worker
    engine.process(objects, **kwargs)
  File "/opt/inspire/src/workflow/workflow/engine.py", line 390, in process
    self._process(objects)
  File "/opt/inspire/src/workflow/workflow/engine.py", line 547, in _process
    obj, self, callbacks, exc_info
  File "/opt/inspire/lib/python2.7/site-packages/invenio_workflows/engine.py", line 364, in Exception
    obj, eng, callbacks, exc_info
  File "/opt/inspire/src/workflow/workflow/engine.py", line 970, in Exception
    reraise(*exc_info)
  File "/opt/inspire/src/workflow/workflow/engine.py", line 529, in _process
    self.run_callbacks(callbacks, objects, obj)
  File "/opt/inspire/src/workflow/workflow/engine.py", line 465, in run_callbacks
    indent + 1)
  File "/opt/inspire/src/workflow/workflow/engine.py", line 465, in run_callbacks
    indent + 1)
  File "/opt/inspire/src/workflow/workflow/engine.py", line 481, in run_callbacks
    self.execute_callback(callback_func, obj)
  File "/opt/inspire/src/workflow/workflow/engine.py", line 564, in execute_callback
    callback(obj, self)
  File "/opt/inspire/src/inspire/inspirehep/modules/workflows/tasks/arxiv.py", line 134, in arxiv_refextract
    mapped_references = extract_references(pdf.file.uri)
  File "/opt/inspire/src/inspire/inspirehep/modules/refextract/tasks.py", line 90, in extract_references
    reference_format="{title},{volume},{page}",
  File "/opt/inspire/lib/python2.7/site-packages/refextract/references/api.py", line 140, in extract_references_from_file
    override_kbs_files=override_kbs_files,
  File "/opt/inspire/lib/python2.7/site-packages/refextract/references/engine.py", line 1358, in parse_references
    parse_references_elements(reference_lines, kbs, linker_callback)
  File "/opt/inspire/lib/python2.7/site-packages/refextract/references/engine.py", line 819, in parse_references_elements
    ref_line, kbs, bad_titles_count, linker_callback)
  File "/opt/inspire/lib/python2.7/site-packages/refextract/references/engine.py", line 630, in parse_reference_line
    look_for_undetected_books(splitted_citations, kbs)
  File "/opt/inspire/lib/python2.7/site-packages/refextract/references/engine.py", line 659, in look_for_undetected_books
    search_for_book_in_misc(citation, kbs)
  File "/opt/inspire/lib/python2.7/site-packages/refextract/references/engine.py", line 668, in search_for_book_in_misc
    citation_element['misc_txt'])
  File "/usr/lib64/python2.7/socket.py", line 316, in write
    data = str(data) # XXX Should really reject non-string non-buffers
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2014' in position 112: ordinal not in range(128)

Using refextract==0.1.0.

non-ascii journal titles

https://sentry.cern.ch/inspire-sentry/inspire-labs/group/820653/

Infinite loop on Debian

Good day,

I have an extract.py file that takes a PDF location as a parameter and runs the extraction. This works fine on Arch Linux,
however on Debian 9 (or 10, up to date) it goes into an infinite loop.

command I am trying to run :
./extract.py /tmp/p_3f4b8d2131dca8b1e1890d1b890ceb26.pdf

extract.py source:
import sys
from refextract import extract_references_from_file

if len(sys.argv) != 2:
    sys.exit()

references = extract_references_from_file(sys.argv[1])

when I ctrl+c the cycle, it gives the following output:
^CTraceback (most recent call last):
  File "./extract.py", line 9, in <module>
    references = extract_references_from_file(sys.argv[1])
  File "/home/sspm/.local/lib/python3.7/site-packages/refextract/references/api.py", line 139, in extract_references_from_file
    override_kbs_files=override_kbs_files,
  File "/home/sspm/.local/lib/python3.7/site-packages/refextract/references/engine.py", line 1456, in parse_references
    parse_references_elements(reference_lines, kbs, linker_callback)
  File "/home/sspm/.local/lib/python3.7/site-packages/refextract/references/engine.py", line 878, in parse_references_elements
    clean_line, kbs, bad_titles_count, linker_callback)
  File "/home/sspm/.local/lib/python3.7/site-packages/refextract/references/engine.py", line 635, in parse_reference_line
    bad_titles_count)
  File "/home/sspm/.local/lib/python3.7/site-packages/refextract/references/tag.py", line 174, in tag_reference_line
    kbs=kbs,
  File "/home/sspm/.local/lib/python3.7/site-packages/refextract/references/tag.py", line 326, in process_reference_line
    tagged_line = identify_and_tag_authors(tagged_line, kbs['authors'])
  File "/home/sspm/.local/lib/python3.7/site-packages/refextract/references/tag.py", line 881, in identify_and_tag_authors
    re_auth, re_auth_near_miss = get_author_regexps()
  File "/home/sspm/.local/lib/python3.7/site-packages/refextract/authors/regexs.py", line 470, in get_author_regexps
    re.VERBOSE | re.UNICODE))
  File "/usr/lib/python3.7/re.py", line 234, in compile
    return _compile(pattern, flags)
  File "/usr/lib/python3.7/re.py", line 286, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/usr/lib/python3.7/sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)
  File "/usr/lib/python3.7/sre_parse.py", line 930, in parse
    p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)
  File "/usr/lib/python3.7/sre_parse.py", line 426, in _parse_sub
    not nested and not items))
  File "/usr/lib/python3.7/sre_parse.py", line 816, in _parse
    p = _parse_sub(source, state, sub_verbose, nested + 1)
  File "/usr/lib/python3.7/sre_parse.py", line 426, in _parse_sub
    not nested and not items))
  File "/usr/lib/python3.7/sre_parse.py", line 816, in _parse
    p = _parse_sub(source, state, sub_verbose, nested + 1)
  File "/usr/lib/python3.7/sre_parse.py", line 426, in _parse_sub
    not nested and not items))
  File "/usr/lib/python3.7/sre_parse.py", line 816, in _parse
    p = _parse_sub(source, state, sub_verbose, nested + 1)
  File "/usr/lib/python3.7/sre_parse.py", line 426, in _parse_sub
    not nested and not items))
  File "/usr/lib/python3.7/sre_parse.py", line 816, in _parse
    p = _parse_sub(source, state, sub_verbose, nested + 1)
  File "/usr/lib/python3.7/sre_parse.py", line 426, in _parse_sub
    not nested and not items))
  File "/usr/lib/python3.7/sre_parse.py", line 587, in _parse
    set = _uniq(set)
  File "/usr/lib/python3.7/sre_parse.py", line 413, in _uniq
    if item not in newitems:
KeyboardInterrupt

Additional information:
Python version: 3.7
pdftotext version: 0.71.0 (this extracts text from the pdf just fine)
one of the pdf files that I used (one of many, it hangs on every one, but every one is successfully processed on Arch linux):
a.pdf

Use logging library

Currently refextract is full of print() statements, in order to inform about the current execution.
Best practice would be instead to simply use the logging library.
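
A minimal sketch of what that could look like, assuming a module-level logger named `LOGGER` (the name that appears in the `clean_pdf_file` snippet quoted in another issue). The logger name string and the helper `report_unknown_citation` are hypothetical; the message text is borrowed from the log line quoted in the Unicode issue.

```python
import logging

# Hypothetical logger name; a real module would normally pass __name__.
LOGGER = logging.getLogger("refextract.example")


def report_unknown_citation(misc_txt):
    """Log the progress message instead of printing it."""
    # %-style arguments are only formatted if DEBUG is enabled for this logger.
    LOGGER.debug("Unknown citation found. Searching for book title in: %s", misc_txt)


# The application, not the library, decides where messages go and at what level.
logging.basicConfig(level=logging.DEBUG)
report_unknown_citation("General topology. Mathematical Monographs, Vol. 60")
```

With this arrangement a caller can silence the library by raising the level on the logger, which is not possible with bare print() calls.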

Page number taken as year when the page number has 4 digits

Sometimes the code takes a page number as the year. For example in this case:

Y. Fu, H. Liu, X. Su, Y. Mi, and S. Tian, "Probabilistic direct load flow algorithm for unbalanced distribution networks considering uncertainties of PV and load," IET Renewable Power Generation, vol. 13, no. 11, pp. 1968-1980, 2019.

I'm getting that the publication year is 1968, instead of 2019.

Any help in getting through this issue will be greatly appreciated

refextract: authors separated by semicolons

References like "T. Venumadhav; F.-Y. Cyr-Racine; K. N. Abazajian; and C. M. Hirata: Sterile neutrino dark matter: Weak interactions in the strong coupling epoch, Phys. Rev. D94, 043515 (2016), 1507.06655." are split at every ";", creating additional nonsense references with just an author.
I propose not splitting off strings that contain no number.
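
The proposal can be sketched as follows. This is only an illustration of the heuristic described in this issue, not refextract's splitting code; the function name `split_on_semicolons` is hypothetical, and "contains a number" is interpreted here as "contains at least one digit".

```python
def split_on_semicolons(line):
    """Split at ';' but never split off a piece that contains no digit."""
    pieces = []
    pending = ""  # digit-free text waiting to be attached to what follows
    for part in line.split(";"):
        candidate = pending + part
        if any(ch.isdigit() for ch in candidate):
            pieces.append(candidate.strip())
            pending = ""
        else:
            # No number yet: keep accumulating, restoring the separator.
            pending = candidate + ";"
    if pending:
        tail = pending[:-1]  # drop the separator added after the last part
        if pieces:
            pieces[-1] += ";" + tail  # trailing digit-free text joins the last piece
        else:
            pieces.append(tail.strip())
    return pieces


example = ("T. Venumadhav; F.-Y. Cyr-Racine; K. N. Abazajian; and C. M. Hirata: "
           "Sterile neutrino dark matter: Weak interactions in the strong coupling "
           "epoch, Phys. Rev. D94, 043515 (2016), 1507.06655.")
print(len(split_on_semicolons(example)))  # prints: 1
```

Two citations that each carry numbers, separated by a semicolon, are still split in two under this rule.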

Syntax error in references/api.py line 96

You have a right parenthesis appearing in the wrong place on line 96 of api.py:

  File "[myhome]/anaconda/envs/py35/lib/python3.5/site-packages/refextract/references/api.py", line 96
    raise FullTextNotAvailableError("URL not found: '{0}'".format(url)), None, sys.exc_info()[2]
                                                                       ^
SyntaxError: invalid syntax

Should this instead read
raise FullTextNotAvailableError("URL not found: '{0}'".format(url), None, sys.exc_info()[2])
?
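
For context, the line in question uses the Python 2 three-argument raise statement (exception, value, traceback), which Python 3 removed; moving the parenthesis would instead pass the traceback as extra constructor arguments. A hedged sketch of the Python 3 spelling follows. The exception class is a local stand-in with the same name as the one in the traceback, and `fetch` with its simulated failure is hypothetical.

```python
import sys


class FullTextNotAvailableError(Exception):
    """Local stand-in for the exception named in the traceback."""


def fetch(url):
    try:
        # Simulated failure standing in for requests.exceptions.HTTPError.
        raise ValueError("simulated HTTP failure")
    except ValueError:
        # Python 3 equivalent of: raise Error(msg), None, sys.exc_info()[2]
        raise FullTextNotAvailableError(
            "URL not found: '{0}'".format(url)
        ).with_traceback(sys.exc_info()[2])


try:
    fetch("https://example.org/missing.pdf")
except FullTextNotAvailableError as error:
    print(error)  # prints: URL not found: 'https://example.org/missing.pdf'
```

Code that must run on both Python 2 and 3 cannot use either spelling directly; a helper such as `six.reraise` is the usual bridge.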

Import refextract fails

I am trying to import refextract using Python 2 and Python 3; both fail with the error below.

I installed it using pip install refextract as mentioned in readme file

Error in importing

(tensorflow_keras) C:\Users\anupamag\Desktop\PYTHON Examples>python paper_stats.py
Traceback (most recent call last):
  File "paper_stats.py", line 3, in <module>
    import refextract
  File "C:\Users\anupamag\AppData\Local\Continuum\Anaconda3\envs\tensorflow_keras\lib\site-packages\refextract\__init__.py", line 28, in <module>
    from .references.api import (
  File "C:\Users\anupamag\AppData\Local\Continuum\Anaconda3\envs\tensorflow_keras\lib\site-packages\refextract\references\api.py", line 96
    raise FullTextNotAvailableError("URL not found: '{0}'".format(url)), None, sys.exc_info()[2]
                                                                       ^
SyntaxError: invalid syntax

This error turns up simply while importing the module:

import sys
import re

from refextract import extract_journal_reference
from refextract import extract_references_from_file
from refextract import extract_references_from_url

reference = extract_references_from_url("https://arxiv.org/pdf/1704.06040.pdf")
print(reference)

I'm using Python 3.5.3. Is this a known issue?

Crash in TeXKeys extraction

Given the PDF available at: http://arxiv.org/pdf/1710.01077 refextract crashes in PyPDF2 code:

Traceback (most recent call last):
  File "/opt/inspire/lib/python2.7/site-packages/workflow/engine.py", line 529, in _process
    self.run_callbacks(callbacks, objects, obj)
  File "/opt/inspire/lib/python2.7/site-packages/workflow/engine.py", line 465, in run_callbacks
    indent + 1)
  File "/opt/inspire/lib/python2.7/site-packages/workflow/engine.py", line 465, in run_callbacks
    indent + 1)
  File "/opt/inspire/lib/python2.7/site-packages/workflow/engine.py", line 481, in run_callbacks
    self.execute_callback(callback_func, obj)
  File "/opt/inspire/lib/python2.7/site-packages/workflow/engine.py", line 564, in execute_callback
    callback(obj, self)
  File "/opt/inspire/src/inspire/inspirehep/modules/workflows/utils.py", line 135, in _decorator
    res = func(*args, **kwargs)
  File "/opt/inspire/src/inspire/inspirehep/modules/workflows/tasks/actions.py", line 238, in refextract
    references = extract_references(uri, source)
  File "/opt/inspire/lib/python2.7/site-packages/timeout_decorator/timeout_decorator.py", line 81, in new_function
    return function(*args, **kwargs)
  File "/opt/inspire/src/inspire/inspirehep/modules/workflows/tasks/refextract.py", line 95, in extract_references
    reference_format=u'{title},{volume},{page}'
  File "/opt/inspire/lib/python2.7/site-packages/refextract/references/api.py", line 149, in extract_references_from_file
    texkeys = extract_texkeys_from_pdf(path)
  File "/opt/inspire/lib/python2.7/site-packages/refextract/references/pdf.py", line 54, in extract_texkeys_from_pdf
    pdf = PdfFileReader(pdf_stream, strict=False)
  File "/opt/inspire/lib/python2.7/site-packages/PyPDF2/pdf.py", line 1084, in __init__
    self.read(stream)
  File "/opt/inspire/lib/python2.7/site-packages/PyPDF2/pdf.py", line 1803, in read
    idnum, generation = self.readObjectHeader(stream)
  File "/opt/inspire/lib/python2.7/site-packages/PyPDF2/pdf.py", line 1667, in readObjectHeader
    return int(idnum), int(generation)
ValueError: invalid literal for int() with base 10: 'f'

It should instead handle the exception and continue without extracting TeXKeys.
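
One way to express that behaviour is a small wrapper around the extraction call. This is a sketch under assumptions: the wrapper name `extract_texkeys_safely` is hypothetical, the extractor is passed in as a parameter so the example runs without PyPDF2, and returning an empty list on failure is a guess at what "continue without TeXKeys" should mean.

```python
import logging

LOGGER = logging.getLogger(__name__)


def extract_texkeys_safely(path, extractor):
    """Call extractor(path); on any failure, log it and return no TeXKeys."""
    try:
        return extractor(path)
    except Exception:
        # Broad on purpose: a malformed PDF can fail in many ways inside the
        # PDF library, and missing TeXKeys should not abort the extraction.
        LOGGER.exception("Could not extract TeXKeys from %s", path)
        return []


def failing_extractor(path):
    """Stand-in that fails the way the traceback above does."""
    return [int("f")]  # ValueError: invalid literal for int() with base 10: 'f'


print(extract_texkeys_safely("1710.01077.pdf", failing_extractor))  # prints: []
```

In refextract itself the `extractor` argument would correspond to `extract_texkeys_from_pdf`, the function named in the traceback.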

mmap: resizing not available

Hi all,

Has anyone experienced the following issue in Python 3.8 with version 1.1.2 (on macOS), or know what's causing it?

from refextract import extract_references_from_url
references = extract_references_from_url('https://arxiv.org/pdf/1503.07589.pdf')

  File "/Users/xxx/anaconda3/lib/python3.8/site-packages/refextract/references/engine.py", line 1412, in clean_pdf_file
    mmfile.resize(end + offset - start)
SystemError: mmap: resizing not available--no mremap()

What's surprising is that on line 1412 of engine.py, the mmfile object does appear to have the resize function variable. Unlike with other variables like flush(), however, its execution cannot be completed somehow. I cannot find the mremap function variable anywhere either.

Thanks

TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType

refextract: recognize DELPHI notes

From @annetteholtkamp on March 23, 2018 9:32

Expected Behavior

Report numbers like
DELPHI-2001-138 PHYS 909
or
DELPHI Note 2001-138/PHYS-909
should be recognized by refextract

Current Behavior

Currently these are put into $$m

Context

Victor has recently uploaded many DELPHI Notes but they are not gathering any citations.

Copied from original issue: inspirehep/inspire-next#3284
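
As a rough illustration of what recognizing both spellings involves, the sketch below matches the two examples from this issue with one regular expression. It is an assumption-laden toy: refextract normally recognizes report numbers through its knowledge-base files rather than a pattern like this, and the group names (`year`, `number`, `phys`) are guesses at what the fields mean.

```python
import re

# Covers "DELPHI-2001-138 PHYS 909" and "DELPHI Note 2001-138/PHYS-909".
DELPHI_NOTE = re.compile(
    r'DELPHI(?:\s+Note\s+|-)'
    r'(?P<year>\d{4})-(?P<number>\d+)'
    r'[\s/]PHYS[\s-](?P<phys>\d+)'
)


def parse_delphi_note(text):
    """Return (year, number, phys) as strings, or None if no note is found."""
    match = DELPHI_NOTE.search(text)
    if match is None:
        return None
    return match.group('year'), match.group('number'), match.group('phys')


print(parse_delphi_note('DELPHI-2001-138 PHYS 909'))
print(parse_delphi_note('DELPHI Note 2001-138/PHYS-909'))
```

Both calls yield the same three fields, which is the property a normalization step would need.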

URL encoded DOIs are not recognized

DOIs can be URLencoded, which has the effect of replacing / by %2f. Refextract fails on this.
@kaplun is working on a patch, but we also need to run it again somehow on existing records after #12 has been fixed (LRR DOIs get mangled)
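
Decoding before matching is a one-liner with the standard library. This is a sketch of the idea only, not the patch mentioned above; the helper name `decode_doi` is hypothetical, and the DOI is the one from the README example with its slash percent-encoded.

```python
from urllib.parse import unquote


def decode_doi(raw):
    """Undo URL encoding (for example %2f for /) before DOI matching."""
    return unquote(raw)


print(decode_doi('10.1103%2fPhysRevLett.13.321'))  # prints: 10.1103/PhysRevLett.13.321
```

On Python 2 the same function lives at `urllib.unquote`.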

refextract: month in pubnote

refextract fails on journal references that include a month, e.g.
$$mIEEE Trans. Appl. Supercond., vol. 22, no. 3, Jun., Art. no. 6600804

IEEE journals are quite commonly cited that way, in particular in the IEEE journals themselves

Don't split PTEP article IDs at a letter in the middle

At the moment
"[1] Iso S. and Orikasa Y. Prog. Theor. Exp. Phys. 2013, 023B08 (2013)."
is split into
"999C5 $$o1$$hIso S. and Orikasa Y.$$sPTEP,2013,023$$m08$$y2013"
but it should be
"999C5 $$o1$$hIso S. and Orikasa Y.$$sPTEP,2013,023B08$$y2013"

Make new release

refextract: volumes with dashes

refextract apparently doesn't understand dashes in volumes.

Nucl Phys A904-905 270c
becomes
Nucl.Phys.,A904,905

The search for
999c5s:Nucl.Phys.,A904,905
finds 192 records.

"http://arxiv.org/pdf/1307.2978.pdf has some dashes in the references. GROBID's output is pretty terrible in this case: <title/> M M Aggarwal <title level="m">Luo [STAR Collaboration], Nucl. Phys. A904-905, 911c (2013); L. Kumar [STAR Collaboration], Nucl. Phys. A904-905</title> (but at least is not mangling the dashes)."

Error in extract_references_from_file(path) method

trying to do

refextract.extract_references_from_file()

errors

TypeError                                 Traceback (most recent call last)
<ipython-input-39-4d7d70a5a8a2> in <module>()
      1 print(fnames[0])
----> 2 data = refextract.extract_references_from_file(fnames[0])

3 frames
/usr/local/lib/python3.6/dist-packages/refextract/references/api.py in extract_references_from_file(path, recid, reference_format, linker_callback, override_kbs_files)
    126         raise FullTextNotAvailableError(u"File not found: '{0}'".format(path))
    127 
--> 128     docbody = get_plaintext_document_body(path)
    129     reflines, dummy, dummy = extract_references_from_fulltext(docbody)
    130     if not reflines:

/usr/local/lib/python3.6/dist-packages/refextract/references/engine.py in get_plaintext_document_body(fpath, keep_layout)
   1399 
   1400     elif mime_type == "application/pdf":
-> 1401         textbody = convert_PDF_to_plaintext(fpath, keep_layout)
   1402 
   1403     else:

/usr/local/lib/python3.6/dist-packages/refextract/documents/pdf.py in convert_PDF_to_plaintext(fpath, keep_layout)
    455     into plaintext; each string is a line in the document.)
    456     """
--> 457     if not os.path.isfile(CFG_PATH_PDFTOTEXT):
    458         raise IOError('Missing pdftotext executable')
    459 

/usr/lib/python3.6/genericpath.py in isfile(path)
     28     """Test whether a path is a regular file"""
     29     try:
---> 30         st = os.stat(path)
     31     except OSError:
     32         return False

TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType
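
The traceback shows `CFG_PATH_PDFTOTEXT` reaching `os.path.isfile` as `None`, and the README notes that refextract depends on pdftotext, so a likely cause is that the executable is not installed. How refextract computes that constant is not shown here; the sketch below simply assumes a lookup on `PATH`, and the helper name `find_executable` is hypothetical.

```python
import shutil


def find_executable(name="pdftotext"):
    """Return the full path of an executable on PATH, or raise a clear error."""
    path = shutil.which(name)
    if path is None:
        raise RuntimeError(
            "{0} was not found on PATH; install it before calling refextract".format(name)
        )
    return path


try:
    print(find_executable())
except RuntimeError as error:
    print(error)
```

pdftotext is distributed as part of Poppler, so installing the platform's Poppler package (for instance `poppler-utils` on Debian, or `brew install poppler` with Homebrew) is the usual remedy.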

clean_pdf_file throws SystemError on MacOS with mmap: resizing not available

When running extract_references_from_file(path) on this file
Kotti et al. - 2023 - Machine Learning for Software Engineering A Terti.pdf
on MacOS Ventura 13.3.1 the following exception gets thrown.


    def clean_pdf_file(filename):
        """
        strip leading and/or trailing junk from a PDF file
        """
        with open(filename, 'r+b') as file, mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_WRITE) as mmfile:
            start = mmfile.find(b'%PDF-')
            if start == -1:
                # no PDF marker found
                LOGGER.debug('not a PDF file')
                return
            end = mmfile.rfind(b'%%EOF')
            offset = len(b'%%EOF')
            if start > 0:
                LOGGER.debug('moving and truncating')
                mmfile.move(0, start, end + offset - start)
                #mmfile.resize(end + offset - start)
                mmfile.flush()
            elif end > 0 and end + offset != mmfile.size():
                LOGGER.debug('truncating only')
>               mmfile.resize(end + offset - start)
E               SystemError: mmap: resizing not available--no mremap()

../venv/lib/python3.10/site-packages/refextract/references/engine.py:1412: SystemError
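
The error message says the failure comes from `mmap.resize()` on a platform without `mremap()`. A sketch of the same clean-up without `mmap`, reading the file into memory and rewriting it, is given below. It is not the project's fix: the function name `strip_pdf_junk` is hypothetical, and loading the whole file is an assumption that only suits PDFs that fit in memory.

```python
def strip_pdf_junk(filename):
    """Drop bytes before '%PDF-' and after the last '%%EOF'.

    Returns True if the file was rewritten, False if it was left untouched.
    """
    with open(filename, 'rb') as handle:
        data = handle.read()  # whole file in memory; acceptable for a sketch

    start = data.find(b'%PDF-')
    if start == -1:
        return False  # no PDF marker found, leave the file alone

    end = data.rfind(b'%%EOF')
    if end == -1 or end < start:
        stop = len(data)  # no usable end marker: keep everything after the header
    else:
        stop = end + len(b'%%EOF')

    if start == 0 and stop == len(data):
        return False  # nothing to strip

    with open(filename, 'wb') as handle:
        handle.write(data[start:stop])
    return True
```

Because the file is reopened in write mode, the truncation is done by the operating system's ordinary file API and no memory-map resize is involved.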

Refextract fails to extract from two-columned layout pdf

The input PDF has a two-column layout. Refextract outputs an empty array of references.

from refextract import extract_references_from_url
references = extract_references_from_url('https://arxiv.org/pdf/1710.11035.pdf')
print(references[0])

The input PDF has a one-column layout. Refextract works fine.

from refextract import extract_references_from_url
references = extract_references_from_url('https://arxiv.org/pdf/1509.03588.pdf')
print(references[0])

How can I get refextract to parse both types of layout?

Thank you.

refextract: recognize system identifiers

Sometimes a url contains a system identifier, e.g.
https://cds.cern.ch/record/2064383
in https://inspirehep.net/record/1611588
refextract could isolate the system id and put it with appropriate prefix into a designated field.
Create a new field for external identifiers? Or put it into the same field as a DOI?
This system id can then be used to link the reference to a record.
We have more than 3000 records with a CDS url in the refs.

Similarly for ADS (~1000 records):
e.g. http://adsabs.harvard.edu/abs/1990ApJ...360..242S
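
A sketch of how such identifiers could be isolated, using the two URL shapes quoted above. The prefixes `CDS` and `ADS`, the function name `find_system_ids`, and the (prefix, value) output format are all assumptions for illustration; the issue leaves the target field open. The ADS pattern relies on bibcodes being 19 characters long.

```python
import re

# (prefix, pattern) pairs; the prefixes are hypothetical labels.
SYSTEM_ID_PATTERNS = [
    ('CDS', re.compile(r'https?://cds\.cern\.ch/record/(\d+)')),
    ('ADS', re.compile(r'https?://adsabs\.harvard\.edu/abs/(\S{19})')),
]


def find_system_ids(text):
    """Return a list of (prefix, identifier) pairs found in a reference string."""
    found = []
    for prefix, pattern in SYSTEM_ID_PATTERNS:
        for value in pattern.findall(text):
            found.append((prefix, value))
    return found


print(find_system_ids('https://cds.cern.ch/record/2064383'))
print(find_system_ids('http://adsabs.harvard.edu/abs/1990ApJ...360..242S'))
```

Keeping the prefix next to the value leaves open whether the result goes into a new external-identifier field or alongside the DOI.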
