
INSPIRE-DoJSON


About

INSPIRE-specific rules to transform from MARCXML to JSON and back.

License: GNU General Public License v3.0

Contributors

ammirate, bittirousku, david-caro, drjova, glignos, ioannistsanaktsidis, jacquerie, jalavik, jmartinm, kaplun, michamos, mihaibivol, mjedr, monaawi, panos512, pascalegn, pazembrz, rikirenz, salmanmaq, spirosdelviniotis, szymonlopaciuk, tomaszgy, tsgit, vbalbp, zzacharo


inspire-dojson's Issues

pub info for books in journal field

From @annetteholtkamp on October 16, 2017 9:45

Expected Behavior

The publication info for books should appear in the 260 field.

Current Behavior

Lots of older books have their pub info in 773__x, usually in the form
773__x:Cambridge, Uk: Univ. Pr. ( 1985) 376p
Sometimes series info appears in parentheses at the end, e.g.
Berlin, Germany: Suhrkamp (2012) 481 p, (Suhrkamp Taschenbuch Wissenschaft 2033)

Steps to Reproduce (for bugs)

tc b and not 260__b:** and 773__x:** and not 773__p:**
https://inspirehep.net/search?wl=0&ln=en&p=tc+b+and+not+260__b%3A**+and+773__x%3A**+and+not+773__p%3A**&of=hb&action_search=Search&sf=earliestdate&so=d&rm=&rg=250&sc=0

Context

This results, for example, in wrong BibTeX entries.
It should be checked whether the info in 773__x is already present in 260, 300, and 490.
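
A minimal parsing sketch for the common 773__x shape shown above (the regex and helper name are assumptions, not existing inspire-dojson code):

import re

# Parse a legacy 773__x book publication string such as
# "Cambridge, Uk: Univ. Pr. ( 1985) 376p" into place, publisher, year, pages.
PUBINFO_773X = re.compile(
    r'^(?P<place>[^:]+):\s*'        # "Cambridge, Uk:"
    r'(?P<publisher>[^(]+?)\s*'     # "Univ. Pr."
    r'\(\s*(?P<year>\d{4})\s*\)'    # "( 1985)" -- tolerates stray spaces
    r'(?:\s*(?P<pages>\d+)\s*p)?'   # optional "376p" or "481 p"
)

def parse_773x(value):
    """Return place/publisher/year/pages, or None if the string is unparseable."""
    match = PUBINFO_773X.match(value.strip())
    if not match:
        return None
    return {k: v.strip() if v else None for k, v in match.groupdict().items()}

Trailing series info in parentheses, as in the Suhrkamp example, is simply left unmatched by this pattern and could be captured separately.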


Copied from original issue: inspirehep/inspire-next#2864

dojson: more robust external_system_identifiers

The current implementation of external_system_identifiers has some shortcomings:

  • it currently considers only the first 035__a and 035__9, losing any additional values;
  • if the value is not set, it still adds an entry with only the schema (producing an invalid record);
  • when exporting back to MARC, it assigns values to $a or $z based only on a whitelist of schemas.

For example, in record https://inspirehep.net/record/700376 there is:

<datafield tag="035" ind1=" " ind2=" ">
  <subfield code="9">OSTI</subfield>
  <subfield code="a">892532</subfield>
</datafield>
<datafield tag="035" ind1=" " ind2=" ">
  <subfield code="9">OSTI</subfield>
  <subfield code="z">897192</subfield>
</datafield>

The second OSTI is currently lost.
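
A hypothetical sketch of a more robust rule (the helper name and output shape are assumptions, not the current code): collect every 035__a and 035__z value instead of just the first, and never emit an entry that has a schema but no value.

def external_system_identifiers(fields_035):
    def as_list(value):
        if value is None:
            return []
        return list(value) if isinstance(value, (list, tuple)) else [value]

    ids = []
    for field in fields_035:
        schema = field.get('9')
        # iterating over actual values means a schema with no value
        # produces no entry, so no invalid record
        for value in as_list(field.get('a')) + as_list(field.get('z')):
            ids.append({'schema': schema, 'value': value})
    return ids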

dojson: handle obsolete ORCIDs

From @annetteholtkamp on March 24, 2017 9:15

In rare cases an author may have two ORCIDs. Although we'll try to get these cases resolved on the ORCID side, we should be prepared to deal with them. Currently they are put into 035__z. In the new schema ORCIDs should be unique; a second ORCID should be automatically hidden.
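
A minimal sketch of the hiding step, assuming ids is a list of {'schema': ..., 'value': ...} dicts as elsewhere in the records (using 'hidden' as the marker is an assumption):

def hide_extra_orcids(ids):
    seen_orcid = False
    for id_ in ids:
        if id_.get('schema') == 'ORCID':
            if seen_orcid:
                id_['hidden'] = True  # assumed flag for obsolete duplicates
            seen_orcid = True
    return ids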

Copied from original issue: inspirehep/inspire-next#2129

hep: split mashed up authors lists

See:

@pytest.mark.xfail(reason='should split mashed up author list')
def test_authors_supervisors_from_100_a_u_w_y_z_and_701__double_a_u_z():
    schema = load_schema('hep')
    subschema = schema['properties']['authors']

    snippet = (
        '<record>'
        '  <datafield tag="100" ind1=" " ind2=" ">'
        '    <subfield code="a">Lang, Brian W.</subfield>'
        '    <subfield code="u">Minnesota U.</subfield>'
        '    <subfield code="w">B.W.Lang.1</subfield>'
        '    <subfield code="y">0</subfield>'
        '    <subfield code="z">903010</subfield>'
        '  </datafield>'
        '  <datafield tag="701" ind1=" " ind2=" ">'
        '    <subfield code="a">Poling, Ron</subfield>'
        '    <subfield code="a">Kubota, Yuichi</subfield>'
        '    <subfield code="u">Minnesota U.</subfield>'
        '    <subfield code="z">903010</subfield>'
        '  </datafield>'
        '</record>'
    )  # record/776962/export/xme

    expected = [
        {
            'affiliations': [
                {
                    'record': {
                        '$ref': 'http://localhost:5000/api/institutions/903010',
                    },
                    'value': 'Minnesota U.',
                },
            ],
            'full_name': 'Lang, Brian W.',
            'ids': [
                {
                    'schema': 'INSPIRE BAI',
                    'value': 'B.W.Lang.1',
                },
            ],
        },
        {
            'affiliations': [
                {
                    'value': 'Minnesota U.',
                    'record': {
                        '$ref': 'http://localhost:5000/api/institutions/903010',
                    },
                },
            ],
            'full_name': 'Poling, Ron',
            'inspire_roles': [
                'supervisor',
            ],
        },
        {
            'affiliations': [
                {
                    'value': 'Minnesota U.',
                    'record': {
                        '$ref': 'http://localhost:5000/api/institutions/903010',
                    },
                },
            ],
            'full_name': 'Kubota, Yuichi',
            'inspire_roles': [
                'supervisor',
            ],
        },
    ]
    result = hep.do(create_record(snippet))

    assert validate(result['authors'], subschema) is None
    assert expected == result['authors']

    expected_100 = {
        'a': 'Lang, Brian W.',
        'u': [
            'Minnesota U.',
        ],
    }
    expected_701 = [
        {
            'a': 'Poling, Ron',
            'u': [
                'Minnesota U.',
            ],
        },
        {
            'a': 'Kubota, Yuichi',
            'u': [
                'Minnesota U.',
            ],
        },
    ]
    result = hep2marc.do(result)

    assert expected_100 == result['100']
    assert expected_701 == result['701']
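
A hypothetical sketch of the missing splitting step (names and structure are assumptions, not the actual rule): when a single 701 field carries several $a subfields, emit one supervisor per name, copying the shared subfields.

def split_supervisors(field_701):
    # field_701 is the dict of subfields of one 701 field
    names = field_701.get('a')
    names = list(names) if isinstance(names, (list, tuple)) else [names]
    shared = {code: field_701.get(code) for code in ('u', 'z')}
    return [dict(shared, a=name) for name in names]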

normalize ISBNs to ISBN13 without dashes

On legacy, various ISBN formats are used: ISBN-10 and ISBN-13, with no separator or separated by spaces or dashes. They should all be normalized to ISBN-13 without separators, of the form "978123456789X". Care has to be taken that some ISBNs on legacy are invalid: those should be migrated as-is (and logged if possible).
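
A minimal sketch using the isbnlib library (an assumption; the migration may use different tooling): normalize valid ISBNs to dashless ISBN-13 and pass invalid values through unchanged so they can be migrated as-is.

import isbnlib

def normalize_isbn(raw):
    candidate = isbnlib.canonical(raw)  # strips spaces and dashes
    if isbnlib.is_isbn10(candidate):
        return isbnlib.to_isbn13(candidate)
    if isbnlib.is_isbn13(candidate):
        return candidate
    return raw  # invalid ISBN: migrate as-is (and log if possible)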

add_inspire_categories is completely broken

Reviewing the uses of classify_field, I discovered this:

def add_inspire_categories(record, blob):
    if not record.get('arxiv_eprints') or record.get('inspire_categories'):
        return record

    for arxiv_category in force_list(get_value(record, 'arxiv_eprints.categories')):
        inspire_category = classify_field(arxiv_category)
        if inspire_category:
            record['inspire_category'] = [
                {
                    'source': 'arxiv',
                    'term': inspire_category,
                },
            ]

    return record
return record

It is completely broken as:

  1. inspire_category does not exist in the schema (inspire_categories is the correct form);
  2. it overwrites the output on every iteration.

I don't know if it is ever run, as it requires arxiv_eprints to be present but inspire_categories not to be, which should be pretty rare, and I have never seen any migration error because of this field.
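
If it ever needs to run correctly, a possible fix (a sketch, not a tested patch) would be to write to the correct inspire_categories key and accumulate one entry per arXiv category instead of overwriting on each iteration:

def add_inspire_categories(record, blob):
    if not record.get('arxiv_eprints') or record.get('inspire_categories'):
        return record

    categories = []
    for arxiv_category in force_list(get_value(record, 'arxiv_eprints.categories')):
        inspire_category = classify_field(arxiv_category)
        if inspire_category:
            categories.append({'source': 'arxiv', 'term': inspire_category})

    if categories:
        record['inspire_categories'] = categories
    return record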

It would be good to review the whole logic around categories in dojson as it's quite intricate.

refextract: texkey extraction

refextract extracts texkeys from PDFs. These texkeys are the ones used by the citing author and are not necessarily INSPIRE texkeys, so the syntax of the extracted texkeys needs to be checked before adding them.
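
A sketch of such a check (the regex is an assumption about the texkey format, e.g. 'Surname:2004abc' — name, colon, four-digit year, two or three lowercase letters; the real validation may differ):

import re

INSPIRE_TEXKEY = re.compile(r"^[A-Za-z'.\-]+:\d{4}[a-z]{2,3}$")

def looks_like_inspire_texkey(texkey):
    return bool(INSPIRE_TEXKEY.match(texkey))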

hep: migrate accelerator information

If a record has a 693__a but no 693__e, that field should be migrated to accelerator_experiments.accelerator. Otherwise, the 693__a should be discarded.
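
A minimal sketch of the proposed migration (not the actual dojson rule; names are assumptions):

def migrate_693(value):
    # value is the dict of subfields of one 693 field
    if value.get('a') and not value.get('e'):
        return {'accelerator_experiments': [{'accelerator': value.get('a')}]}
    return {}  # discard 693__a when an experiment ($e) is present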

utils: spin out jsonref utils to inspire-utils

These two guys:

def get_recid_from_ref(ref_obj):
    """Retrieve recid from jsonref reference object.

    If no recid can be parsed, returns None.
    """
    if not isinstance(ref_obj, dict):
        return None
    url = ref_obj.get('$ref', '')
    return maybe_int(url.split('/')[-1])

and

def get_record_ref(recid, endpoint='record'):
    """Create record jsonref reference object from recid.

    None recids will return a None object.
    Valid recids will return an object in the form of: {'$ref': url_for_record}
    """
    if recid is None:
        return None
    return {'$ref': absolute_url('/api/{}/{}'.format(endpoint, recid))}

The problem is that they depend on absolute_url, so this requires #57 to be fixed.
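
For illustration, a usage sketch (assuming a configuration where absolute_url resolves against http://localhost:5000, as in the tests):

>>> get_record_ref(123, 'institutions')
{'$ref': 'http://localhost:5000/api/institutions/123'}
>>> get_recid_from_ref({'$ref': 'http://localhost:5000/api/institutions/123'})
123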

Is storing the absolute URLs in $refs a good idea?

From @jmartinm on July 5, 2016 14:50

At the moment, as far as I can see, both in the database and in Elasticsearch the $refs between records are expressed as absolute URLs. Talking to @david-caro, the question came up of how, for example, one could load a dump of production locally and still have it work (if the references are absolute).

I have seen a couple of 'related' issues on Invenio:

inveniosoftware/invenio-records#117
inveniosoftware/invenio-jsonschemas#23

The same problem might arise, for instance, the moment we switch from labs.inspirehep.net to inspirehep.net: what will we do with the $refs?

Copied from original issue: inspirehep/inspire-next#1295

conferences should not populate postal_address

As noticed by @Dinika, for conferences we populate in addresses not only cities, state, and country, but also postal_address. The postal_address is the unparsed address coming from legacy, which is parsed to produce the other fields. This makes it redundant, and it should not be used for conferences, as stated by the schema: https://github.com/inspirehep/inspire-schemas/blob/cdfff3a630c7c313d4af0ec70ddd95a441044962/inspire_schemas/records/conferences.yml#L45-L46.
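
A minimal sketch of the intended behavior (field names assumed from the inspire-schemas addresses definition): keep only the parsed parts and drop the raw legacy string.

def conference_address(parsed):
    address = {
        'cities': parsed.get('cities'),
        'state': parsed.get('state'),
        'country_code': parsed.get('country_code'),
        # 'postal_address' intentionally omitted: it is the unparsed legacy value
    }
    return {key: value for key, value in address.items() if value}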

'authors.full_name' should be unicode

There might be a discrepancy in the type that full_name has when it leaves dojson.

Why

Well, firstly, from @iulianav's PR and the discussion here we got the first hint that this is happening (with her latest commit).

Secondly, after a discussion with @jacquerie IRL, we had the impression that it was inspire_utils.name::normalize_name that had the issue.

But then in this commit, after I converted all string literals in the expected_name from str to unicode, all the tests pass (without me having to change anything in the name module), which means that normalize_name does return unicode.
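
A Python 2 illustration of why the type matters: ASCII str and unicode literals compare equal, so tests can pass either way, but byte strings with non-ASCII characters do not.

>>> 'Lang, Brian W.' == u'Lang, Brian W.'
True
>>> 'M\xc3\xbcller' == u'M\xfcller'  # UTF-8 bytes vs. unicode code point
False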

dojson: clarify the API

From @jacquerie on April 21, 2017 10:22

Currently the API of dojson is part in https://github.com/inspirehep/inspire-next/tree/271c1a0936dd188c97f9fe7ae1b365ab0368ff25/inspirehep/dojson/utils, part in https://github.com/inspirehep/inspire-next/blob/271c1a0936dd188c97f9fe7ae1b365ab0368ff25/inspirehep/dojson/processors.py.

It should ideally all live in inspirehep/dojson/api.py, and all utils that are used elsewhere should be moved to a higher scope (for example, validate should live in inspire_schemas).

Copied from original issue: inspirehep/inspire-next#2248

hepnames: support ids in 701__i and 701__w

See:

@pytest.mark.xfail(reason='identifiers in i and w are not handled')
def test_advisors_from_701__a_g_i():
    schema = load_schema('authors')
    subschema = schema['properties']['advisors']

    snippet = (
        '<datafield tag="701" ind1=" " ind2=" ">'
        '  <subfield code="a">Rivelles, Victor O.</subfield>'
        '  <subfield code="g">PhD</subfield>'
        '  <subfield code="i">INSPIRE-00120420</subfield>'
        '  <subfield code="x">991627</subfield>'
        '  <subfield code="y">1</subfield>'
        '</datafield>'
    )  # record/1474091/export/xme

    expected = [
        {
            'name': 'Rivelles, Victor O.',
            'degree_type': 'PhD',
            'ids': [
                {
                    'schema': 'INSPIRE ID',
                    'value': 'INSPIRE-00120420',
                },
            ],
            'record': {
                '$ref': 'http://localhost:5000/api/authors/991627',
            },
            'curated_relation': True,
        },
    ]
    result = hepnames.do(create_record(snippet))

    assert validate(result['advisors'], subschema) is None
    assert expected == result['advisors']

    expected = [
        {
            'a': 'Rivelles, Victor O.',
            'g': 'PhD',
            'i': 'INSPIRE-00120420',
        },
    ]
    result = hepnames2marc.do(result)

    assert expected == result['701']

missing test dependencies

After installing in a clean environment, the following dependencies were missing to run the tests (besides those installed by pip install -e .): pytest, pytest-coverage, pytest-flake8.
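
So in a clean environment, something like the following (an assumption about the intended dev setup) gets the test suite running:

pip install -e . pytest pytest-coverage pytest-flake8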

hep: don't normalize again names in references

See:

@pytest.mark.xfail(reason="normalized names don't stay normalized")
def test_references_from_999C59_h_m_o_double_r_y():
schema = load_schema('hep')
subschema = schema['properties']['references']
snippet = (
'<datafield tag="999" ind1="C" ind2="5">'
' <subfield code="9">CURATOR</subfield>'
' <subfield code="h">Bennett, J</subfield>'
' <subfield code="m">Roger J. et al.</subfield>'
' <subfield code="o">9</subfield>'
' <subfield code="r">CERN-INTC-2004-016</subfield>'
' <subfield code="r">CERN-INTCP-186</subfield>'
' <subfield code="y">2004</subfield>'
'</datafield>'
) # record/1449990
expected = [
{
'reference': {
'authors': [
{'full_name': 'Bennett, J'},
],
'label': '9',
'misc': [
'Roger J. et al.',
],
'publication_info': {'year': 2004},
'report_numbers': [
'CERN-INTC-2004-016',
'CERN-INTCP-186',
],
},
},
]
result = hep.do(create_record(snippet))
assert validate(result['references'], subschema) is None
assert expected == result['references']
expected = [
{
'h': [
'Bennett, J',
],
'r': [
'CERN-INTCP-186',
'CERN-INTC-2004-016',
],
'm': 'Roger J. et al.',
'o': '9',
'y': 2004,
},
]
result = hep2marc.do(result)
assert expected == result['999C5']

multiple legacy_creation_dates crash marcxml2record

https://sentry.inspirehep.net/inspire-sentry/prod/issues/71500/

In HepNames some records preserve the creation dates of their ancestors that were merged, so subfield 961__x may be repeated.

000982164 961__ $$x1996-09-01$$x2006-04-21
000982164 961__ $$c2011-06-30$$c2013-03-09
000982182 961__ $$x2000-05-08$$x2008-06-30
000982182 961__ $$c2011-09-06$$c2009-06-07
000982514 961__ $$x2000-04-10$$x2008-02-14
000982514 961__ $$c2009-06-07$$c2013-04-08
000982535 961__ $$x1996-07-15$$x2008-07-25
000982535 961__ $$c2009-06-07
001005647 961__ $$x1992-06-25$$x1996-07-15
001005647 961__ $$c2009-06-07
001013833 961__ $$x1988-05-22$$x1990-05-28
001013833 961__ $$c2009-06-07

Currently this fails conversion at

https://github.com/inspirehep/inspire-dojson/blob/master/inspire_dojson/common/rules.py#L873

/scratch/venvs/dojson/lib/python2.7/site-packages/inspire_dojson/common/rules.pyc in legacy_creation_date(self, key, value)
    871         return self['legacy_creation_date']
    872 
--> 873     return normalize_date(value.get('x'))
    874 
    875 

/scratch/venvs/dojson/lib/python2.7/site-packages/inspire_utils/date.pyc in normalize_date(date, **kwargs)
    230         return
    231 
--> 232     return PartialDate.parse(date, **kwargs).dumps()
    233 
    234 

/scratch/venvs/dojson/lib/python2.7/site-packages/inspire_utils/date.pyc in parse(cls, date, **kwargs)
    147         default_date2 = datetime.datetime(2, 2, 2)
    148 
--> 149         parsed_date1 = parse_date(date, default=default_date1, **kwargs)
    150         parsed_date2 = parse_date(date, default=default_date2, **kwargs)
    151 

/scratch/venvs/dojson/lib/python2.7/site-packages/dateutil/parser/_parser.pyc in parse(timestr, parserinfo, **kwargs)
   1310         return parser(parserinfo).parse(timestr, **kwargs)
   1311     else:
-> 1312         return DEFAULTPARSER.parse(timestr, **kwargs)
   1313 
   1314 

/scratch/venvs/dojson/lib/python2.7/site-packages/dateutil/parser/_parser.pyc in parse(self, timestr, default, ignoretz, tzinfos, **kwargs)
    602                                                       second=0, microsecond=0)
    603 
--> 604         res, skipped_tokens = self._parse(timestr, **kwargs)
    605 
    606         if res is None:

/scratch/venvs/dojson/lib/python2.7/site-packages/dateutil/parser/_parser.pyc in _parse(self, timestr, dayfirst, yearfirst, fuzzy, fuzzy_with_tokens)
    678 
    679         res = self._result()
--> 680         l = _timelex.split(timestr)         # Splits the timestr into tokens
    681 
    682         skipped_idxs = []

/scratch/venvs/dojson/lib/python2.7/site-packages/dateutil/parser/_parser.pyc in split(cls, s)
    205     @classmethod
    206     def split(cls, s):
--> 207         return list(cls(s))
    208 
    209     @classmethod

/scratch/venvs/dojson/lib/python2.7/site-packages/dateutil/parser/_parser.pyc in __init__(self, instream)
     74         elif getattr(instream, 'read', None) is None:
     75             raise TypeError('Parser must be a string or character stream, not '
---> 76                             '{itype}'.format(itype=instream.__class__.__name__))
     77 
     78         self.instream = instream

TypeError: Parser must be a string or character stream, not tuple
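
A possible fix (a sketch, not the actual patch): when 961__x repeats, value.get('x') is a tuple, which is what normalize_date() chokes on. Picking the earliest date would preserve the original creation date:

from inspire_utils.date import normalize_date

def legacy_creation_date(self, key, value):
    if 'legacy_creation_date' in self:
        return self['legacy_creation_date']

    dates = value.get('x')
    if isinstance(dates, (list, tuple)):
        dates = min(dates)  # ISO-formatted dates sort chronologically
    return normalize_date(dates)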
