inspirehep / inspire-dojson Goto Github PK

View Code? Open in Web Editor NEW

3.0 20.0 17.0 1.38 MB

INSPIRE-specific rules to transform from MARCXML to JSON and back.

License: GNU General Public License v3.0

Python 99.89% Shell 0.11%

inspirehep python marcxml json

inspire-dojson's Introduction

INSPIRE-DoJSON

About

INSPIRE-specific rules to transform from MARCXML to JSON and back.

inspire-dojson's People

Contributors

Stargazers

Watchers

Forkers

david-caro michamos kaplun rikirenz jmartinm zzacharo iulianav ammirate glignos vbalbp szymonlopaciuk drjova salmanmaq monaawi pazembrz mjedr pascalegn

inspire-dojson's Issues

doJSON: export accelerator in 119__b for experiments

From @annetteholtkamp on March 15, 2017 13:9

A new subfield b in 119 has been defined for the accelerator. This has to be exported into the field "accelerator" of the experiment schema.

Copied from original issue: inspirehep/inspire-next#2038

pub info for books in journal field

From @annetteholtkamp on October 16, 2017 9:45

Expected Behavior

The publication info for books should appear in the 260 field

Current Behavior

Lots of older books have their pub info in 773__x, usually in the form
773__x:Cambridge, Uk: Univ. Pr. ( 1985) 376p
Sometimes series info appears in parentheses at the end, e.g.
Berlin, Germany: Suhrkamp (2012) 481 p, (Suhrkamp Taschenbuch Wissenschaft 2033)

Steps to Reproduce (for bugs)

tc b and not 260__b:** and 773__x:** and not 773__p:**
https://inspirehep.net/search?wl=0&ln=en&p=tc+b+and+not+260__b%3A**+and+773__x%3A**+and+not+773__p%3A**&of=hb&action_search=Search&sf=earliestdate&so=d&rm=&rg=250&sc=0

Context

Results e.g. in wrong bibtex entries.
It should be checked whether the info in 773__x is already present in 260, 300 and 490

Screenshots (if appropriate):

Copied from original issue: inspirehep/inspire-next#2864

dojson: more robust external_system_identifiers

The current implementation of external_system_identifier has some short comings:

currently considers only the first 035__a and 035__9 losing any additional value.
in case value is not set, it still adds an entry with the schema (producing invalid record).
exporting back to MARC associates to $a and $z only based on a whitelist of schemas.

So e.g. in record https://inspirehep.net/record/700376 there:

<datafield tag="035" ind1=" " ind2=" ">
<subfield code="9">OSTI</subfield>
<subfield code="a">892532</subfield>
</datafield>
<datafield tag="035" ind1=" " ind2=" ">
<subfield code="9">OSTI</subfield>
<subfield code="z">897192</subfield>
</datafield>

The second OSTI is currently lost.

dojson: handle obsolete ORCIDs

From @annetteholtkamp on March 24, 2017 9:15

In rare cases an author may have 2 Orcid's. Although we'll try to get these cases solved on the Orccid side we should be prepared to deal with them. Currently they are put into 035__z. In the new schema Orcid's should be unique. A second Orcid should be automatically hidden.

Copied from original issue: inspirehep/inspire-next#2129

docs: integrate Sphinx and ReadTheDocs

Because it's the right thing to do.

hep: split mashed up authors lists

See:

inspire-dojson/tests/test_hep_bd1xx.py

Lines 1190 to 1288 in 9179090

 @pytest.mark.xfail(reason='should split mashed up author list') 

 def test_authors_supervisors_from_100_a_u_w_y_z_and_701__double_a_u_z(): 

 schema = load_schema('hep') 

 subschema = schema['properties']['authors'] 

 snippet = ( 

 '<record>' 

 ' <datafield tag="100" ind1=" " ind2=" ">' 

 ' <subfield code="a">Lang, Brian W.</subfield>' 

 ' <subfield code="u">Minnesota U.</subfield>' 

 ' <subfield code="w">B.W.Lang.1</subfield>' 

 ' <subfield code="y">0</subfield>' 

 ' <subfield code="z">903010</subfield>' 

 ' </datafield>' 

 ' <datafield tag="701" ind1=" " ind2=" ">' 

 ' <subfield code="a">Poling, Ron</subfield>' 

 ' <subfield code="a">Kubota, Yuichi</subfield>' 

 ' <subfield code="u">Minnesota U.</subfield>' 

 ' <subfield code="z">903010</subfield>' 

 ' </datafield>' 

 '</record>' 

 ) # record/776962/export/xme 

 expected = [ 

 { 

 'affiliations': [ 

 { 

 'record': { 

 '$ref': 'http://localhost:5000/api/institutions/903010', 

 }, 

 'value': 'Minnesota U.', 

 }, 

 ], 

 'full_name': 'Lang, Brian W.', 

 'ids': [ 

 { 

 'schema': 'INSPIRE BAI', 

 'value': 'B.W.Lang.1', 

 }, 

 ], 

 }, 

 { 

 'affiliations': [ 

 { 

 'value': 'Minnesota U.', 

 'record': { 

 '$ref': 'http://localhost:5000/api/institutions/903010' 

 } 

 } 

 ], 

 'full_name': 'Poling, Ron', 

 'inspire_roles': [ 

 'supervisor', 

 ], 

 }, 

 { 

 'affiliations': [ 

 { 

 'value': 'Minnesota U.', 

 'record': { 

 '$ref': 'http://localhost:5000/api/institutions/903010', 

 }, 

 }, 

 ], 

 'full_name': 'Kubota, Yuichi', 

 'inspire_roles': [ 

 'supervisor', 

 ], 

 } 

 ] 

 result = hep.do(create_record(snippet)) 

 assert validate(result['authors'], subschema) is None 

 assert expected == result['authors'] 

 expected_100 = { 

 'a': 'Lang, Brian W.', 

 'u': [ 

 'Minnesota U.', 

 ], 

 } 

 expected_701 = [ 

 { 

 'a': 'Poling, Ron', 

 'u': [ 

 'Minnesota U.', 

 ], 

 }, 

 { 

 'a': 'Kubota, Yuichi', 

 'u': [ 

 'Minnesota U.', 

 ], 

 }, 

 ] 

 result = hep2marc.do(result) 

 assert expected_100 == result['100'] 

 assert expected_701 == result['701']

normalize ISBNs to ISBN13 without dashes

On legacy, various ISBN formats are used: ISBN 10 and ISBN 13, without separator or separated by spaces or dashes. They should all be normalized to ISBN 13 without separator, of the form "978123456789X". Care has to be taken that some ISBNs on legacy are invalid: those should be migrated as is (and logged if possible).

utils: move validation to inspire-schemas

Because it's the right thing to do.

add_inspire_categories is completely broken

reviewing the uses of classify_field, I discovered this:

inspire-dojson/inspire_dojson/hep/model.py

Lines 51 to 65 in a37eaa7

 def add_inspire_categories(record, blob): 

 if not record.get('arxiv_eprints') or record.get('inspire_categories'): 

 return record 

 for arxiv_category in force_list(get_value(record, 'arxiv_eprints.categories')): 

 inspire_category = classify_field(arxiv_category) 

 if inspire_category: 

 record['inspire_category'] = [ 

 { 

 'source': 'arxiv', 

 'term': inspire_category, 

 }, 

 ] 

 return record

It is completely broken as:

inspire_category does not exist in the schema (inspire_categories is the correct form);
it overwrites the output on every iteration.

I don't know if it is ever run, as it requires arxiv_eprints to be present but inspire_categories not to be, which should be pretty rare, and I have never seen any migration error because of this field.

It would be good to review the whole logic around categories in dojson as it's quite intricate.

global: support Python 3.6

Because it's the right thing to do.

hepnames: records with only a preferred_name

See: https://sentry.cern.ch/inspire-sentry/inspire-labs/group/822769/.

dojson: kill the geo utils

From @jacquerie on April 21, 2017 10:20

This entire file should be removed and replaced with a proper solution: https://github.com/inspirehep/inspire-next/blob/271c1a0936dd188c97f9fe7ae1b365ab0368ff25/inspirehep/dojson/utils/geo.py.

Copied from original issue: inspirehep/inspire-next#2247

refextract: texkey extraction

refextract extracts texkeys from pdf. These texkeys are the ones used by the author and are not necessarily Inspire texkeys. So the syntax of the extracted texkeys need to be checked before adding them.

global: remove dependency on invenio-utils

Same as inspirehep/inspire-next#2207.

hep: migrate accelerator information

If a record has a 693__a but no 693__e, that field should be migrated to acelerator_experiments.accelerator. Otherwise, the 693__a should be discarded.

bdFFT: Handle the case that `d` subfield is not defined in `FFT` datafield

Currently when d subfield is not defined the code is crushing on the attempt to match the caption pattern with a None value giving a TypeError: expected string or buffer.

cds: normalize author names

The convention regarding initials differs, so names need to be normalized.

dojson: kill the "current" MARCXML usage

From @jacquerie on April 21, 2017 10:23

https://github.com/inspirehep/inspire-next/tree/271c1a0936dd188c97f9fe7ae1b365ab0368ff25/inspirehep/dojson/current_marcxml_usage is not current anymore, and it didn't belong in a codebase anyway: it's data.

Copied from original issue: inspirehep/inspire-next#2249

Email with email: prefix

From @kaplun on June 27, 2017 14:56

@hoc3426 is adding email: in front of emails on legacy. We shall consider this in dojson script.

Copied from original issue: inspirehep/inspire-next#2497

hep: populate documents and figures from FFT

documents and figures should be populated instead of the _fft key from FFT.

utils: spin out jsonref utils to inspire-utils

These two guys:

inspire-dojson/inspire_dojson/utils/__init__.py

Lines 77 to 85 in 404f7ee

 def get_recid_from_ref(ref_obj): 

 """Retrieve recid from jsonref reference object. 

  If no recid can be parsed, returns None. 

  """ 

 if not isinstance(ref_obj, dict): 

 return None 

 url = ref_obj.get('$ref', '') 

 return maybe_int(url.split('/')[-1])

and

inspire-dojson/inspire_dojson/utils/__init__.py

Lines 101 to 109 in 404f7ee

 def get_record_ref(recid, endpoint='record'): 

 """Create record jsonref reference object from recid. 

  None recids will return a None object. 

  Valid recids will return an object in the form of: {'$ref': url_for_record} 

  """ 

 if recid is None: 

 return None 

 return {'$ref': absolute_url('/api/{}/{}'.format(endpoint, recid))}

The problem is that they depend on absolute_url, so this requires #57 to be fixed.

Is storing the absolute URLs in $refs a good idea?

From @jmartinm on July 5, 2016 14:50

At the moment, as far as I can see, both in the database and in ElasticSearch the $refs between records are expressed as absolute URLs. Talking to @david-caro the question came up of how, for example, could one load a dump of production locally and still have it working (if the references are absolute).

I have seen a couple of 'related' issues on Invenio:

inveniosoftware/invenio-records#117
inveniosoftware/invenio-jsonschemas#23

This same problem might happen, for instance, the moment we switch from labs.inspirehep.net to inspirehep.net, what will we do with the $refs?

Copied from original issue: inspirehep/inspire-next#1295

conferences should not populate postal_address

As noticed by @Dinika, for conferences we are populating in addresses not only cities, state and country, but also postal_address. The postal_address is the unparsed address coming from legacy, that is parsed to produce the other fields. This makes it redundant and it should not be used for conferences, as stated by the schema https://github.com/inspirehep/inspire-schemas/blob/cdfff3a630c7c313d4af0ec70ddd95a441044962/inspire_schemas/records/conferences.yml#L45-L46.

'authors.full_name' should be unicode

There might be a discrepancy on what type the full_name has when it leaves dojson.

Why

Well, firstly from @iulianav 's PR and the discussion here we got the first idea that this is happening (with her latest commit).

Secondly, after a discussion with @jacquerie IRL, we had the impression that it was inspire_utils.name::normalize_name that had the issue.

But then in this commit, after I converted all string literals in the expected_name from str to unicode, all the tests pass (without me having to change anythin in the name module, which means that normalize_name does return unicode.

dojson: clarify the API

From @jacquerie on April 21, 2017 10:22

Currently the API of dojson is part in https://github.com/inspirehep/inspire-next/tree/271c1a0936dd188c97f9fe7ae1b365ab0368ff25/inspirehep/dojson/utils, part in https://github.com/inspirehep/inspire-next/blob/271c1a0936dd188c97f9fe7ae1b365ab0368ff25/inspirehep/dojson/processors.py.

It should ideally all live in inspirehep/dojson/api.py, and all utils that are used elsewhere should be moved in a higher scope (for example: validate should live in inspire_schemas).

Copied from original issue: inspirehep/inspire-next#2248

Populate documents.source from FFT doctype

This is not currently done.

dojson: recognize more defense dates from 500__a

From @jacquerie on April 11, 2017 8:13

inspirehep/inspire-next#2052 expanded the regexp that we use to extract the defense date from the 500 MARC key, but is still missing cases like

<datafield tag="500" ind1=" " ind2=" ">
  <subfield code="a">Presented on Dec 1992</subfield>
</datafield>

from https://inspirehep.net/record/887715/export/xme.

Copied from original issue: inspirehep/inspire-next#2215

Translate shortened link titles during migration

8564__y sometimes contain abbreviations, that are expanded into their real meaning at display time.

This dubious design has been reproduced also on labs, with the weblinks.kb being used at display time to format links.

Instead, dojson should handle the translation when populating url descriptions.

hepnames: support ids in 701i and 701w

See:

inspire-dojson/tests/test_hepnames.py

Lines 841 to 886 in 9179090

 @pytest.mark.xfail(reason='identifiers in i and w are not handled') 

 def test_advisors_from_701__a_g_i(): 

 schema = load_schema('authors') 

 subschema = schema['properties']['advisors'] 

 snippet = ( 

 '<datafield tag="701" ind1=" " ind2=" ">' 

 ' <subfield code="a">Rivelles, Victor O.</subfield>' 

 ' <subfield code="g">PhD</subfield>' 

 ' <subfield code="i">INSPIRE-00120420</subfield>' 

 ' <subfield code="x">991627</subfield>' 

 ' <subfield code="y">1</subfield>' 

 '</datafield>' 

 ) # record/1474091/export/xme 

 expected = [ 

 { 

 'name': 'Rivelles, Victor O.', 

 'degree_type': 'PhD', 

 'ids': [ 

 { 

 'schema': 'INSPIRE ID', 

 'value': 'INSPIRE-00120420' 

 } 

 ], 

 'record': { 

 '$ref': 'http://localhost:5000/api/authors/991627', 

 }, 

 'curated_relation': True 

 }, 

 ] 

 result = hepnames.do(create_record(snippet)) 

 assert validate(result['advisors'], subschema) is None 

 assert expected == result['advisors'] 

 expected = [ 

 { 

 'a': 'Rivelles, Victor O.', 

 'g': 'PhD', 

 'i': 'INSPIRE-00120420', 

 }, 

 ] 

 result = hepnames2marc.do(result) 

 assert expected == result['701']

doJSON: convert old arXiv categories

From @michamos on June 28, 2017 10:14

Expected Behavior

Some arXiv categories have changed name and don't exist anymore, so we should translate them to the new name upon migration.

Current Behavior

arXiv categories are taken literally.

Copied from original issue: inspirehep/inspire-next#2501

missing test dependencies

After installing in a clean environment, the following dependencies were missing to run the tests (besides those installed by pip install -e .:
pytest, pytest-coverage, pytest-flake8.

hep: don't normalize again names in references

See:

inspire-dojson/tests/test_hep_bd9xx.py

Lines 808 to 864 in d1524e2

 @pytest.mark.xfail(reason="normalized names don't stay normalized") 

 def test_references_from_999C59_h_m_o_double_r_y(): 

 schema = load_schema('hep') 

 subschema = schema['properties']['references'] 

 snippet = ( 

 '<datafield tag="999" ind1="C" ind2="5">' 

 ' <subfield code="9">CURATOR</subfield>' 

 ' <subfield code="h">Bennett, J</subfield>' 

 ' <subfield code="m">Roger J. et al.</subfield>' 

 ' <subfield code="o">9</subfield>' 

 ' <subfield code="r">CERN-INTC-2004-016</subfield>' 

 ' <subfield code="r">CERN-INTCP-186</subfield>' 

 ' <subfield code="y">2004</subfield>' 

 '</datafield>' 

 ) # record/1449990 

 expected = [ 

 { 

 'reference': { 

 'authors': [ 

 {'full_name': 'Bennett, J'}, 

 ], 

 'label': '9', 

 'misc': [ 

 'Roger J. et al.', 

 ], 

 'publication_info': {'year': 2004}, 

 'report_numbers': [ 

 'CERN-INTC-2004-016', 

 'CERN-INTCP-186', 

 ], 

 }, 

 }, 

 ] 

 result = hep.do(create_record(snippet)) 

 assert validate(result['references'], subschema) is None 

 assert expected == result['references'] 

 expected = [ 

 { 

 'h': [ 

 'Bennett, J', 

 ], 

 'r': [ 

 'CERN-INTCP-186', 

 'CERN-INTC-2004-016', 

 ], 

 'm': 'Roger J. et al.', 

 'o': '9', 

 'y': 2004, 

 }, 

 ] 

 result = hep2marc.do(result) 

 assert expected == result['999C5']

dojson: corrigendum

https://sentry.cern.ch/inspire-sentry/inspire-labs/group/822798/

Records with material corrigendum:

<datafield tag="024" ind1="7" ind2=" ">
    <subfield code="2">DOI</subfield>
    <subfield code="q">Corrigendum</subfield>
    <subfield code="9">bibmatch</subfield>
    <subfield code="a">10.1016/j.physletb.2014.05.010</subfield>
</datafield>

should be conflated into erratum (see also inspirehep/inspire-schemas#195 (comment))

multiple legacy_creation_dates crash marcxml2record

https://sentry.inspirehep.net/inspire-sentry/prod/issues/71500/

In HepNames some records preserve the creation dates of their ancestors that were merged, so subfield 961__x may be repeated.

000982164 961__ $$x1996-09-01$$x2006-04-21
000982164 961__ $$c2011-06-30$$c2013-03-09
000982182 961__ $$x2000-05-08$$x2008-06-30
000982182 961__ $$c2011-09-06$$c2009-06-07
000982514 961__ $$x2000-04-10$$x2008-02-14
000982514 961__ $$c2009-06-07$$c2013-04-08
000982535 961__ $$x1996-07-15$$x2008-07-25
000982535 961__ $$c2009-06-07
001005647 961__ $$x1992-06-25$$x1996-07-15
001005647 961__ $$c2009-06-07
001013833 961__ $$x1988-05-22$$x1990-05-28
001013833 961__ $$c2009-06-07

currently this fails conversion at

https://github.com/inspirehep/inspire-dojson/blob/master/inspire_dojson/common/rules.py#L873

/scratch/venvs/dojson/lib/python2.7/site-packages/inspire_dojson/common/rules.pyc in legacy_creation_date(self, key, value)
    871         return self['legacy_creation_date']
    872 
--> 873     return normalize_date(value.get('x'))
    874 
    875 

/scratch/venvs/dojson/lib/python2.7/site-packages/inspire_utils/date.pyc in normalize_date(date, **kwargs)
    230         return
    231 
--> 232     return PartialDate.parse(date, **kwargs).dumps()
    233 
    234 

/scratch/venvs/dojson/lib/python2.7/site-packages/inspire_utils/date.pyc in parse(cls, date, **kwargs)
    147         default_date2 = datetime.datetime(2, 2, 2)
    148 
--> 149         parsed_date1 = parse_date(date, default=default_date1, **kwargs)
    150         parsed_date2 = parse_date(date, default=default_date2, **kwargs)
    151 

/scratch/venvs/dojson/lib/python2.7/site-packages/dateutil/parser/_parser.pyc in parse(timestr, parserinfo, **kwargs)
   1310         return parser(parserinfo).parse(timestr, **kwargs)
   1311     else:
-> 1312         return DEFAULTPARSER.parse(timestr, **kwargs)
   1313 
   1314 

/scratch/venvs/dojson/lib/python2.7/site-packages/dateutil/parser/_parser.pyc in parse(self, timestr, default, ignoretz, tzinfos, **kwargs)
    602                                                       second=0, microsecond=0)
    603 
--> 604         res, skipped_tokens = self._parse(timestr, **kwargs)
    605 
    606         if res is None:

/scratch/venvs/dojson/lib/python2.7/site-packages/dateutil/parser/_parser.pyc in _parse(self, timestr, dayfirst, yearfirst, fuzzy, fuzzy_with_tokens)
    678 
    679         res = self._result()
--> 680         l = _timelex.split(timestr)         # Splits the timestr into tokens
    681 
    682         skipped_idxs = []

/scratch/venvs/dojson/lib/python2.7/site-packages/dateutil/parser/_parser.pyc in split(cls, s)
    205     @classmethod
    206     def split(cls, s):
--> 207         return list(cls(s))
    208 
    209     @classmethod

/scratch/venvs/dojson/lib/python2.7/site-packages/dateutil/parser/_parser.pyc in __init__(self, instream)
     74         elif getattr(instream, 'read', None) is None:
     75             raise TypeError('Parser must be a string or character stream, not '
---> 76                             '{itype}'.format(itype=instream.__class__.__name__))
     77 
     78         self.instream = instream

TypeError: Parser must be a string or character stream, not tuple

	@pytest.mark.xfail(reason='should split mashed up author list')
	def test_authors_supervisors_from_100_a_u_w_y_z_and_701__double_a_u_z():
	schema = load_schema('hep')
	subschema = schema['properties']['authors']

	snippet = (
	'<record>'
	' <datafield tag="100" ind1=" " ind2=" ">'
	' <subfield code="a">Lang, Brian W.</subfield>'
	' <subfield code="u">Minnesota U.</subfield>'
	' <subfield code="w">B.W.Lang.1</subfield>'
	' <subfield code="y">0</subfield>'
	' <subfield code="z">903010</subfield>'
	' </datafield>'
	' <datafield tag="701" ind1=" " ind2=" ">'
	' <subfield code="a">Poling, Ron</subfield>'
	' <subfield code="a">Kubota, Yuichi</subfield>'
	' <subfield code="u">Minnesota U.</subfield>'
	' <subfield code="z">903010</subfield>'
	' </datafield>'
	'</record>'
	) # record/776962/export/xme

	expected = [
	{
	'affiliations': [
	{
	'record': {
	'$ref': 'http://localhost:5000/api/institutions/903010',
	},
	'value': 'Minnesota U.',
	},
	],
	'full_name': 'Lang, Brian W.',
	'ids': [
	{
	'schema': 'INSPIRE BAI',
	'value': 'B.W.Lang.1',
	},
	],
	},
	{
	'affiliations': [
	{
	'value': 'Minnesota U.',
	'record': {
	'$ref': 'http://localhost:5000/api/institutions/903010'
	}
	}
	],
	'full_name': 'Poling, Ron',
	'inspire_roles': [
	'supervisor',
	],
	},
	{
	'affiliations': [
	{
	'value': 'Minnesota U.',
	'record': {
	'$ref': 'http://localhost:5000/api/institutions/903010',
	},
	},
	],
	'full_name': 'Kubota, Yuichi',
	'inspire_roles': [
	'supervisor',
	],
	}
	]
	result = hep.do(create_record(snippet))

	assert validate(result['authors'], subschema) is None
	assert expected == result['authors']

	expected_100 = {
	'a': 'Lang, Brian W.',
	'u': [
	'Minnesota U.',
	],
	}
	expected_701 = [
	{
	'a': 'Poling, Ron',
	'u': [
	'Minnesota U.',
	],
	},
	{
	'a': 'Kubota, Yuichi',
	'u': [
	'Minnesota U.',
	],
	},
	]
	result = hep2marc.do(result)

	assert expected_100 == result['100']
	assert expected_701 == result['701']

	def add_inspire_categories(record, blob):
	if not record.get('arxiv_eprints') or record.get('inspire_categories'):
	return record

	for arxiv_category in force_list(get_value(record, 'arxiv_eprints.categories')):
	inspire_category = classify_field(arxiv_category)
	if inspire_category:
	record['inspire_category'] = [
	{
	'source': 'arxiv',
	'term': inspire_category,
	},
	]

	return record

	def get_recid_from_ref(ref_obj):
	"""Retrieve recid from jsonref reference object.

	If no recid can be parsed, returns None.
	"""
	if not isinstance(ref_obj, dict):
	return None
	url = ref_obj.get('$ref', '')
	return maybe_int(url.split('/')[-1])

	def get_record_ref(recid, endpoint='record'):
	"""Create record jsonref reference object from recid.

	None recids will return a None object.
	Valid recids will return an object in the form of: {'$ref': url_for_record}
	"""
	if recid is None:
	return None
	return {'$ref': absolute_url('/api/{}/{}'.format(endpoint, recid))}

	@pytest.mark.xfail(reason='identifiers in i and w are not handled')
	def test_advisors_from_701__a_g_i():
	schema = load_schema('authors')
	subschema = schema['properties']['advisors']

	snippet = (
	'<datafield tag="701" ind1=" " ind2=" ">'
	' <subfield code="a">Rivelles, Victor O.</subfield>'
	' <subfield code="g">PhD</subfield>'
	' <subfield code="i">INSPIRE-00120420</subfield>'
	' <subfield code="x">991627</subfield>'
	' <subfield code="y">1</subfield>'
	'</datafield>'
	) # record/1474091/export/xme

	expected = [
	{
	'name': 'Rivelles, Victor O.',
	'degree_type': 'PhD',
	'ids': [
	{
	'schema': 'INSPIRE ID',
	'value': 'INSPIRE-00120420'
	}
	],
	'record': {
	'$ref': 'http://localhost:5000/api/authors/991627',
	},
	'curated_relation': True
	},
	]
	result = hepnames.do(create_record(snippet))

	assert validate(result['advisors'], subschema) is None
	assert expected == result['advisors']

	expected = [
	{
	'a': 'Rivelles, Victor O.',
	'g': 'PhD',
	'i': 'INSPIRE-00120420',
	},
	]
	result = hepnames2marc.do(result)

	assert expected == result['701']

	@pytest.mark.xfail(reason="normalized names don't stay normalized")
	def test_references_from_999C59_h_m_o_double_r_y():
	schema = load_schema('hep')
	subschema = schema['properties']['references']

	snippet = (
	'<datafield tag="999" ind1="C" ind2="5">'
	' <subfield code="9">CURATOR</subfield>'
	' <subfield code="h">Bennett, J</subfield>'
	' <subfield code="m">Roger J. et al.</subfield>'
	' <subfield code="o">9</subfield>'
	' <subfield code="r">CERN-INTC-2004-016</subfield>'
	' <subfield code="r">CERN-INTCP-186</subfield>'
	' <subfield code="y">2004</subfield>'
	'</datafield>'
	) # record/1449990

	expected = [
	{
	'reference': {
	'authors': [
	{'full_name': 'Bennett, J'},
	],
	'label': '9',
	'misc': [
	'Roger J. et al.',
	],
	'publication_info': {'year': 2004},
	'report_numbers': [
	'CERN-INTC-2004-016',
	'CERN-INTCP-186',
	],
	},
	},
	]
	result = hep.do(create_record(snippet))

	assert validate(result['references'], subschema) is None
	assert expected == result['references']

	expected = [
	{
	'h': [
	'Bennett, J',
	],
	'r': [
	'CERN-INTCP-186',
	'CERN-INTC-2004-016',
	],
	'm': 'Roger J. et al.',
	'o': '9',
	'y': 2004,
	},
	]
	result = hep2marc.do(result)

	assert expected == result['999C5']

inspirehep / inspire-dojson Goto Github PK

inspire-dojson's Introduction

INSPIRE-DoJSON

About

inspire-dojson's People

Contributors

Stargazers

Watchers

Forkers

inspire-dojson's Issues

Expected Behavior

Current Behavior

Steps to Reproduce (for bugs)

Context

Screenshots (if appropriate):

Why

Expected Behavior

Current Behavior

Recommend Projects

Recommend Topics

Recommend Org

Jobs