natlibfi / bib-rdf-pipeline

Scripts and configuration for converting MARC bibliographic records into RDF

License: Creative Commons Zero v1.0 Universal

Topics: marc, rdf, conversion, code4lib
bib-rdf-pipeline's Introduction

bib-rdf-pipeline

This repository contains various scripts and configuration for converting MARC bibliographic records into RDF, for use at the National Library of Finland.

The main component is a conversion pipeline driven by a Makefile that defines rules for realizing the conversion steps using command line tools.

The steps of the conversion are:

  1. Start with a file of MARC records in Aleph sequential format
  2. Split the file into smaller batches
  3. Preprocess using unix tools such as grep and sed, to remove some local peculiarities
  4. Convert to MARCXML and enrich the MARC records, using Catmandu
  5. Run the Library of Congress marc2bibframe2 XSLT conversion from MARC to BIBFRAME RDF
  6. Convert the BIBFRAME RDF/XML data into N-Triples format and fix up some bad URIs
  7. Calculate work keys (e.g. author+title combination) used later for merging data about the same creative work
  8. Convert the BIBFRAME data into Schema.org RDF in N-Triples format
  9. Reconcile entities in the Schema.org data against external sources (e.g. YSA/YSO, Corporate names authority, RDA vocabularies)
  10. Merge the Schema.org data about the same works
  11. Calculate agent keys used for merging data about the same agent (person or organization)
  12. Merge the agents based on agent keys
  13. Convert the raw Schema.org data to HDT format so the full data set can be queried with SPARQL from the command line
  14. Consolidate the data by e.g. rewriting URIs and moving subjects into the original work
  15. Convert the consolidated data to HDT
  16. ??? (TBD)
  17. Profit!

Dependencies

Command line tools are assumed to be available in $PATH, but the paths can be overridden on the make command line, e.g. make CATMANDU=/opt/catmandu

For running the main suite

  • Apache Jena command line utilities sparql and rsparql
  • Catmandu utility catmandu
  • uconv utility from Ubuntu package icu-devtools
  • xsltproc utility from Ubuntu package xsltproc
  • hdt-cpp command line utilities rdf2hdt and hdtSearch
  • hdt-java command line utility hdtsparql.sh

For running the unit tests

In addition to the above:

  • bats in $PATH
  • xmllint utility from Ubuntu package libxml2-utils in $PATH

bib-rdf-pipeline's People

Contributors: osma

bib-rdf-pipeline's Issues

Structured page counts are not valid in schema.org

Our MARC records have structured page counts, e.g. "vii, 89, 31 s.". However, Schema.org only defines a single integer field, schema:numberOfPages, so the structured values are not really valid Schema.org.

Maybe we should convert those structured counts into a single number? It can't be done easily in SPARQL (Roman numerals!), but a relatively simple filter script (e.g. in Python) could do it.
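
The issue suggests a simple filter script; a minimal sketch of one in Python follows. The exact normalization rules (summing the comma-separated parts, reading Roman numerals, ignoring anything unparseable) are assumptions, not something this issue has decided on.

    import re

    ROMAN = {'i': 1, 'v': 5, 'x': 10, 'l': 50, 'c': 100, 'd': 500, 'm': 1000}

    def roman_to_int(numeral):
        """Convert a lowercase Roman numeral such as 'vii' to an integer."""
        total, prev = 0, 0
        for char in reversed(numeral):
            value = ROMAN[char]
            total += value if value >= prev else -value
            prev = max(prev, value)
        return total

    def page_count(statement):
        """Sum the parts of a structured page count like 'vii, 89, 31 s.'."""
        total, found = 0, False
        for part in statement.split(','):
            token = part.strip().lower()
            m = re.match(r'(\d+)', token)
            if m:
                total += int(m.group(1))
                found = True
                continue
            m = re.match(r'([ivxlcdm]+)\b', token)
            if m:
                total += roman_to_int(m.group(1))
                found = True
        return total if found else None

    print(page_count('vii, 89, 31 s.'))   # -> 127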

Include part information in schema.org titles

For records whose title has numbered parts (245 $n / $p), the part information should be included in the schema:name. Currently it is lost. E.g. this one:

000770276 24510 L $$aKootut teokset.$$n3,$$pNäytelmiä: Olviretki Schleusingenissä ; Leo ja Liisa ; Canzino ; Selman juonet ; Alma /$$cAleksis Kivi ; julk. E. A. Saarimaa.

The part information is already used for the work keys.
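
A minimal sketch of the intended schema:name construction, in Python. The plain (code, value) representation of the 245 subfields and the ". " joining punctuation are assumptions for illustration; the real fix belongs in the BIBFRAME-to-Schema.org conversion.

    def build_title(subfields):
        """Build a schema:name value from 245 $a/$n/$p subfields.

        `subfields` is a list of (code, value) pairs in field order.
        """
        parts = [value.strip() for code, value in subfields if code in ('a', 'n', 'p')]
        parts = [p.rstrip(' /,.:;') for p in parts]   # drop trailing ISBD punctuation
        return '. '.join(p for p in parts if p)

    print(build_title([
        ('a', 'Kootut teokset.'),
        ('n', '3,'),
        ('p', 'Näytelmiä: Olviretki Schleusingenissä ; Leo ja Liisa ; Canzino ; Selman juonet ; Alma /'),
    ]))
    # Kootut teokset. 3. Näytelmiä: Olviretki Schleusingenissä ; Leo ja Liisa ; Canzino ; Selman juonet ; Alma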

Inconsistent whitespace in work keys

The current work keys code generates work keys with inconsistent whitespace:

<http://urn.fi/URN:NBN:fi:bib:fennica:006414953> <http://purl.org/dc/terms/identifier> "vänrikki stoolin tarinat kokoelma runoja /runeberg, johan ludvig" .
<http://urn.fi/URN:NBN:fi:bib:fennica:006417535> <http://purl.org/dc/terms/identifier> "vänrikki stoolin tarinat  kokoelma runoja/runeberg, johan ludvig" .

There is repeated whitespace (two adjacent spaces) as well as trailing whitespace in the title part.
This has to be fixed before #6 can be addressed.
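
A sketch of the intended normalization in Python, assuming the key format shown above (title and author joined with "/"); collapsing internal runs of whitespace and trimming around the separator makes the two keys above identical.

    import re

    def normalize_key(key):
        """Collapse runs of whitespace and trim around the '/' separator."""
        title, sep, author = key.partition('/')
        parts = [re.sub(r'\s+', ' ', part).strip() for part in (title, author)]
        return '/'.join(parts) if sep else parts[0]

    print(normalize_key('vänrikki stoolin tarinat kokoelma runoja /runeberg, johan ludvig'))
    print(normalize_key('vänrikki stoolin tarinat  kokoelma runoja/runeberg, johan ludvig'))
    # both print: vänrikki stoolin tarinat kokoelma runoja/runeberg, johan ludvig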

Avoid using work URIs generated from 700 fields, if possible

Currently the URI for a merged Work entity is selected by sorting the work URIs included in the bundle lexically, and taking the first one. This means that Works resulting from records with low ID numbers that contain 700 references (e.g. 000267913) will be preferred over higher-numbered records that are about the work itself.

The URI selection should be tuned so that work URIs generated from 700 references are avoided, if possible.
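
A minimal sketch of the tuned selection in Python, assuming that the 700 origin is visible in the URI local name (as in the marc2bibframe2-style URIs such as ...#Work765-40 mentioned elsewhere in these issues); the exact pattern to test for, and the example URIs themselves, are illustrative only.

    def pick_work_uri(uris):
        """Pick the canonical URI for a merged Work bundle.

        URIs that look like they were generated from a 700 field sort after
        all other URIs; ties are still broken lexically, as before.
        """
        return min(uris, key=lambda uri: ('#Work700-' in uri, uri))

    print(pick_work_uri([
        'http://urn.fi/URN:NBN:fi:bib:me:000267913#Work700-14',
        'http://urn.fi/URN:NBN:fi:bib:me:000523104#Work',
    ]))
    # http://urn.fi/URN:NBN:fi:bib:me:000523104#Work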

Filter prepublication records

We should filter (skip) records which have prepublication status, i.e. either of these is true:

  1. Leader position 17 "Encoding level" is 8 (prepublication)
  2. 500 note that starts with "ennakkotieto" (case-insensitive)

Need a unit test that checks for these.
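
A sketch of the check in Python, to pin down the intended rule for the unit test; how the leader and the 500 notes are pulled out of the Aleph sequential records is left out, and the example leader strings are made up.

    def is_prepublication(leader, notes_500):
        """True if the record has prepublication status.

        leader: the MARC leader as a string (position 17 = encoding level).
        notes_500: list of 500 $a note values.
        """
        if len(leader) > 17 and leader[17] == '8':
            return True
        return any(note.strip().lower().startswith('ennakkotieto')
                   for note in notes_500)

    print(is_prepublication('00000nam a22000008i 4500', []))                 # True
    print(is_prepublication('00000nam a22000001i 4500', ['ENNAKKOTIETO.']))  # True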

Make create-work-transformations transitive

As noted in #5, the create-work-transformations SPARQL query should consider indirect sharing of work keys. So e.g. if

workA hasKey "key1" .
workB hasKey "key1", "key2" .
workC hasKey "key2" .

=> A, B and C would all be collapsed into a single work.

Need a unit test to check for this.
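
To make the expected behavior concrete, here is a sketch of the transitive grouping as a union-find over works and keys, in Python rather than SPARQL; it is only a reference for the unit test, not a proposal to drop the SPARQL implementation.

    from collections import defaultdict

    def cluster_works(work_keys):
        """Group works that share work keys, directly or transitively.

        work_keys: dict mapping work id -> set of keys.
        Returns a list of sets of work ids.
        """
        parent = {}

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]   # path halving
                x = parent[x]
            return x

        def union(a, b):
            parent.setdefault(a, a)
            parent.setdefault(b, b)
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[ra] = rb

        for work, keys in work_keys.items():
            parent.setdefault(work, work)
            for key in keys:
                union(work, ('key', key))

        clusters = defaultdict(set)
        for work in work_keys:
            clusters[find(work)].add(work)
        return list(clusters.values())

    print(cluster_works({'workA': {'key1'},
                         'workB': {'key1', 'key2'},
                         'workC': {'key2'}}))
    # [{'workA', 'workB', 'workC'}]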

Work key generation breaks on record with duplicate 130 values

In a Melinda/Fennica record there are two 130 fields:

000614665 1300  L $$aSionin wirret (1802)
000614665 1300  L $$aSions sånger.

This breaks work key generation:

11:23:10 ERROR riot                 :: [line: 99, col: 21] {E201} Multiple children of property element
Failed to load data
../Makefile:72: recipe for target 'slices/sioninwirret-00061-work-keys.nt' failed

This is really a bug in the data, as 130 fields should not be repeated, but nevertheless, it shouldn't break the conversion pipeline.

Series should be considered Works too

Currently series are converted to schema:CreativeWorkSeries even though they are bf:Work instances in the BIBFRAME output. But there is no real separation between Works and Series. Some MARC records (e.g. 000075431) represent series, with other records (e.g. 000067680) being part of that series.

The work key calculation and merging already seems to work for pairs like these. The actual series work (from the first record above) is merged with the series work generated from the 810 field in the second record.

What needs to be fixed is that

  • the BF to Schema conversion needs to set schema:CreativeWork and bf:Work types for series generated from 810 or 830 fields (however, still keep schema:CreativeWorkSeries and optionally schema:Periodical types too)
  • series URIs should not use a separate type ID (S), but just use the normal Work URI scheme (W)

Express RDA content and carrier types in schema.org data

We have RDA content and carrier type information in the MARC records, expressed using official RDA terms (in Finnish). These should be propagated to the schema.org output.

Nowadays the official RDA vocabularies contain Finnish labels, so it should be fairly straightforward to map these to the RDA Vocabulary URIs, similar to what we do with YSA terms.

Not sure what the best way is to link to these URIs. The SchemaBibEx Content-Carrier wiki page recommends using additional types, but as these are SKOS concepts, I'm not sure that's the right way.
Another option would be to use the RDA unconstrained properties, e.g. rdau:P60049 "has content type" and rdau:P60048 "has carrier type".
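
A sketch of the unconstrained-property option in Python with rdflib (assumed to be available), mapping Finnish labels to RDA Registry term URIs. The label-to-URI table is illustrative only; the actual term IDs and labels have to be verified against the published RDA content and carrier type vocabularies, and whether the statement belongs on the Work or the Instance is left open.

    from rdflib import Graph, Namespace, URIRef

    RDAU = Namespace('http://rdaregistry.info/Elements/u/')

    # Illustrative entries only; verify IDs against the published vocabularies.
    CONTENT_TYPES = {'teksti': URIRef('http://rdaregistry.info/termList/RDAContentType/1020')}
    CARRIER_TYPES = {'nide': URIRef('http://rdaregistry.info/termList/RDACarrierType/1049')}

    def add_rda_type(graph, subject, kind, label):
        """Add an rdau:P60049/P60048 statement for a known Finnish RDA label."""
        table, prop = {'content': (CONTENT_TYPES, RDAU.P60049),
                       'carrier': (CARRIER_TYPES, RDAU.P60048)}[kind]
        uri = table.get(label.strip().lower())
        if uri is not None:
            graph.add((URIRef(subject), prop, uri))

    g = Graph()
    add_rda_type(g, 'http://urn.fi/URN:NBN:fi:bib:me:000000001#Work', 'content', 'teksti')
    add_rda_type(g, 'http://urn.fi/URN:NBN:fi:bib:me:000000001#Instance', 'carrier', 'nide')
    print(g.serialize(format='nt'))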

Switch to new marc2bibframe2 converter

Library of Congress and Index Data have released the new MARC to BIBFRAME2 converter:
https://github.com/lcnetdev/marc2bibframe2

The converter looks very promising and based on initial tests it would seem feasible to switch to using it instead of the old BIBFRAME1 converter marc2bibframe. Also marc2bibframe-wrapper is not needed if we run the converter using xsltproc, as the documentation suggests.

Consolidate subjects (schema:about) between works

Currently subjects get attached to some works, but not all. For example, for translated works, subjects are attached to the translation work but not the original work.

In the consolidate step subjects should be propagated between original and translated works, both ways. Also two different translations of the same work should end up with the same subjects.

Get missing original work titles (240) from 130, or 500 notes

Fennica has around 50000 records of translated works (with 041$h) without the original work title in a 240 field. The marc2bibframe converter does not create original work entities for these.

We should try to enrich the MARC records before the RDF conversion using the following pseudocode logic:

Preconditions: the record is a translation (has a 041$h subfield) AND there is no 240 field. Then:

  1. If there is a 130 field, copy its value to 240
  2. Else if there is a 500 note with the substring "alkuteos:" or "alkuteos :" (case-insensitive), then copy the remainder of the note to 240

Need a unit test and a few example records exercising the different cases.
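
A sketch of the pseudocode in Python, operating on a simplified record view (a dict of subfield value lists); the real enrichment would live in the Catmandu preprocessing, and the dict layout here is just an assumption for illustration.

    import re

    def derive_240(record):
        """Return a value for 240 $a per the logic above, or None.

        record: e.g. {'041h': ['fin'], '130a': [...], '500a': [...], '240a': [...]}
        """
        if not record.get('041h') or record.get('240a'):
            return None   # not a translation, or 240 already present
        if record.get('130a'):
            return record['130a'][0]
        for note in record.get('500a', []):
            m = re.search(r'alkuteos\s*:\s*(.+)', note, re.IGNORECASE)
            if m:
                return m.group(1).strip()
        return None

    print(derive_240({'041h': ['fin'],
                      '500a': ['Alkuteos: Seitsemän veljestä.']}))
    # Seitsemän veljestä.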

Split off reconciliation from Schema.org conversion

Currently the same SPARQL query (bf-to-schema.rq) is used both for converting from BIBFRAME to Schema.org and for reconciling entities. This is getting difficult to maintain. The reconciliation should be split into a separate SPARQL CONSTRUCT query. Doing it like this would also allow inserting some non-SPARQL cleanup/normalization code between these phases, if necessary.

Reconciliation unit tests have already been split off from the Schema.org unit tests.

Work key transformation is too slow

The fix for #35 caused a massive slowdown of the work transformation step. It used to take about 2 minutes with Fennica data; now it has been running for nearly 2 hours and has still not produced any results.

Probably the step has to be reimplemented in something other than SPARQL, because with SPARQL it's just unbearably slow and there is no obvious way of optimizing the create-work-transformations.rq query while keeping the current behavior.

Convert corporate name subjects (610)

Organizations / corporate names as subject (MARC field 610) are currently not being converted to BIBFRAME and therefore not to Schema.org either, e.g. in the ekumeeninen-00585 example record used in unit tests. Need to investigate.

Conversion to Schema.org is slow

Conversion from BIBFRAME to Schema.org used to take less than an hour for the whole Fennica data set, using all four cores on the linkeddata-kk VM. Now after the latest changes it takes 3 hours.

It's possible that this is inevitable, simply because the complexity of the bf-to-schema.rq query has increased. But it should be investigated; maybe there is a simple way of making it faster.

Distinguish different translations

Currently we merge Works that are actually different translations of the same work, because the work-key for a translated work includes only the original title and target language, but not the name of the translator.

For example, there are at least three Swedish translations of Aleksis Kivi's "Seitsemän veljestä", by Diktonius, Laurén and Warburton, and an abbreviated translation by Rostén. All these translations get merged to a single CreativeWork, which is wrong. They should be separated.

The problem is that we don't have very complete information about translators. However, if we just generate multiple work keys based on all contributors (MARC field 700), then the sets should hopefully not overlap and we should get the correct number of CreativeWorks in most cases (though there may be errors in the other contributors, for example preface authors).
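
A sketch of the multi-key idea in Python; the key format (original title, target language and contributor joined with "/") is only an assumption modelled on the existing work keys, and the contributor names in the example are reduced to surnames.

    def translation_work_keys(original_title, language, contributors):
        """Generate one work key per 700 contributor for a translated work.

        With no contributors, fall back to the current single key so that
        sparsely catalogued records behave as before.
        """
        base = f"{original_title.strip().lower()}/{language.strip().lower()}"
        if not contributors:
            return {base}
        return {f"{base}/{c.strip().lower()}" for c in contributors}

    print(translation_work_keys('Seitsemän veljestä', 'swe', ['Diktonius']))
    print(translation_work_keys('Seitsemän veljestä', 'swe', ['Warburton']))
    # {'seitsemän veljestä/swe/diktonius'}
    # {'seitsemän veljestä/swe/warburton'}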

Use more specific CreativeWork types

Currently all schema.org bibliographic entities are typed as schema:CreativeWork and instance level entities also as schema:Book. The latter is wrong in some cases. We should use the correct, more specific schema.org CreativeWork subtypes.

Uncertain or inferred years are not valid in Schema.org

The MARC records sometimes have uncertain or inferred years, e.g. 1984? or [1850], or other special values such as year ranges. These are not valid values for schema:datePublished. Should we dumb them down to a valid value or not? This would inevitably lose some precision.
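
If we do decide to dumb them down, one possible rule is sketched below in Python: take the first four-digit year found, and drop anything unusable. This is only an illustration of the trade-off, not a decision; for ranges it keeps the first year, which already loses precision.

    import re

    def schema_year(date_value):
        """Extract a plain year for schema:datePublished, or None."""
        m = re.search(r'\b(1[0-9]{3}|20[0-9]{2})\b', date_value)
        return m.group(1) if m else None

    for value in ['1984?', '[1850]', '1917-1918', 's.a.']:
        print(value, '->', schema_year(value))
    # 1984?     -> 1984
    # [1850]    -> 1850
    # 1917-1918 -> 1917
    # s.a.      -> None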

505 $u causes RDF/XML syntax breakage and subsequent errors

I get this riot error:

14:36:24 ERROR riot                 :: [line: 574716, col: 78] {E205} rdf:resource is not allowed as an element tag here.

for this data:

006835912 50580 L $$tViemäreiden sisäpuoliset saneerausmenetelmät : raportti /$$rKaunisto, Tuija ja$$rPelto-Huikko, Aino$$gs. 1-129.$$uhttp://urn.fi/URN:ISBN:978-951-633-128-0$$9FENNI<KEEP>

which was converted into this RDF/XML:

   <bf:Work xmlns:bf="http://bibframe.org/vocab/"
            rdf:about="http://urn.fi/URN:NBN:fi:bib:fennica:006835912work28">
      <bf:creator rdf:resource="http://urn.fi/URN:NBN:fi:bib:fennica:006835912agent30"/>
      <bf:creator rdf:resource="http://urn.fi/URN:NBN:fi:bib:fennica:006835912agent31"/>
      <rdf:resource rdf:resource="http://urn.fi/URN:ISBN:978-951-633-128-0"/>
      <bf:title>Viemäreiden sisäpuoliset saneerausmenetelmät : raportti</bf:title>
   </bf:Work>

The problem is that rdf:resource is used as an XML element name, i.e. as an RDF property, and Jena doesn't like that. It's really a bug in the marc2bibframe converter (see this code), but I don't expect anyone to fix that. Instead, a little post-processing of the marc2bibframe output could provide a stopgap so that further processing steps (work-keys and schema) don't break because of this.
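
A sketch of such a stopgap in Python: a line filter that simply drops the offending rdf:resource property elements from the converter output before Jena parses it (losing the 505 $u link in the process). It assumes the converter always emits these as one-line empty elements, as in the snippet above; anything less regular would need a real XML pass.

    import re
    import sys

    # Drop invalid property elements named rdf:resource (see issue above).
    BAD_ELEMENT = re.compile(r'^\s*<rdf:resource\s+rdf:resource="[^"]*"\s*/>\s*$')

    for line in sys.stdin:
        if not BAD_ELEMENT.match(line):
            sys.stdout.write(line)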

Cache hdt-cpp and hdt-java dependencies

Currently we have to rebuild hdt-cpp and hdt-java in Travis CI every time we run tests, which takes more than 3 minutes. If we could cache the git source trees, that would save a lot of time, while still staying up to date with the current hdt-cpp and hdt-java versions.

Rewrite marc2bibframe2-generated URIs

I'm not 100% happy with the URIs that marc2bibframe2 generates, e.g. http://urn.fi/URN:NBN:fi:bib:me:000095841#Instance and http://urn.fi/URN:NBN:fi:bib:me:000146854#Work765-40. They provide uniqueness and relative stability, but they are rather long and also sometimes confusing, e.g. xxxx#Work830-xx is used for what is essentially a Series entity. Also the sequence numbers (position within MARC record) are fragile - inserting or deleting an otherwise unrelated field in the middle of a record will change them.

The new URIs should consist of:

  • a base URI, just like currently
  • a localname, consisting of:
    • Single letter identifying the type of entity (W, I, S, A = Work, Instance, Series or Agent)
    • the record number from which the entity was generated
    • two-digit sequence number distinguishing multiple entities of the same type from the same record:
      • 00 = the Work or Instance representing the whole record (i.e. what is currently just xxxx#Work or xxxx#Instance)
      • 01, 02, 03... = sequential numbering of entities of the same type from the same record, with 01 indicating the first such entity from the record, 02 the second etc.

The URI rewriting must happen after all the merging has been done, since only then do we have an accurate global view of all the URIs that exist within the dataset.
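
A sketch of the local name construction in Python, following the scheme above; whether the three parts are separated by anything is not specified here, so they are simply concatenated, and the base URI is taken from the examples at the top of this issue.

    def new_local_name(entity_type, record_number, ordinal=0):
        """Build a local name per the scheme above, e.g. W00009584100.

        entity_type: 'W', 'I', 'S' or 'A'.
        record_number: the source record number, e.g. '000095841'.
        ordinal: 0 for the entity representing the whole record,
                 1, 2, 3... for further entities of the same type.
        """
        assert entity_type in ('W', 'I', 'S', 'A')
        return f"{entity_type}{record_number}{ordinal:02d}"

    BASE = 'http://urn.fi/URN:NBN:fi:bib:me:'
    print(BASE + new_local_name('W', '000095841'))      # ...:W00009584100
    print(BASE + new_local_name('I', '000146854', 3))   # ...:I00014685403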

Strip birth/death years from recent authors

Due to data protection concerns, we should strip personal information, i.e. birth and death years, from living or recently deceased authors and contributors (i.e. people mentioned in 100, 600 and 700 fields). Even people with identical names can still be uniquely identified once we get the people identifiers in place (#18).

According to our lawyer, personal information of people who have been dead for 25 years is not protected in Finland (currently - this is going to change in 2018 with the new EU data protection directive, which limits data protection to the living only). Based on that, we could leave the information in place for those people who have died before 1990, or if the death date is unknown, those born (or otherwise known to have lived) before 1870.

In practice this could probably be implemented as a Fix script for Catmandu.
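
A sketch of the decision rule in Python (the eventual implementation would likely be a Catmandu Fix, as noted above); the 1990/1870 cutoffs come from this issue but are parameterized, since the rule will shift over time.

    def may_keep_life_years(birth_year=None, death_year=None,
                            death_cutoff=1990, birth_cutoff=1870):
        """True if birth/death years may be kept in the published data.

        Keep the years for people who died before `death_cutoff`, or, if the
        death year is unknown, for people born before `birth_cutoff`.
        """
        if death_year is not None:
            return death_year < death_cutoff
        if birth_year is not None:
            return birth_year < birth_cutoff
        return False   # no dates at all, so there is nothing to keep or strip

    print(may_keep_life_years(birth_year=1834, death_year=1872))   # True
    print(may_keep_life_years(birth_year=1948))                    # False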

Prefer work URIs generated from 765 field over others (e.g. from 600)

A record generally has only one 765 field, though there are some exceptions. Work URIs generated from 765 fields are thus "more unique" than ones generated from e.g. 600 fields, of which there may be many in a record. Such 765-generated work URIs should be preferred when selecting a work URI for a merged work cluster (but not over original work URIs generated for the record itself).

Travis builds error a lot

About 1 in 2 Travis builds fails with an error claiming 10 minutes of inactivity. This should be investigated.

Add bf:Work and bf:Instance types

It's a bit of a step backwards for Schema.org conversion, but I think it would make sense to add additional types for Works and Instances, since schema:CreativeWork can be used to represent both.
Since BIBFRAME 2.0 is the current best source for Work/Instance types, I think bf:Work and bf:Instance could be added as types even during/after Schema.org conversion.

Reconcile people with identifiers

Currently people mentioned in e.g. 100 and 700 fields are converted to resources by marc2bibframe and there is no reconciliation, so there is lots of duplication even from within the same MARC record. We should instead link to URIs for people in the output.

The problem is that currently we don't have the person authority record IDs in the bibliographic records. This is being worked on on the Aleph side. Probably it makes sense to wait for that instead of doing our own reconciliation.

Split Aleph sequence preprocessing from MARCXML conversion

Currently the whole chain from (split) Aleph sequence to MARCXML is handled by one target in the Makefile, mrcx. This works fine for now, but makes it difficult to debug the individual substeps or to add new substeps into the process.

The alephseq preprocessing should be split into a separate step, so that the mrcx step only involves running Catmandu with suitable options and Fix scripts.

Deal with incorrect RDA Carrier categories

There are several supposed RDA Carrier categories in the metadata that don't actually match the official values. See the breakdown in #15. For example, "Digitaalinen jäljenne" does not exist in RDA.

Slicing fails if input file has changed

If the input file (input/*.alephseq) changes, running make slice does not always produce a good result. For example, changing the first record in the unit test input test/input/slice.alephseq will produce a file that contains both the old and new versions of that record. This is related to the way small slices are merged into larger batches for efficiency.

Also need unit tests to guard against problems like this.
