natlibfi / bib-rdf-pipeline

Scripts and configuration for converting MARC bibliographic records into RDF

License: Creative Commons Zero v1.0 Universal

Topics: marc, rdf, conversion, code4lib
bib-rdf-pipeline's Introduction

bib-rdf-pipeline

This repository contains various scripts and configuration for converting MARC bibliographic records into RDF, for use at the National Library of Finland.

The main component is a conversion pipeline driven by a Makefile that defines rules for realizing the conversion steps using command line tools.

The steps of the conversion are:

  1. Start with a file of MARC records in Aleph sequential format
  2. Split the file into smaller batches
  3. Preprocess using unix tools such as grep and sed, to remove some local peculiarities
  4. Convert to MARCXML and enrich the MARC records, using Catmandu
  5. Run the Library of Congress marc2bibframe2 XSLT conversion from MARC to BIBFRAME RDF
  6. Convert the BIBFRAME RDF/XML data into N-Triples format and fix up some bad URIs
  7. Calculate work keys (e.g. author+title combination) used later for merging data about the same creative work
  8. Convert the BIBFRAME data into Schema.org RDF in N-Triples format
  9. Reconcile entities in the Schema.org data against external sources (e.g. YSA/YSO, Corporate names authority, RDA vocabularies)
  10. Merge the Schema.org data about the same works
  11. Calculate agent keys used for merging data about the same agent (person or organization)
  12. Merge the agents based on agent keys
  13. Convert the raw Schema.org data to HDT format so the full data set can be queried with SPARQL from the command line
  14. Consolidate the data by e.g. rewriting URIs and moving subjects into the original work
  15. Convert the consolidated data to HDT
  16. ??? (TBD)
  17. Profit!

Dependencies

Command line tools are assumed to be available in $PATH, but the paths can be overridden on the make command line, e.g. make CATMANDU=/opt/catmandu

For running the main suite

  • Apache Jena command line utilities sparql and rsparql
  • Catmandu utility catmandu
  • uconv utility from Ubuntu package icu-devtools
  • xsltproc utility from Ubuntu package xsltproc
  • hdt-cpp command line utilities rdf2hdt and hdtSearch
  • hdt-java command line utility hdtsparql.sh

For running the unit tests

In addition to the above:

  • bats in $PATH
  • xmllint utility from Ubuntu package libxml2-utils in $PATH

bib-rdf-pipeline's People

Contributors: osma

bib-rdf-pipeline's Issues

Structured page counts are not valid in schema.org

Our MARC records have structured page counts, e.g. "vii, 89, 31 s.". However, Schema.org only defines a single integer field, schema:numberOfPages, so the structured values are not really valid Schema.org.

Maybe we should convert those structured counts into a single number? It can't be done easily in SPARQL (Roman numerals!), but a relatively simple filter script (e.g. in Python) could do it.
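
The issue suggests a simple filter script; a minimal sketch of one in Python follows. The exact normalization rules (summing the comma-separated parts, reading Roman numerals, ignoring anything unparseable) are assumptions, not something this issue has decided on.

    import re

    ROMAN = {'i': 1, 'v': 5, 'x': 10, 'l': 50, 'c': 100, 'd': 500, 'm': 1000}

    def roman_to_int(numeral):
        """Convert a lowercase Roman numeral such as 'vii' to an integer."""
        total, prev = 0, 0
        for char in reversed(numeral):
            value = ROMAN[char]
            total += value if value >= prev else -value
            prev = max(prev, value)
        return total

    def page_count(statement):
        """Sum the parts of a structured page count like 'vii, 89, 31 s.'."""
        total, found = 0, False
        for part in statement.split(','):
            token = part.strip().lower()
            m = re.match(r'(\d+)', token)
            if m:
                total += int(m.group(1))
                found = True
                continue
            m = re.match(r'([ivxlcdm]+)\b', token)
            if m:
                total += roman_to_int(m.group(1))
                found = True
        return total if found else None

    print(page_count('vii, 89, 31 s.'))   # -> 127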

Include part information in schema.org titles

For records whose title has numbered parts (245 $n / $p), the part information should be included in the schema:name. Currently it is lost. E.g. this one:

000770276 24510 L $$aKootut teokset.$$n3,$$pNäytelmiä: Olviretki Schleusingenissä ; Leo ja Liisa ; Canzino ; Selman juonet ; Alma /$$cAleksis Kivi ; julk. E. A. Saarimaa.

The part information is already used for the work keys.
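
A minimal sketch of the intended schema:name construction, in Python. The plain (code, value) representation of the 245 subfields and the ". " joining punctuation are assumptions for illustration; the real fix belongs in the BIBFRAME-to-Schema.org conversion.

    def build_title(subfields):
        """Build a schema:name value from 245 $a/$n/$p subfields.

        `subfields` is a list of (code, value) pairs in field order.
        """
        parts = [value.strip() for code, value in subfields if code in ('a', 'n', 'p')]
        parts = [p.rstrip(' /,.:;') for p in parts]   # drop trailing ISBD punctuation
        return '. '.join(p for p in parts if p)

    print(build_title([
        ('a', 'Kootut teokset.'),
        ('n', '3,'),
        ('p', 'Näytelmiä: Olviretki Schleusingenissä ; Leo ja Liisa ; Canzino ; Selman juonet ; Alma /'),
    ]))
    # Kootut teokset. 3. Näytelmiä: Olviretki Schleusingenissä ; Leo ja Liisa ; Canzino ; Selman juonet ; Alma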

Inconsistent whitespace in work keys

The current work keys code generates work keys with inconsistent whitespace:

<http://urn.fi/URN:NBN:fi:bib:fennica:006414953> <http://purl.org/dc/terms/identifier> "vänrikki stoolin tarinat kokoelma runoja /runeberg, johan ludvig" .
<http://urn.fi/URN:NBN:fi:bib:fennica:006417535> <http://purl.org/dc/terms/identifier> "vänrikki stoolin tarinat  kokoelma runoja/runeberg, johan ludvig" .

There is repeated whitespace (two adjacent spaces) as well as trailing whitespace in the title part.
This has to be fixed before #6 can be addressed.
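
A sketch of the intended normalization in Python, assuming the key format shown above (title and author joined with "/"); collapsing internal runs of whitespace and trimming around the separator makes the two keys above identical.

    import re

    def normalize_key(key):
        """Collapse runs of whitespace and trim around the '/' separator."""
        title, sep, author = key.partition('/')
        parts = [re.sub(r'\s+', ' ', part).strip() for part in (title, author)]
        return '/'.join(parts) if sep else parts[0]

    print(normalize_key('vänrikki stoolin tarinat kokoelma runoja /runeberg, johan ludvig'))
    print(normalize_key('vänrikki stoolin tarinat  kokoelma runoja/runeberg, johan ludvig'))
    # both print: vänrikki stoolin tarinat kokoelma runoja/runeberg, johan ludvig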

Avoid using work URIs generated from 700 fields, if possible

Currently the URI for a merged Work entity is selected by sorting the work URIs included in the bundle lexically, and taking the first one. This means that Works resulting from records with low ID numbers that contain 700 references (e.g. 000267913) will be preferred over higher-numbered records that are about the work itself.

The URI selection should be tuned so that work URIs generated from 700 references are avoided, if possible.
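
A minimal sketch of the tuned selection in Python, assuming that the 700 origin is visible in the URI local name (as in the marc2bibframe2-style URIs such as ...#Work765-40 mentioned elsewhere in these issues); the exact pattern to test for, and the example URIs themselves, are illustrative only.

    def pick_work_uri(uris):
        """Pick the canonical URI for a merged Work bundle.

        URIs that look like they were generated from a 700 field sort after
        all other URIs; ties are still broken lexically, as before.
        """
        return min(uris, key=lambda uri: ('#Work700-' in uri, uri))

    print(pick_work_uri([
        'http://urn.fi/URN:NBN:fi:bib:me:000267913#Work700-14',
        'http://urn.fi/URN:NBN:fi:bib:me:000523104#Work',
    ]))
    # http://urn.fi/URN:NBN:fi:bib:me:000523104#Work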

Filter prepublication records

We should filter (skip) records which have prepublication status, i.e. either of these is true:

  1. Leader position 17 "Encoding level" is 8 (prepublication)
  2. 500 note that starts with "ennakkotieto" (case-insensitive)

Need a unit test that checks for these.
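
A sketch of the check in Python, to pin down the intended rule for the unit test; how the leader and the 500 notes are pulled out of the Aleph sequential records is left out, and the example leader strings are made up.

    def is_prepublication(leader, notes_500):
        """True if the record has prepublication status.

        leader: the MARC leader as a string (position 17 = encoding level).
        notes_500: list of 500 $a note values.
        """
        if len(leader) > 17 and leader[17] == '8':
            return True
        return any(note.strip().lower().startswith('ennakkotieto')
                   for note in notes_500)

    print(is_prepublication('00000nam a22000008i 4500', []))                 # True
    print(is_prepublication('00000nam a22000001i 4500', ['ENNAKKOTIETO.']))  # True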

Make create-work-transformations transitive

As noted in #5, the create-work-transformations SPARQL query should consider indirect sharing of work keys. So e.g. if

workA hasKey "key1" .
workB hasKey "key1", "key2" .
workC hasKey "key2" .

=> A, B and C would all be collapsed into a single work.

Need a unit test to check for this.
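
To make the expected behavior concrete, here is a sketch of the transitive grouping as a union-find over works and keys, in Python rather than SPARQL; it is only a reference for the unit test, not a proposal to drop the SPARQL implementation.

    from collections import defaultdict

    def cluster_works(work_keys):
        """Group works that share work keys, directly or transitively.

        work_keys: dict mapping work id -> set of keys.
        Returns a list of sets of work ids.
        """
        parent = {}

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]   # path halving
                x = parent[x]
            return x

        def union(a, b):
            parent.setdefault(a, a)
            parent.setdefault(b, b)
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[ra] = rb

        for work, keys in work_keys.items():
            parent.setdefault(work, work)
            for key in keys:
                union(work, ('key', key))

        clusters = defaultdict(set)
        for work in work_keys:
            clusters[find(work)].add(work)
        return list(clusters.values())

    print(cluster_works({'workA': {'key1'},
                         'workB': {'key1', 'key2'},
                         'workC': {'key2'}}))
    # [{'workA', 'workB', 'workC'}]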

Work key generation breaks on record with duplicate 130 values

In a Melinda/Fennica record there are two 130 fields:

000614665 1300  L $$aSionin wirret (1802)
000614665 1300  L $$aSions sånger.

This breaks work key generation:

11:23:10 ERROR riot                 :: [line: 99, col: 21] {E201} Multiple children of property element
Failed to load data
../Makefile:72: recipe for target 'slices/sioninwirret-00061-work-keys.nt' failed

This is really a bug in the data, as 130 fields should not be repeated, but nevertheless, it shouldn't break the conversion pipeline.

Series should be considered Works too

Currently series are converted to schema:CreativeWorkSeries even though they are bf:Work instances in the BIBFRAME output. But there is no real separation between Works and Series. Some MARC records (e.g. 000075431) represent series, with other records (e.g. 000067680) being part of that series.

The work key calculation and merging already seems to work for pairs like these. The actual series work (from the first record above) is merged with the series work generated from the 810 field in the second record.

What needs to be fixed is that

  • the BF to Schema conversion needs to set schema:CreativeWork and bf:Work types for series generated from 810 or 830 fields (however, still keep schema:CreativeWorkSeries and optionally schema:Periodical types too)
  • series URIs should not use a separate type ID (S), but just use the normal Work URI scheme (W)

Express RDA content and carrier types in schema.org data

We have RDA content and carrier type information in the MARC records, expressed using official RDA terms (in Finnish). These should be propagated to the schema.org output.

Nowadays the official RDA vocabularies contain Finnish labels, so it should be fairly straightforward to map these to the RDA Vocabulary URIs, similar to what we do with YSA terms.

Not sure what the best way is to link to these URIs. The SchemaBibEx Content-Carrier wiki page recommends using additional types, but as these are SKOS concepts, I'm not sure that's the right way.
Another option would be to use the RDA unconstrained properties, e.g. rdau:P60049 "has content type" and rdau:P60048 "has carrier type".
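
A sketch of the unconstrained-property option in Python with rdflib (assumed to be available), mapping Finnish labels to RDA Registry term URIs. The label-to-URI table is illustrative only; the actual term IDs and labels have to be verified against the published RDA content and carrier type vocabularies, and whether the statement belongs on the Work or the Instance is left open.

    from rdflib import Graph, Namespace, URIRef

    RDAU = Namespace('http://rdaregistry.info/Elements/u/')

    # Illustrative entries only; verify IDs against the published vocabularies.
    CONTENT_TYPES = {'teksti': URIRef('http://rdaregistry.info/termList/RDAContentType/1020')}
    CARRIER_TYPES = {'nide': URIRef('http://rdaregistry.info/termList/RDACarrierType/1049')}

    def add_rda_type(graph, subject, kind, label):
        """Add an rdau:P60049/P60048 statement for a known Finnish RDA label."""
        table, prop = {'content': (CONTENT_TYPES, RDAU.P60049),
                       'carrier': (CARRIER_TYPES, RDAU.P60048)}[kind]
        uri = table.get(label.strip().lower())
        if uri is not None:
            graph.add((URIRef(subject), prop, uri))

    g = Graph()
    add_rda_type(g, 'http://urn.fi/URN:NBN:fi:bib:me:000000001#Work', 'content', 'teksti')
    add_rda_type(g, 'http://urn.fi/URN:NBN:fi:bib:me:000000001#Instance', 'carrier', 'nide')
    print(g.serialize(format='nt'))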

Switch to new marc2bibframe2 converter

Library of Congress and Index Data have released the new MARC to BIBFRAME2 converter:
https://github.com/lcnetdev/marc2bibframe2

The converter looks very promising and based on initial tests it would seem feasible to switch to using it instead of the old BIBFRAME1 converter marc2bibframe. Also marc2bibframe-wrapper is not needed if we run the converter using xsltproc, as the documentation suggests.

Consolidate subjects (schema:about) between works

Currently subjects get attached to some works, but not all. For example, for translated works, subjects are attached to the translation work but not the original work.

In the consolidate step subjects should be propagated between original and translated works, both ways. Also two different translations of the same work should end up with the same subjects.

Get missing original work titles (240) from 130, or 500 notes

Fennica has around 50000 records of translated works (with 041$h) without the original work title in a 240 field. The marc2bibframe converter does not create original work entities for these.

We should try to enrich the MARC records before the RDF conversion using the following pseudocode logic:

Preconditions: the record is a translation (has a 041$h subfield) AND there is no 240 field. Then:

  1. If there is a 130 field, copy its value to 240
  2. Else if there is a 500 note with the substring "alkuteos:" or "alkuteos :" (case-insensitive), then copy the remainder of the note to 240

Need a unit test and a few example records exercising the different cases.
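
A sketch of the pseudocode in Python, operating on a simplified record view (a dict of subfield value lists); the real enrichment would live in the Catmandu preprocessing, and the dict layout here is just an assumption for illustration.

    import re

    def derive_240(record):
        """Return a value for 240 $a per the logic above, or None.

        record: e.g. {'041h': ['fin'], '130a': [...], '500a': [...], '240a': [...]}
        """
        if not record.get('041h') or record.get('240a'):
            return None   # not a translation, or 240 already present
        if record.get('130a'):
            return record['130a'][0]
        for note in record.get('500a', []):
            m = re.search(r'alkuteos\s*:\s*(.+)', note, re.IGNORECASE)
            if m:
                return m.group(1).strip()
        return None

    print(derive_240({'041h': ['fin'],
                      '500a': ['Alkuteos: Seitsemän veljestä.']}))
    # Seitsemän veljestä.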

Split off reconciliation from Schema.org conversion

Currently the same SPARQL query (bf-to-schema.rq) is used both for converting from BIBFRAME to Schema.org and for reconciling entities. This is getting difficult to maintain. The reconciliation should be split into a separate SPARQL CONSTRUCT query. Doing it like this would also allow inserting some non-SPARQL cleanup/normalization code between these phases, if necessary.

Reconciliation unit tests have already been split off from the Schema.org unit tests.

Work key transformation is too slow

The fix for #35 caused a massive slowdown of the work transformation step. It used to take about 2 minutes with Fennica data; now it has been running for nearly 2 hours and has still not produced any results.

Probably the step has to be reimplemented in something other than SPARQL, because with SPARQL it's just unbearably slow and there is no obvious way of optimizing the create-work-transformations.rq query while keeping the current behavior.

Convert corporate name subjects (610)

Organizations / corporate names as subject (MARC field 610) are currently not being converted to BIBFRAME and therefore not to Schema.org either, e.g. in the ekumeeninen-00585 example record used in unit tests. Need to investigate.

Conversion to Schema.org is slow

Conversion from BIBFRAME to Schema.org used to take less than an hour for the whole Fennica data set, using all four cores on the linkeddata-kk VM. Now after the latest changes it takes 3 hours.

It's possible that this is inevitable, simply because the complexity of the bf-to-schema.rq query has increased. But it should be investigated; maybe there is a simple way of making it faster.

Distinguish different translations

Currently we merge Works that are actually different translations of the same work, because the work-key for a translated work includes only the original title and target language, but not the name of the translator.

For example, there are at least three Swedish translations of Aleksis Kivi's "Seitsemän veljestä", by Diktonius, Laurén and Warburton, and an abbreviated translation by Rostén. All these translations get merged to a single CreativeWork, which is wrong. They should be separated.

The problem is that we don't have very complete information about translators. However, if we just generate multiple work keys based on all contributors (MARC field 700), then the sets should hopefully not overlap and we should get the correct number of CreativeWorks in most cases (though there may be errors in the other contributors, for example preface authors).
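
A sketch of the multi-key idea in Python; the key format (original title, target language and contributor joined with "/") is only an assumption modelled on the existing work keys, and the contributor names in the example are reduced to surnames.

    def translation_work_keys(original_title, language, contributors):
        """Generate one work key per 700 contributor for a translated work.

        With no contributors, fall back to the current single key so that
        sparsely catalogued records behave as before.
        """
        base = f"{original_title.strip().lower()}/{language.strip().lower()}"
        if not contributors:
            return {base}
        return {f"{base}/{c.strip().lower()}" for c in contributors}

    print(translation_work_keys('Seitsemän veljestä', 'swe', ['Diktonius']))
    print(translation_work_keys('Seitsemän veljestä', 'swe', ['Warburton']))
    # {'seitsemän veljestä/swe/diktonius'}
    # {'seitsemän veljestä/swe/warburton'}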

Use more specific CreativeWork types

Currently all schema.org bibliographic entities are typed as schema:CreativeWork and instance level entities also as schema:Book. The latter is wrong in some cases. We should use the correct, more specific schema.org CreativeWork subtypes.

Uncertain or inferred years are not valid in Schema.org

The MARC records sometimes have uncertain or inferred years, e.g. 1984? or [1850], or other special values such as year ranges. These are not valid values for schema:datePublished. Should we dumb them down to a valid value or not? This would inevitably lose some precision.
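
If we do decide to dumb them down, one possible rule is sketched below in Python: take the first four-digit year found, and drop anything unusable. This is only an illustration of the trade-off, not a decision; for ranges it keeps the first year, which already loses precision.

    import re

    def schema_year(date_value):
        """Extract a plain year for schema:datePublished, or None."""
        m = re.search(r'\b(1[0-9]{3}|20[0-9]{2})\b', date_value)
        return m.group(1) if m else None

    for value in ['1984?', '[1850]', '1917-1918', 's.a.']:
        print(value, '->', schema_year(value))
    # 1984?     -> 1984
    # [1850]    -> 1850
    # 1917-1918 -> 1917
    # s.a.      -> None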

505 $u causes RDF/XML syntax breakage and subsequent errors

I get this riot error:

14:36:24 ERROR riot                 :: [line: 574716, col: 78] {E205} rdf:resource is not allowed as an element tag here.

for this data:

006835912 50580 L $$tViemäreiden sisäpuoliset saneerausmenetelmät : raportti /$$rKaunisto, Tuija ja$$rPelto-Huikko, Aino$$gs. 1-129.$$uhttp://urn.fi/URN:ISBN:978-951-633-128-0$$9FENNI<KEEP>

which was converted into this RDF/XML:

   <bf:Work xmlns:bf="http://bibframe.org/vocab/"
            rdf:about="http://urn.fi/URN:NBN:fi:bib:fennica:006835912work28">
      <bf:creator rdf:resource="http://urn.fi/URN:NBN:fi:bib:fennica:006835912agent30"/>
      <bf:creator rdf:resource="http://urn.fi/URN:NBN:fi:bib:fennica:006835912agent31"/>
      <rdf:resource rdf:resource="http://urn.fi/URN:ISBN:978-951-633-128-0"/>
      <bf:title>Viemäreiden sisäpuoliset saneerausmenetelmät : raportti</bf:title>
   </bf:Work>

The problem is that rdf:resource is used as an XML element name, i.e. as an RDF property, and Jena doesn't like that. It's really a bug in the marc2bibframe converter (see this code), but I don't expect anyone to fix that. Instead, a little post-processing of the marc2bibframe output could provide a stopgap so that further processing steps (work-keys and schema) don't break because of this.
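
A sketch of such a stopgap in Python: a line filter that simply drops the offending rdf:resource property elements from the converter output before Jena parses it (losing the 505 $u link in the process). It assumes the converter always emits these as one-line empty elements, as in the snippet above; anything less regular would need a real XML pass.

    import re
    import sys

    # Drop invalid property elements named rdf:resource (see issue above).
    BAD_ELEMENT = re.compile(r'^\s*<rdf:resource\s+rdf:resource="[^"]*"\s*/>\s*$')

    for line in sys.stdin:
        if not BAD_ELEMENT.match(line):
            sys.stdout.write(line)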

Cache hdt-cpp and hdt-java dependencies

Currently we have to rebuild hdt-cpp and hdt-java in Travis CI every time we run tests, which takes more than 3 minutes. If we could cache the git source trees, that would save a lot of time, while still staying up to date with the current hdt-cpp and hdt-java versions.

Rewrite marc2bibframe2-generated URIs

I'm not 100% happy with the URIs that marc2bibframe2 generates, e.g. http://urn.fi/URN:NBN:fi:bib:me:000095841#Instance and http://urn.fi/URN:NBN:fi:bib:me:000146854#Work765-40. They provide uniqueness and relative stability, but they are rather long and also sometimes confusing, e.g. xxxx#Work830-xx is used for what is essentially a Series entity. Also the sequence numbers (position within MARC record) are fragile - inserting or deleting an otherwise unrelated field in the middle of a record will change them.

The new URIs should consist of:

  • a base URI, just like currently
  • a localname, consisting of:
    • Single letter identifying the type of entity (W, I, S, A = Work, Instance, Series or Agent)
    • the record number from which the entity was generated
    • two-digit sequence number distinguishing multiple entities of the same type from the same record:
      • 00 = the Work or Instance representing the whole record (i.e. what is currently just xxxx#Work or xxxx#Instance)
      • 01, 02, 03... = sequential numbering of entities of the same type from the same record, with 01 indicating the first such entity from the record, 02 the second etc.

The URI rewriting must happen after all the merging has been done, since only then do we have an accurate global view of all the URIs that exist within the dataset.
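
A sketch of the local name construction in Python, following the scheme above; whether the three parts are separated by anything is not specified here, so they are simply concatenated, and the base URI is taken from the examples at the top of this issue.

    def new_local_name(entity_type, record_number, ordinal=0):
        """Build a local name per the scheme above, e.g. W00009584100.

        entity_type: 'W', 'I', 'S' or 'A'.
        record_number: the source record number, e.g. '000095841'.
        ordinal: 0 for the entity representing the whole record,
                 1, 2, 3... for further entities of the same type.
        """
        assert entity_type in ('W', 'I', 'S', 'A')
        return f"{entity_type}{record_number}{ordinal:02d}"

    BASE = 'http://urn.fi/URN:NBN:fi:bib:me:'
    print(BASE + new_local_name('W', '000095841'))      # ...:W00009584100
    print(BASE + new_local_name('I', '000146854', 3))   # ...:I00014685403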

Strip birth/death years from recent authors

Due to data protection concerns, we should strip personal information, i.e. birth and death years, from living or recently deceased authors and contributors (i.e. people mentioned in 100, 600 and 700 fields). Even people with identical names can still be uniquely identified once we get the people identifiers in place (#18).

According to our lawyer, personal information of people who have been dead for 25 years is not protected in Finland (currently - this is going to change in 2018 with the new EU data protection directive, which limits data protection to the living only). Based on that, we could leave the information in place for those people who have died before 1990, or if the death date is unknown, those born (or otherwise known to have lived) before 1870.

In practice this could probably be implemented as a Fix script for Catmandu.
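
A sketch of the decision rule in Python (the eventual implementation would likely be a Catmandu Fix, as noted above); the 1990/1870 cutoffs come from this issue but are parameterized, since the rule will shift over time.

    def may_keep_life_years(birth_year=None, death_year=None,
                            death_cutoff=1990, birth_cutoff=1870):
        """True if birth/death years may be kept in the published data.

        Keep the years for people who died before `death_cutoff`, or, if the
        death year is unknown, for people born before `birth_cutoff`.
        """
        if death_year is not None:
            return death_year < death_cutoff
        if birth_year is not None:
            return birth_year < birth_cutoff
        return False   # no dates at all, so there is nothing to keep or strip

    print(may_keep_life_years(birth_year=1834, death_year=1872))   # True
    print(may_keep_life_years(birth_year=1948))                    # False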

Prefer work URIs generated from 765 field over others (e.g. from 600)

A record generally has only one 765 field, though there are some exceptions. Work URIs generated from 765 fields are thus "more unique" than ones generated from e.g. 600 fields, of which there may be many in a record. Such 765-generated work URIs should be preferred when selecting a work URI for a merged work cluster (but not over original work URIs generated for the record itself).

Travis builds error a lot

About 1 in 2 Travis builds fails with an error claiming 10 minutes of inactivity. This should be investigated.

Add bf:Work and bf:Instance types

It's a bit of a step backwards for Schema.org conversion, but I think it would make sense to add additional types for Works and Instances, since schema:CreativeWork can be used to represent both.
Since BIBFRAME 2.0 is the current best source for Work/Instance types, I think bf:Work and bf:Instance could be added as types even during/after Schema.org conversion.

Reconcile people with identifiers

Currently people mentioned in e.g. 100 and 700 fields are converted to resources by marc2bibframe and there is no reconciliation, so there is lots of duplication even from within the same MARC record. We should instead link to URIs for people in the output.

The problem is that currently we don't have the person authority record IDs in the bibliographic records. This is being worked on on the Aleph side. Probably it makes sense to wait for that instead of doing our own reconciliation.

Split Aleph sequence preprocessing from MARCXML conversion

Currently the whole chain from (split) Aleph sequence to MARCXML is handled by one target in the Makefile, mrcx. This works fine for now, but makes it difficult to debug the individual substeps or to add new substeps into the process.

The alephseq preprocessing should be split into a separate step, so that the mrcx step only involves running Catmandu with suitable options and Fix scripts.

Deal with incorrect RDA Carrier categories

There are several supposed RDA Carrier categories in the metadata that don't actually match the official values. See the breakdown in #15. For example, "Digitaalinen jäljenne" does not exist in RDA.

Slicing fails if input file has changed

If the input file (input/*.alephseq) changes, running make slice does not always produce a good result. For example, changing the first record in the unit test input test/input/slice.alephseq will produce a file that contains both the old and new versions of that record. This is related to the way small slices are merged into larger batches for efficiency.

Also need unit tests to guard against problems like this.
