GithubHelp home page GithubHelp logo

00mjk / lobid-resources Goto Github PK

View Code? Open in Web Editor NEW

This project forked from hbz/lobid-resources

0.0 0.0 0.0 40.68 MB

Transformation, web frontend, and API 2.0 for the hbz catalog as LOD

Home Page: http://lobid.org/resources

License: Eclipse Public License 2.0

Shell 2.83% Java 76.90% HTML 15.86% Scala 0.50% JavaScript 0.79% CSS 3.12%

lobid-resources's Introduction

About

Transform MAB-XML to JSON for Elasticsearch indexing with Metafacture, serve API and UI with Play Framework.

This repo replaces the lobid-resources part of https://github.com/lobid/lodmill.

For information about the Lobid architecture and development process, see http://hbz.github.io/#lobid.

Build

Prerequisites: Java 8, Maven 3; verify with mvn -version

Create and change into a folder where you want to store the projects:

  • mkdir ~/git ; cd ~/git

Build lobid-resources:

  • git clone https://github.com/hbz/lobid-resources.git
  • cd lobid-resources
  • mvn clean install

Build the web application:

  • cd web
  • wget http://downloads.typesafe.com/typesafe-activator/1.3.10/typesafe-activator-1.3.10-minimal.zip
  • unzip typesafe-activator-1.3.10-minimal.zip
  • ./activator-1.3.10-minimal/bin/activator test

See the .github/workflows/build.yml file for details on the CI config used by Github Actions.

Eclipse setup

Replace test with other Play commands, e.g. "eclipse with-source=true" (generate Eclipse project config files, then import as existing project in Eclipse), ~ run (run in test mode, recompiles changed files on save, use this to keep your Eclipse project in sync while working, make sure to enable automatic workspace refresh in Eclipse: Preferences > General > Workspace > Refresh using native hooks or polling).

Production

Use "start 8000" to run in production background mode on port 8000 (hit Ctrl+D to exit logs). To restart a production instance running in the background, you can use the included restart.sh script (configured to use port 8000). For more information, see the Play documentation.

Example of getting the data

In the online test the data is indexed to a living elasticsearch instance.
This instance is only reachable within our internally network, thus this test
must be executed manually. Then elasticsearch can be looked up like this:

http://lobid.org/resources/HT002619538

For querying it you can use the elasticsearch query DSL, like:

http://lobid.org/resources?q=title:Mobydick

The result shows the data which also Hbz01MabXml2ElasticsearchLobidTest.java produces.

Developer instructions

This section explains how to make a successful build after changing the transformations,
how to update the JSON-LD and its context, and how to index the data.

Changing transformations

After changing the morph the build must be executed:

mvn clean install

Two possible outcomes:

  • BUILD SUCCESS: the tested resources don’t reflect the changes.
    In this case you should add an Aleph-MabXml resource to hbz01XmlClobs.tar.bz2 that would reflect your changes. Do like this to add the resource HT018895767:
    cd src/test/resources; rm -rf hbz01XmlClobs; tar xfj hbz01XmlClobs; xmllint --format http://beta.lobid.org/hbz01/HT018895767 > hbhbz01XmlClobs/HT018895767; tar cfj hbz01XmlClobs.tar.bz2 hbz01XmlClobs; cd -; mvn clean install; git add src/test/resources/jsonld/ ; git add src/test/resources/hbz01.es.nt
  • BUILD FAILURE: the newly generated data isn’t equal to the test resources.
    This is a good thing because you wanted the change.

Doing mvn test -DgenerateTestData=true the test data is generated and also updated in the filesystem.
These new data will now act as the template for sucessful tests. So, if you would rebuild now, the build will pass successfully.
You just must approve the new outcome by committing it.

Now you must approve the new outcome.
Let’s see what has changed:

git status

Let’s make a diff on the changes, e.g. all JSON-LD documents:

git diff src/test/resources/jsonld/

You can validate the generated JSON-LD documents with the provided schemas:

cd src/test/resources; bash validateJsonTestFiles.sh

If you are satisfied with the changes, go ahead and add and commit them:

git add src/test/resources/jsonld/; git commit

Do this respectivly for all other test files (Ntriples …).
If you’ve added and commited everything, check again if all is ok:

mvn clean install

This should result in BUILD SUCCESS. Push your changes. You’re done :)

Propagate the context.json to lobid-resources-web

The generated context.jsonld is automatically written to the proper directory
so that it is automatically deployed when the web application is deployed.

When the small test set is indexed by using buildAndETLTestSet.sh deploy your branch in
the staging directory of the web application. The context for the resources is adapted
to use the “staging.lobid.org”-domain and thus the staging-context.jsonld will resolve using the one in that directory.

Elasticsearch index

This is about the building of the production index as well as the small test index.

Production

All automation is configured at one central crontab entry: hduser@weywot1 .
This is your starting point for tracing what scripts are triggered on what server
at which time. All is logged, see the crontab entries resp. the scripts.

Like with the old (and still productive) lobid index, a weekly fulldump
index is build on Saturday/Sunday. This is the base which will then be complemented by the daily incremental updates.
Have a look at both source data, base and daily updates.
The base-file must be referenced by a symbolic link residing in the nfs: “/files/open_data/closed/hbzvk/index.hbz-nrw.de/alephxml/clobs/baseline/aliasNewestFulldump.tar.gz”. It is this symbolic link which is used by indexing processes.
Building the full index takes around 12 hours. The finished index is aliased
to resources-staging. This is manually to be tested. If it’s ok, the
index is switched by simply renaming this alias to resources . Both aliases are
reflected in the lookup-URI (e.g. http://lobid.org/resources/HT015082724 vs
http://lobid.org/resources-staging/HT015082724 resp. internal ES URI
http://gaia.hbz-nrw.de:9200/resources/_all/HT015082724 and
http://gaia.hbz-nrw.de:9200/resources-staging/_all/HT015082724 ).

The three latest indices will be retained, also every index with no -staging
suffix on the alias. Other indices which start with the same string (e.g. resources) will be
removed by the index program after the indexing is finished – the hard disk
is not of endless size ;) .

Note: in production, always make sure that the JSON-LD context is the proper one.

Small test

This will download ~5k aleph xml clobs (if not already residing in your filesystem)
and index them into elasticsearch, aliased test-resources (not interfering with the
productive indexes, thus not removing any productive index).

cd src/test/resources; bash buildAndETLTestSet.sh

Data updates

How to manually invoke data updates for both API 1.x data and API 2.0 data.

Data source for updates and baselines: http://lobid.org/download/dumps/DE-605/mabxml/

These are created with weywot1:/home/hduser/git/hbz-aleph-dumping/bin/daily-hbz-aleph.sh.

For all commands below, add this to redirect output to file and follow that file:

> log/201605090945-master.startHbz01ToLobidResources.sh.log 2>&1 & tail -f log/201605090945-master.startHbz01ToLobidResources.sh.log

Data 1.x, at hduser@weywot1:/home/hduser/git/lodmill/lodmill-rd/doc/scripts/hbz01$

Single update:

bash -x startHbz01ToLobidResources.sh master /files/open_data/open/DE-605/mabxml/DE-605-aleph-update-marcxchange-20160411-20160412.tar.gz lobid-resources-staging NOALIAS quaoar2.hbz-nrw.de quaoar exact

Updates listed in file:

bash -x startHbz01ToLobidResources.sh master dummy_ignore lobid-resources-staging NOALIAS quaoar2.hbz-nrw.de quaoar exact doc/scripts/hbz01/toBeUpdateFilesXmlClobs_afterBasedump.txt

New baseline index:

bash -x startHbz01ToLobidResources.sh master /files/open_data/closed/hbzvk/index.hbz-nrw.de/alephxml/clobs/baseline/aliasNewestFulldump.tar.gz lobid-resources-201605090945 "-staging" quaoar2.hbz-nrw.de quaoar create doc/scripts/hbz01/toBeUpdateFilesXmlClobs_afterBasedump.txt

Data 2.0, at hduser@gaia:/opt/hadoop/git/lobid-resources$

Single update:

bash -x startHbz01ToLobidResources.sh master /files/open_data/open/DE-605/mabxml/DE-605-aleph-update-marcxchange-20160411-20160412.tar.gz resources-staging NOALIAS gaia.hbz-nrw.de lobid-gaia exact

Updates listed in file:

bash -x startHbz01ToLobidResources.sh master dummy_ignore resources-staging NOALIAS gaia.hbz-nrw.de lobid-gaia exact toBeUpdateFilesXmlClobs_afterBasedump.txt

New baseline index:

bash -x startHbz01ToLobidResources.sh master /files/open_data/closed/hbzvk/index.hbz-nrw.de/alephxml/clobs/baseline/aliasNewestFulldump.tar.gz resources-201605090945 NOALIAS gaia.hbz-nrw.de lobid-gaia create

License

Eclipse Public License: http://www.eclipse.org/legal/epl-v10.html

lobid-resources's People

Contributors

acka47 avatar christophewertowski avatar dependabot[bot] avatar dr0i avatar fsteeg avatar jschnasse avatar tobiasnx avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.