limetrans - Library Metadata Transformation
Configuration
Limetrans can be regarded as a configuration framework for using Metafacture for library purposes. It uses a JSON configuration scheme, which in outline looks like this:
{
"input" : {
...
},
"transformation-rules" : "...",
"output": {
...
},
...
}
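As a quick sanity check before running a transformation, the outline above can be tested against a configuration file. The following is a minimal sketch, not part of limetrans: the function name `missing_top_level_keys` is hypothetical, and it assumes that all three top-level keys shown in the outline are required.

```python
import json

# Top-level keys taken from the configuration outline above.
# Assumption: all three are mandatory; limetrans itself may be more lenient.
REQUIRED_KEYS = ("input", "transformation-rules", "output")


def missing_top_level_keys(config_text):
    """Return the outline keys that are absent from a JSON configuration string."""
    config = json.loads(config_text)
    return [key for key in REQUIRED_KEYS if key not in config]


# Hypothetical, heavily shortened configuration used only for illustration.
example = '{"input": {}, "transformation-rules": "rules.xml", "output": {}}'
print(missing_top_level_keys(example))
```

An empty list means that every key from the outline is present.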
Input
Input is generally configured like this:
"input" : {
"queue" : {
"path" : "a/path/to/your/input/file/",
"pattern" : "your-marc-xml-input-file.xml",
"sort_by" : "lastmodified",
"order" : "desc",
"max" : 1,
"normalize-unicode" : false,
"processor" : "MARC21"
}
}
MARCXML is the default value for 'processor'; thus 'processor' can be omitted when processing MARCXML data.
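Read literally, the queue settings above pick the files in 'path' that match 'pattern', sort them by modification time, and keep at most 'max' of them. The sketch below illustrates that reading only. It is not limetrans code: the function name `select_queue_files` is hypothetical, and treating 'pattern' as a glob-style pattern is an assumption.

```python
import fnmatch
import os


def select_queue_files(queue):
    """Select input files according to one interpretation of the queue settings."""
    directory = queue["path"]
    # Assumption: "pattern" is matched glob-style against file names.
    names = fnmatch.filter(os.listdir(directory), queue["pattern"])
    paths = [os.path.join(directory, name) for name in names]
    if queue.get("sort_by") == "lastmodified":
        # "desc" puts the most recently modified file first.
        paths.sort(key=os.path.getmtime, reverse=(queue.get("order") == "desc"))
    limit = queue.get("max")
    return paths if limit is None else paths[:limit]
```

Under this reading, "sort_by" : "lastmodified", "order" : "desc" and "max" : 1 together select only the most recently modified matching file.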
Transformation
"transformation-rules" : "a/path/to/your/transformation/metafacture/rules/file.xml"
Output
Currently, limetrans is written to be used with Elasticsearch. The output object therefore mainly contains an Elasticsearch configuration, alongside a JSON output option.
"output": {
"json" : "a/path/to/your/jsonlines/output/file.jsonl",
"elasticsearch" : {
"cluster": "elasticsearch-01",
"host": ["localhost:9300"],
"index" : {
"type" : "title",
"name" : "choose-your-own-index-name",
"timewindow" : "yyyyMMdd",
"settings" : "a/path/to/your/elasticsearch/settings.json",
"mapping" : "a/path/to/your/elasticsearch/mapping.json",
"idKey" : "the-id-field-name-configured-in-your-metafacture-rules-file"
},
"update" : false,
"delete" : false,
"bulkAction" : "index",
"maxbulkactions" : 100000
},
"pretty-printing" : false
}
"type" : "title" is a suggestion, assuming you might want to transform and store book title information.
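The file named by the "json" option can be inspected without Elasticsearch. The sketch below assumes the file follows the JSON Lines convention suggested by the .jsonl extension (one JSON document per line) and that "pretty-printing" is false, so no record spans several lines. The helper name `read_jsonl` is hypothetical.

```python
import json


def read_jsonl(path):
    """Yield one parsed record per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

For example, `sum(1 for _ in read_jsonl("a/path/to/your/jsonlines/output/file.jsonl"))` would count the transformed records.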
Further configuration
"catalogid" : "choose-your-own-catalog-id",
"collection" : "choose-your-own-collection"
Please find examples for the configuration of limetrans in the source code, e.g. here.
Setup project
Get Source
$ git clone [email protected]:hbz/limetrans.git
Setup Elasticsearch
Download and install Elasticsearch
$ cd third-party
$ wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-2.1.1.zip
$ unzip elasticsearch-2.1.1.zip
$ cd elasticsearch-2.1.1
$ bin/elasticsearch
Check with curl -X GET http://localhost:9200/ if all is well.
Configure Elasticsearch
Currently, Elasticsearch is configured to run on a cluster named elasticsearch-01, see e.g. here. Make sure you have configured the cluster name accordingly in /etc/elasticsearch/elasticsearch.yml.
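The JSON document returned by http://localhost:9200/ includes a cluster_name field, so the cluster name can be verified from the same response used for the health check. A minimal sketch, with a hypothetical helper name and a made-up, abbreviated response body:

```python
import json


def cluster_name_matches(response_body, expected="elasticsearch-01"):
    """Compare the cluster_name reported by Elasticsearch with the expected name."""
    info = json.loads(response_body)
    return info.get("cluster_name") == expected


# Abbreviated, made-up response body for illustration only.
sample = '{"name": "node-1", "cluster_name": "elasticsearch-01"}'
print(cluster_name_matches(sample))
```

If the check fails, the cluster name in elasticsearch.yml and the "cluster" value in the limetrans output configuration probably differ.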
Optionally, you may want to install the head plugin:
$ cd third-party/elasticsearch-2.4.0
$ bin/plugin install mobz/elasticsearch-head
Contribute
Coding conventions
Indent blocks by four spaces and wrap lines at 100 characters. For more details, refer to the Google Java Style Guide.
Bug reports
Please file bugs as an issue labeled "Bug" here.