
Command line OAI-PMH harvester and client with built-in cache.

Home Page: https://lab.ub.uni-leipzig.de/metha/

License: GNU General Public License v3.0



metha

The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a low-barrier mechanism for repository interoperability. Data Providers are repositories that expose structured metadata via OAI-PMH. Service Providers then make OAI-PMH service requests to harvest that metadata. -- https://www.openarchives.org/pmh/

The metha command line tools can gather information on OAI-PMH endpoints and harvest data incrementally. The goal of metha is to make it simple to get access to data; its focus is not to manage it.

Project Status: Active – The project has reached a stable, usable state and is being actively developed.

The metha tool has been developed for project finc at Leipzig University Library (lab).

Why yet another OAI harvester?

  • I wanted to crawl arXiv but found that existing tools would time out.
  • Some harvesters would start to download all records anew if I interrupted a running harvest.
  • There are many OAI endpoints out there. It is a widely used protocol and somewhat worth knowing.
  • I wanted something simple for the command line, and also fast and robust. As implemented now, metha is relatively robust and more efficient than requesting all records one by one (there is one annoyance which will hopefully be fixed soon).

How it works

The functionality is spread across a few different executables:

  • metha-sync for harvesting
  • metha-cat for viewing
  • metha-id for gathering data about endpoints
  • metha-ls for inspecting the local cache
  • metha-files for listing the associated files for a harvest

To harvest an endpoint in the default oai_dc format:

$ metha-sync http://export.arxiv.org/oai2
...

All downloaded files are written to a directory below a base directory. The base directory is ~/.cache/metha by default and can be adjusted with the METHA_DIR environment variable.

When the -dir flag is set, only the directory corresponding to a harvest is printed.

$ metha-sync -dir http://export.arxiv.org/oai2
/home/miku/.metha/I29haV9kYyNodHRwOi8vZXhwb3J0LmFyeGl2Lm9yZy9vYWky
$ METHA_DIR=/tmp/harvest metha-sync -dir http://export.arxiv.org/oai2
/tmp/harvest/I29haV9kYyNodHRwOi8vZXhwb3J0LmFyeGl2Lm9yZy9vYWky

The harvesting can be interrupted at any time and the HTTP client will automatically retry failed requests a few times before giving up.

Currently, there is a limitation which only allows harvesting data up to the previous day. Example: if the current date were Thu Apr 21 14:28:10 CEST 2016, the harvester would request all data between the repository's earliest date and 2016-04-20 23:59:59.

To stream the harvested XML data to stdout run:

$ metha-cat http://export.arxiv.org/oai2

You can emit records based on datestamp as well:

$ metha-cat -from 2016-01-01 http://export.arxiv.org/oai2

This will only stream records with a datestamp equal to or later than 2016-01-01.

To just stream all data really fast, use find and zcat (or unpigz) over the harvesting directory.

$ find $(metha-sync -dir http://export.arxiv.org/oai2) -name "*gz" | xargs unpigz -c

To display basic repository information:

$ metha-id http://export.arxiv.org/oai2

To list all harvested endpoints:

$ metha-ls

Further examples can be found in the metha man page:

$ man metha

Installation

Use a deb, rpm release, or the go tool:

$ go install -v github.com/miku/metha/cmd/...@latest

Limitations

Currently the endpoint URL, the format, and the set are concatenated and base64 encoded to form the target directory, e.g.:

$ echo "U291bmRzI29haV9kYyNodHRwOi8vY29wYWMuamlzYy5hYy51ay9vYWktcG1o" | base64 -d
Sounds#oai_dc#http://copac.jisc.ac.uk/oai-pmh

If you have very long set names or a very long URL and the target directory name exceeds the file system limit (e.g. 255 characters on ext4), the harvest won't work.
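
The directory name can also be reconstructed by hand. A sketch, assuming URL-safe base64 without padding (cf. the base64.RawURLEncoding mentioned in harvest.go in an issue below) and the set#format#url concatenation shown above, using GNU coreutils' basenc:

$ echo -n "#oai_dc#http://export.arxiv.org/oai2" | basenc --base64url | tr -d '='
I29haV9kYyNodHRwOi8vZXhwb3J0LmFyeGl2Lm9yZy9vYWky

With an empty set the name starts with the bare separator; the result matches the directory printed by metha-sync -dir above.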

Harvesting Roulette

$ URL=$(shuf -n 1 <(curl -Lsf https://git.io/vKXFv)); metha-sync $URL && metha-cat $URL

In 0.1.27 a metha-fortune command was added, which fetches a random article description and displays it.

$ metha-fortune
Active Networking is concerned with the rapid definition and deployment of
innovative, but reliable and robust, networking services. Towards this end we
have developed a composite protocol and networking services architecture that
encourages re-use of protocol functions, is well defined, and facilitates
automatic checking of interfaces and protocol component properties. The
architecture has been used to implement common Internet protocols and services.
We will report on this work at the workshop.

    -- http://drops.dagstuhl.de/opus/phpoai/oai2.php

$ metha-fortune
In this paper we show that the Lempert property (i.e., the equality between the
Lempert function and the Carathéodory distance) holds in the tetrablock, a
bounded hyperconvex domain which is not biholomorphic to a convex domain. The
question whether such an equality holds was posed by Abouhajar et al. in J.
Geom. Anal. 17(4), 717–750 (2007).

    -- http://ruj.uj.edu.pl/oai/request

$ metha-fortune
I argue that Gödel's incompleteness theorem is much easier to understand when
thought of in terms of computers, and describe the writing of a computer
program which generates the undecidable Gödel sentence.

    -- http://quantropy.org/cgi/oai2

$ metha-fortune
Nigeria, a country in West Africa, sits on the Atlantic coast with a land area
of approximately 90 million hectares and a population of more than 140 million
people. The southern part of the country falls within the tropical rainforest
which has now been largely depleted and is in dire need of reforestation. About
10 percent of the land area was constituted into forest reserves for purposes
of conservation but this has suffered perturbations over the years to the
extent that what remains of the constituted forest reserves currently is less
than 4 percent of the country land area. As at today about 382,000 ha have been
reforested with indigenous and exotic species representing about 4 percent of
the remaining forest estate. Regrettably, funding of the Forestry sector in
Nigeria has been critically low, rendering reforestation programme near
impossible, especially in the last two decades. To revive the forestry sector
government at all levels must re-strategize and involve the local communities
as co-managers of the forest estates in order to create mutual dependence and
interaction in resource conservation.

    -- http://journal.reforestationchallenges.org/index.php/REFOR/oai

Scrape all metadata in a best-effort way

Use an endless loop with a timeout to get out of any hanging connections (which do happen). Example scrape, converted to JSON (326M records, 60+ GB: 2023-11-01-metha-oai.ndjson.zst).

$ while true; do \
    timeout 120 metha-sync -list | \
    shuf | \
    parallel -j 64 -I {} "metha-sync -base-dir ~/.cache/metha {}"; \
done

Alternatively, use a metha.service file to run harvests continuously.
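
A minimal sketch of what such a unit could look like (the path, endpoint, and pairing with a systemd timer are assumptions, not shipped defaults):

[Unit]
Description=metha OAI harvest (sketch)

[Service]
Type=oneshot
ExecStart=/usr/local/bin/metha-sync http://export.arxiv.org/oai2

[Install]
WantedBy=multi-user.target

Triggered periodically by a timer (or run with Restart= directives), this keeps harvests going without a crontab.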

metha stores harvested data in one file per interval; to combine all XML files into a single JSON file, you can use xmlstream.go (adjust the harvest directory):

$ fd . '/data/.cache/metha' -e xml.gz | parallel unpigz -c | xmlstream -D

For notes on parallel processing of XML see: Faster XML processing in Go.

Errors this harvester can somewhat handle

  • responses with resumption tokens that lead to empty responses
  • gzipped responses that are not advertised as such
  • funny (illegal) control characters in XML responses
  • repositories that won't respond unless the dates are given with the exact granularity
  • repositories with endless token loops
  • repositories that do not support selective harvesting (use the -no-intervals flag)
  • rate-limited repositories; metha will try a few times with an exponential backoff
  • repositories which throw occasional HTTP errors although most of the responses look good (use the -ignore-http-errors flag; see the example below)
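
For example, the two flags mentioned in the list can be combined for a problematic endpoint (the URL is a placeholder):

$ metha-sync -no-intervals -ignore-http-errors http://example.org/oai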


Misc

Show formats of random repository:

$ shuf -n 1 <(curl -Lsf https://git.io/vKXFv) | xargs -I {} metha-id {} | jq .formats

A snippet from a 2010 publication:

The Open Archives Protocol for Metadata Harvesting (OAI-PMH) (Lagoze and van de Sompel, 2002) is currently implemented by more than 1,700 digital library repositories world-wide and enables the exchange of metadata via HTTP. -- Interweaving OAI-PMH Data Sources with the Linked Data Cloud



metha's Issues

Urlencode resumptionToken

Firstly, thanks for the amazing project!!

It's working 99% perfectly for full OAI harvests. One issue I have is that one of the OAI feeds includes unescaped characters in the resumptionToken; refer below for examples.

Unescaped + - this URL doesn't work:

https://rosetta.slv.vic.gov.au/oaiprovider/request?resumptionToken=20210901000000@20210926235959@Primo@oai_dc@AAAWxpABYAAALK+AAU&verb=ListRecords

Escaped + - this URL does work:

https://rosetta.slv.vic.gov.au/oaiprovider/request?resumptionToken=20210901000000@20210926235959@Primo@oai_dc@AAAWxpABYAAALK%2BAAU&verb=ListRecords

Is there a way to escape the resumptionToken with metha, or a config option for this?

Cheers

Justin Kelly
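
For reference, the escaping being asked about is a one-liner with Go's net/url package; a sketch of the transformation, not metha's actual code path:

package main

import (
	"fmt"
	"net/url"
)

func main() {
	token := "20210901000000@20210926235959@Primo@oai_dc@AAAWxpABYAAALK+AAU"
	// QueryEscape percent-encodes "+" and other reserved characters,
	// so the token ends in ...AAAWxpABYAAALK%2BAAU and survives the
	// round trip through the query string.
	fmt.Println(url.QueryEscape(token))
}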

Dependency Issue with Version 0.2.37

Hi @miku,

It has been a while :-) We worked with version metha_0.2.32_amd64.deb which worked nicely.

Today I tried version metha_0.2.37_amd64.deb but got a problem when running it:

metha-sync -v
metha-sync: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by metha-sync)
metha-sync: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by metha-sync)

I installed it like this: sudo dpkg -i metha_0.2.37_amd64.deb on a Ubuntu 20.04.3 LTS with go version go1.13.8 linux/amd64.

To me this looks like a dynamically linked lib that is missing. Maybe related: https://community.tmpdir.org/t/problem-with-go-binary-portability-lib-x86-64-linux-gnu-libc-so-6-version-glibc-2-32-not-found/123

For now, I switched to metha_0.2.36_amd64.deb which works. So I assume it must be something related to https://github.com/miku/metha/releases/tag/v0.2.37 as the release message states:

  • built with Go 1.20.2
  • updated deps

Let me know if I can be of any assistance.

Kind regards,

Tobias
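
For reference, errors like these usually mean the binary was linked against a newer glibc than the target system ships. Building from source with cgo disabled produces a binary without that dependency; a sketch, assuming a recent Go toolchain (the official packages may be built differently):

$ CGO_ENABLED=0 go install github.com/miku/metha/cmd/...@latest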

Selective Harvesting and metha-cat

Hi @miku,

We are adding more and more OAI-PMH endpoints and metha does a great job!

I have a question about selective harvesting and metha-cat. I have automated harvesting via crontab.
After an initial harvest that gets all records from the earliest day on, we do one selective harvest a week:

metha-sync -T 5m -r 20 -base-dir /mydir -format marcxml https://zenodo.org/oai2d

Since all previous harvests are written to /mydir (local cache), metha-sync implicitly sets the -from param according to the last harvest, correct?

Now with metha-cat (without providing a timestamp), I have observed that more records are returned in the virtual XML than are actually in the repo, so I assume this also includes updates of a record (so the same record can occur multiple times in metha-cat's output). Is this interpretation correct?

EDIT: What I'd like to get is the latest version of each record via metha-cat.

Thanks and kind regards,

Tobias
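
Regarding the EDIT above: one way to get the latest version of each record is to post-process the metha-cat stream. A sketch (dedupe.go is a hypothetical helper, not part of metha), assuming ISO-8601 datestamps, which order correctly as plain strings, and a record set that fits in memory:

package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"log"
	"os"
)

type record struct {
	Raw    string `xml:",innerxml"` // verbatim record body for re-emission
	Header struct {
		Identifier string `xml:"identifier"`
		Datestamp  string `xml:"datestamp"`
	} `xml:"header"`
}

func main() {
	newest := map[string]record{}
	dec := xml.NewDecoder(os.Stdin)
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		se, ok := tok.(xml.StartElement)
		if !ok || se.Name.Local != "record" {
			continue
		}
		var r record
		// DecodeElement consumes the whole <record> element, so nested
		// elements (e.g. MARC <record>) are not matched again.
		if err := dec.DecodeElement(&r, &se); err != nil {
			log.Fatal(err)
		}
		old, seen := newest[r.Header.Identifier]
		if !seen || r.Header.Datestamp > old.Header.Datestamp {
			newest[r.Header.Identifier] = r
		}
	}
	for _, r := range newest {
		fmt.Printf("<record>%s</record>\n", r.Raw)
	}
}

Usage could look like this, assuming metha-cat accepts the same -base-dir/-format flags as metha-sync:

$ metha-cat -base-dir /mydir -format marcxml https://zenodo.org/oai2d | go run dedupe.go > latest.xml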

undefined base64.RawURLEncoding

Is this something to be concerned about?

$ go get github.com/miku/metha/cmd/...
# github.com/miku/metha
gocode/src/github.com/miku/metha/harvest.go:87: undefined: base64.RawURLEncoding

To me it looks as though the package doesn't successfully build. (Nothing in gocode/bin/.)

Why is data only harvested up to the last day?

The readme says

Currently, there is a limitation which only allows to harvest data up to the last day. Example: If the current date would be Thu Apr 21 14:28:10 CEST 2016, the harvester would request all data since the repositories earliest date and 2016-04-20 23:59:59.

which is indeed the current behavior. Do you remember what the reason for this limitation is? Is it something inherent in the OAI protocol, or does it come from somewhere else?

I'm using metha to harvest the arXiv and am curious about this one-day delay.

Implement various harvesting strategies properly.

metha should implement various harvesting strategies:

  • normal/default (for standard-conforming endpoints); harvest windows: daily, monthly, yearly, all
  • single records, so individual records may fail in isolation and servers are not overloaded
  • other modes: all at once

Implementation ideas:

Instead of relying only on files, introduce a small manifest.json describing the harvested content (ids, dates, harvesting dates, files).
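
A hypothetical shape for such a manifest (purely illustrative; the fields are assumptions, nothing like this exists in metha today):

{
  "endpoint": "http://export.arxiv.org/oai2",
  "format": "oai_dc",
  "set": "",
  "windows": [
    {
      "from": "2016-01-01T00:00:00Z",
      "until": "2016-01-31T23:59:59Z",
      "file": "2016-01.xml.gz",
      "harvested": "2016-02-01T03:00:00Z",
      "ids": 1234
    }
  ]
}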

Date Parsing Issue

Hi @miku,

I have encountered an issue occurring when trying to parse an empty string as a date. I am not sure whether this is a problem with the OAI-PMH endpoint doing something illegal or metha itself (not sure if the OAI-PMH specs allow for empty strings as dates).

This is what I got for metha-sync -base-dir . -format marcxml "https://zentralgut.ch/oai":

INFO[0000] https://zentralgut.ch/oai?verb=Identify
INFO[0000] harvest: &{BaseURL:https://zentralgut.ch/oai Format:marcxml Set: From: Until: Client:0xc00010c640 MaxRequests:1048576 DisableSelectiveHarvesting:false CleanBeforeDecode:true IgnoreHTTPErrors:false MaxEmptyResponses:10 SuppressFormatParameter:false HourlyInterval:false DailyInterval:false ExtraHeaders:map[] KeepTemporaryFiles:false Delay:0 Identify:0xc000488938 Started:0001-01-01 00:00:00 +0000 UTC Mutex:{state:0 sema:0}}
FATA[0000] parsing time "" as "2006-01-02T15:04:05Z": cannot parse "" as "2006"

I am using metha 0.2.57.

I could make it work by providing the -from option. 2006-01-02 occurs as a sample -from date in the docs, so maybe that is why it is mentioned in the error message?

Thanks for your feedback and kind regards,

Tobias
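
For context, 2006-01-02T15:04:05Z is Go's reference time layout, which time.Parse echoes back in its error messages; parsing an empty datestamp reproduces the message exactly (a sketch to illustrate, not metha code):

package main

import (
	"fmt"
	"time"
)

func main() {
	// An empty <datestamp> from the endpoint ends up here:
	_, err := time.Parse("2006-01-02T15:04:05Z", "")
	fmt.Println(err)
	// parsing time "" as "2006-01-02T15:04:05Z": cannot parse "" as "2006"
}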

conflicting namespace prefixes during ListRecords

If you do a harvest during which the same prefix is seen with different URL targets, metha-sync will jumble the prefix, suffixing it with 1 but never declaring that renamed prefix, so the resulting XMLs become invalid.

For example, if I do

metha-sync -format mets -set 17th-century-prints http://digital.slub-dresden.de/oai/

then (because in our MODS the namespace for the extension slub was changed some time ago and now appears in some records with the declaration http://www.slub-dresden.de/namespace but with http://www.slub-dresden.de/ in others) I end up with altered and non-well-formed METS files. For example, in oai:de:slub-dresden:db:id-1840307358, instead of…

               <mods:extension>
                  <slub:slub>
                     <slub:id type="digital">1840307358</slub1:id>
                     <slub:id type="source">113051157X</slub1:id>
                     <slub:id type="tsl-ats">Mercgeovg</slub1:id>
                  </slub:slub>
               </mods:extension>
               <mods:recordInfo>
                  <mods:recordIdentifier source="http://digital.slub-dresden.de/oai/">oai:de:slub-dresden:db:id-1840307358</mods:recordIdentifier>
               </mods:recordInfo>

…(which is what you get for a single GetRecord request) I now see…

               <mods:extension>
                  <slub1:slub>
                     <slub1:id type="digital">1840307358</slub1:id>
                     <slub1:id type="source">113051157X</slub1:id>
                     <slub1:id type="tsl-ats">Mercgeovg</slub1:id>
                  </slub1:slub>
               </mods:extension>
               <mods:recordInfo>
                  <mods:recordIdentifier source="http://digital.slub-dresden.de/oai/">oai:de:slub-dresden:db:id-1840307358</mods:recordIdentifier>
               </mods:recordInfo>

…(which is invalid, because slub1 has never been introduced).

Harvest hangs on UTF-8 errors

I'm wondering, is it intentional that harvesting hangs when invalid UTF-8 is encountered? I'm getting the following error and the harvesting stops.

XML syntax error on line 567: invalid UTF-8

If possible, it would be nice if harvesting could continue even in the case of UTF-8 errors, as it does for HTTP errors when the user has provided the -ignore-http-errors flag. I'm using metha 0.1.15 installed via go get.

Failed With Unprocessable Entity

Hi @miku,

I experience a problem with an endpoint that used to work before. I am not sure if this is a problem with the endpoint itself. Maybe you can give me some guidance.

$ metha-sync -v
0.3.0
$ metha-sync -base-dir . -format marcxml -set user-lory_phlu https://zenodo.org/oai2d
INFO[0000] https://zenodo.org/oai2d?verb=Identify       
INFO[0000] harvest: &{BaseURL:https://zenodo.org/oai2d Format:marcxml Set:user-lory_phlu From: Until: Client:0xc0000847f0 MaxRequests:1048576 DisableSelectiveHarvesting:false CleanBeforeDecode:true IgnoreHTTPErrors:false MaxEmptyResponses:10 SuppressFormatParameter:false HourlyInterval:false DailyInterval:false ExtraHeaders:map[] KeepTemporaryFiles:false Delay:0 Identify:0xc00013aec0 Started:0001-01-01 00:00:00 +0000 UTC Mutex:{state:0 sema:0}} 
INFO[0000] https://zenodo.org/oai2d?from=2014-02-03T00:00:00Z&metadataPrefix=marcxml&set=user-lory_phlu&until=2014-02-28T23:59:59Z&verb=ListRecords 
FATA[0000] failed with Unprocessable Entity on https://zenodo.org/oai2d?from=2014-02-03T00:00:00Z&metadataPrefix=marcxml&set=user-lory_phlu&until=2014-02-28T23:59:59Z&verb=ListRecords: <nil> 

Among the returned info from the endpoint, I see "DailyInterval:false". Does this explain the issue?

UPDATE: I see "DailyInterval:false" also with other endpoints that work.

Thanks!

Tobias

Bad page state in metha-sync (arm)

Debian 10 on arm.

...
[85698.893018] BUG: Bad page state in process metha-sync  pfn:000f0
[85698.899047] page:eedfa1c0 count:0 mapcount:0 mapping:00000000 index:0x1
[85698.905676] flags: 0x400(arch_1)
[85698.908913] raw: 00000400 00000100 00000200 00000000 00000001 00000000 ffffffff 00000000
[85698.917021] raw: 00000000
[85698.919644] page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag set
[85698.925836] bad because of flags: 0x400(arch_1)
[85698.930374] Modules linked in: tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag nf_tables nfnetlink softdog overlay zstd zram zsmalloc orion_wdt pwm_fan nfsd lm75 marvell_cesa ip_tables x_tables raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx raid1 raid0 md_mod
[85698.930421] CPU: 1 PID: 31563 Comm: metha-sync Not tainted 4.19.63-mvebu #5.91
[85698.930423] Hardware name: Marvell Armada 380/385 (Device Tree)
[85698.930438] [<c010ce11>] (unwind_backtrace) from [<c01095eb>] (show_stack+0xb/0xc)
[85698.930446] [<c01095eb>] (show_stack) from [<c0706b25>] (dump_stack+0x69/0x78)
[85698.930453] [<c0706b25>] (dump_stack) from [<c01ac8e1>] (bad_page+0xa5/0xe4)
[85698.930460] [<c01ac8e1>] (bad_page) from [<c01ae94b>] (get_page_from_freelist+0x757/0xad4)
[85698.930465] [<c01ae94b>] (get_page_from_freelist) from [<c01af22d>] (__alloc_pages_nodemask+0xc5/0xa20)
[85698.930473] [<c01af22d>] (__alloc_pages_nodemask) from [<c01cb9e1>] (handle_mm_fault+0x42d/0x980)
[85698.930479] [<c01cb9e1>] (handle_mm_fault) from [<c01100db>] (do_page_fault+0xd3/0x22c)
[85698.930484] [<c01100db>] (do_page_fault) from [<c011034d>] (do_DataAbort+0x3d/0xa8)
[85698.930489] [<c011034d>] (do_DataAbort) from [<c0101dcf>] (__dabt_usr+0x4f/0x60)
[85698.930491] Exception stack(0xedd7ffb0 to 0xedd7fff8)
[85698.930495] ffa0:                                     00000000 00000000 00000000 00000000
[85698.930499] ffc0: 00000000 00000000 00000000 00000000 0867c000 09f50000 020007e0 09f50000
[85698.930503] ffe0: 09f4ffe1 a61fedb8 00033108 00066c60 80080010 ffffffff
[85698.930506] Disabling lock debugging due to kernel taint
[85698.930508] BUG: Bad page state in process metha-sync  pfn:000f1
[85698.936528] page:eedfa1e4 count:0 mapcount:0 mapping:00000000 index:0x1
[85698.943157] flags: 0x400(arch_1)
[85698.946393] raw: 00000400 00000100 00000200 00000000 00000001 00000000 ffffffff 00000000
[85698.954500] raw: 00000000
[85698.957124] page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag set
[85698.963316] bad because of flags: 0x400(arch_1)
[85698.967854] Modules linked in: tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag nf_tables nfnetlink softdog overlay zstd zram zsmalloc orion_wdt pwm_fan nfsd lm75 marvell_cesa ip_tables x_tables raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx raid1 raid0 md_mod
[85698.967891] CPU: 1 PID: 31563 Comm: metha-sync Tainted: G    B             4.19.63-mvebu #5.91
[85698.967892] Hardware name: Marvell Armada 380/385 (Device Tree)
...

System details.

tir@helios4:~ $ cat /etc/armbian-release 
# PLEASE DO NOT EDIT THIS FILE
BOARD=helios4
BOARD_NAME="Helios4"
BOARDFAMILY=mvebu
BUILD_REPOSITORY_URL=https://github.com/armbian/build
BUILD_REPOSITORY_COMMIT=0d21d90f
VERSION=5.91
LINUXFAMILY=mvebu
BRANCH=next
ARCH=arm
IMAGE_TYPE=user-built
BOARD_TYPE=conf
INITRD_ARCH=arm
KERNEL_IMAGE_TYPE=zImage

tir@helios4:~ $ cat /etc/os-release 
PRETTY_NAME="Debian GNU/Linux 10 (buster)"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

tir@helios4:~ $ go env
GOARCH="arm"
GOBIN=""
GOCACHE="/home/tir/.cache/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="arm"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/tir/go"
GOPROXY=""
GORACE=""
GOROOT="/usr/lib/go-1.11"
GOTMPDIR=""
GOTOOLDIR="/usr/lib/go-1.11/pkg/tool/linux_arm"
GCCGO="gccgo"
GOARM="6"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -marm -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build486501526=/tmp/go-build -gno-record-gcc-switches"

Metha-Cat: Support for Paging?

Hi there,

We are using metha-sync to harvest quite a big set. Everything went smoothly and we could create an XML using metha-cat containing the whole set.
The XML is quite big (2.5 GB) and we have some difficulties processing it.

Is there a way to get the records in steps of a limited size (like paging with a defined size and offset) with metha-cat? Setting the from or until params wouldn't help us much I think (libraries might process big batches on a single day).

Thanks a lot!

`-format` not respected?

Thanks for supplying this great tool!

When trying to harvest arXiv using metha-sync http://export.arxiv.org/oai2 -format "arXiv" -rm, I'm getting a harvest log line of &{BaseURL:http://export.arxiv.org/oai2 Format:oai_dc Set: From: Until: Client:<REDACTED> MaxRequests:1048576 DisableSelectiveHarvesting:false CleanBeforeDecode:true IgnoreHTTPErrors:false MaxEmptyResponses:10 SuppressFormatParameter:false HourlyInterval:false DailyInterval:false ExtraHeaders:map[] KeepTemporaryFiles:false Delay:0 Identify:<REDACTED> Started:0001-01-01 00:00:00 +0000 UTC Mutex:{state:0 sema:0}} (note the Format:oai_dc bit).

The tmp files also include the oai_dc-formatted XML: <Response><responseDate>2024-01-08T11:29:07Z</responseDate><request verb="ListRecords" set="" metadataPrefix="oai_dc">....

Am I calling metha-sync the wrong way, or is this a bug?
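
This looks like argument ordering rather than a bug: Go's standard flag package stops parsing at the first non-flag argument, so flags placed after the URL are silently ignored and the default oai_dc remains in effect. Putting the flags before the URL should work (assuming metha-sync uses the standard flag package):

$ metha-sync -format arXiv -rm http://export.arxiv.org/oai2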

authorization // character limit

Hi, I am trying to use metha to access an endpoint URL that requires an authentication token in the header. If I add the header to the endpoint URL, it exceeds the 255 character limit. It would be nice if metha would allow for either

  1. custom headers, or
  2. custom directories (to solve the 255 character limit issue),

so that I could somehow include my authentication token and access this repository.

thanks

decode failed due to XML header

If I do metha-sync -format mets -set 17th-century-prints http://digital.slub-dresden.de/oai/ I finally end up with a fatal error after this:

INFO[0024] decode failed with: <?xml version="1.0" encoding="UTF-8"?> 
INFO[0025] decode failed with: <?xml version="1.0" encoding="ISO-8859-1"?> 
INFO[0027] decode failed with: <?xml version="1.0" encoding="WINDOWS-1252"?> 
INFO[0028] decode failed with: <?xml version="1.0" encoding="UTF-16"?> 
INFO[0029] decode failed with: <?xml version="1.0" encoding="US-ASCII"?> 

Is this a problem with our data, or with the parser?

metha-cat - cannot open the "dir" established in .cache/metha

Dear all, I have an issue with metha-cat / ls -la $(metha-sync -dir endpoint) (https://api.europeana.eu/oai/record)

Example:
ubuntu@harv01:~$ ls -la $(metha-sync -dir https://api.europeana.eu/oai/record)
ls: cannot access '/home/ubuntu/.cache/metha/I29haV9kYyNodHRwczovL2FwaS5ldXJvcGVhbmEuZXUvb2FpL3JlY29yZA': No such file or directory

ubuntu@harv01:~$ metha-cat -from 2018-01-01 https://api.europeana.eu/oai/record | xmllint --format -
FATA[0000] open /home/ubuntu/.cache/metha/I29haV9kYyNodHRwczovL2FwaS5ldXJvcGVhbmEuZXUvb2FpL3JlY29yZA: no such file or directory
-:1: parser error : Document is empty
I have installed xmllint

I have harvested data with
metha-sync -from 2018-08-01 -until 2018-09-30 -format edm https://api.europeana.eu/oai/record
The XML files are in .cache/metha/I29haV9kYyNodHRwczovL2FwaS5ldXJvcGVhbmEuZXUvb2FpL3JlY29yZA
I can also read them with xmllint --format.

Please advise why the system cannot recognize the above-mentioned directory - do I need additional permissions?
1537810 drwxr-xr-x 2 user user 12288 Aug 25 18:29 I2VkbSNodHRwczovL2FwaS5ldXJvcGVhbmEuZXUvb2FpL3JlY29yZA

Many thanks,
Kurt ([email protected])
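
For reference, the cache directory name encodes set, format, and endpoint URL (see Limitations above): the missing I29haV9kYyN… directory decodes to a name containing oai_dc, while the existing I2VkbSN… directory contains edm. The harvest was made with -format edm, but the metha-cat and metha-sync -dir calls defaulted to oai_dc and therefore looked for a directory that was never created. Passing the same format should resolve it (assuming metha-cat accepts a -format flag like metha-sync):

$ metha-cat -format edm -from 2018-01-01 https://api.europeana.eu/oai/record | xmllint --format -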

retry on OAI exceptions

2017/02/05 05:50:23 http://api.openaire.eu/oai_pmh?resumptionToken=...&verb=ListRecords
2017/02/05 05:51:30 oai: InternalException Could not send Message.

metha-sync should catch SIGINT

Running metha-sync can take a long time. Aborting with Ctrl+C (SIGINT) should be possible while keeping what has been harvested so far. Currently, SIGINT leaves METHA_DIR in a corrupted state with temporary files instead of .gz files. Better to catch SIGINT and finish zipping the files instead.
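
A minimal sketch of the requested behavior using signal.NotifyContext from the standard library; harvestWindow is a hypothetical stand-in for metha's per-window harvest step, not actual metha code:

package main

import (
	"context"
	"log"
	"os"
	"os/signal"
)

// harvestWindow stands in for fetching one harvest window and writing
// the finished .gz file (hypothetical).
func harvestWindow(i int) {}

func main() {
	// Cancel the context on the first SIGINT instead of dying mid-write.
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt)
	defer stop()

	for i := 0; i < 12; i++ { // e.g. one iteration per harvest window
		if ctx.Err() != nil {
			log.Println("interrupted: keeping finished windows, exiting cleanly")
			break
		}
		harvestWindow(i)
	}
}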

Request Entity Too Large

While trying to harvest authority data from the DNB OAI endpoint, I'm getting the following error:

INFO[0000] https://services.dnb.de/oai/repository?from=2008-04-01T00:00:00Z&metadataPrefix=MARC21-xml&set=authorities:person&until=2008-04-30T23:59:59Z&verb=ListRecords 
FATA[0001] failed with Request Entity Too Large on https://services.dnb.de/oai/repository?from=2008-04-01T00:00:00Z&metadataPrefix=MARC21-xml&set=authorities:person&until=2008-04-30T23:59:59Z&verb=ListRecords: <nil>

any chance to fix this?

The metha-sync call is the following:

metha-sync -format MARC21-xml -set authorities:person https://services.dnb.de/oai/repository

two different resumptionTokens?

Dear,

I noticed that in the result XMLs there were two different resumption tokens:

...</record><resumptionToken completeListSize="56" cursor="" expirationDate=""></resumptionToken></ListRecords><ListSets><resumptionToken completeListSize="" cursor="" expirationDate=""></resumptionToken></ListSets></Response>

  • one under ListRecords element and
  • an empty one in ListSets

It seems that the empty one under ListSets was added by metha-sync and is not in the repository response.

Has anyone else observed something similar?

Best, Andreas

Migration from Goodtables to Frictionless Repository

Hi @miku,

Goodtables.io is going to be deprecated in 2022; we therefore recommend migrating to the new Frictionless Repository (https://repository.frictionlessdata.io/), a continuous data validation system provided by Frictionless Data. The core difference between the two projects is that Frictionless Repository doesn't rely on any hosted infrastructure except for GitHub Actions, which makes the project more sustainable. Also, it uses the newer Frictionless Framework under the hood, which brings many improvements over the old goodtables-py library in terms of validation quality and performance.

If you have any doubts or questions, please come and ask in our Discord chat or in the GitHub Discussion.

Client Timeout

Hi,

Is there a way to increase the client timeout?
I did a quick search for "timeout" but the only thing I found was:

I wanted to crawl Arxiv but found that existing tools would timeout.

:-)

We are harvesting quite a big collection using metha-sync and got

FATA[5443] Get "https://xyz.ch/request?resumptionToken=2022-05-01T00:00:00Z@2022-05-31T23:59:59Z@set_name@marc21@111111111&verb=ListRecords": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

For now, I have just started the process again.

Thanks for any hint!
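
For reference, the report "Selective Harvesting and metha-cat" above passes a timeout via -T (e.g. -T 5m); raising that value may be all that is needed, assuming -T sets the HTTP client timeout:

$ metha-sync -T 30m https://xyz.ch/request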

parametrize "max empty responses"

Currently, the value for "max empty responses" is set to 10 and cannot be modified via a command line option. However, we discovered cases where we ran into this limit, and it would be nice if we could increase it a bit, to see whether we'll get some more data out of the OAI endpoint.
