GithubHelp home page GithubHelp logo

conal-tuohy / apiharvester Goto Github PK

View Code? Open in Web Editor NEW
4.0 4.0 0.0 17 KB

Harvest metadata records from web APIs such as Trove, DigitalNZ, RSS, etc

Home Page: http://conaltuohy.com/blog/web-api-harvesting/

Java 100.00%
metadata oai-pmh rss-downloader harvest web-api

apiharvester's Introduction

APIHarvester

An application for harvesting XML metadata records from Trove, DigitalNZ, and similar "Web APIs". APIHarvester can also be used to extract portions of an XML file stored on a local file system, simply by specifying a file: URI in the request. This makes it usable e.g. for checking quality of bulk metadata records.

APIHarvester is available as an executable Java archive (jar) file.

Running it without any parameters produces the following explanatory output:

APIHarvester is a tool to harvest XML records from a web API. APIHarvester will:

 • download a response from a given URL, if necessary retrying in the event of failure
 • split the response into multiple records which match an XPath expression
 • save each record under a filename specified using another XPath expression
 • continue harvesting from additional URLs, extracted from the response using another XPath expression

APIHarvester is controlled using XPath 1 expressions; see https://www.w3.org/TR/xpath/ for details.

Usage:

java -jar apiharvester.jar [parameter list]

Parameters are specified as [key=value]. Values containing spaces, ampersands, etc should be enclosed in quotes.
XML namespace prefixes can be bound to namespace URIs using 'xmlns:' parameters.

Parameters:

 • xmlns:foo
      Binds the 'foo' namespace prefix to a namespace URI, for use in the XPath expressions.
 • directory
      Location of output files. If not specified, the current directory is used.
 • url
      Initial URL to harvest from - required.
 • records-xpath
      XPath identifying the individual records within a response. If not specified, the entire response is saved as a single record.
 • id-xpath
      XPath of unique id for each record, evaluated within the context of each record - required.
 • discard-xpath
      XPath of elements or text which should be discarded, evaluated within the context of each record.
 • resume-when-xpath
      XPath determining whether to resume from a harvest page or not - default = "true()"
 • resumption-xpath
      XPath of URL or URLs for subsequent pages of data - if not specified only the initial URL will be harvested)
 • url-suffix
      Specifies a common suffix for URLs; useful for specifying an 'API key' for some APIs.
 • retries
      Specifies a number of times to retry in the event of any error; default is 3
 • delay
      Specifies a number of seconds to wait between requests; default is 0.
 • indent
      Specifies whether to indent the XML or not. Valid values are "yes" or "no". If unspecified, the value is "no".

Example:

java -jar apiharvester.jar retries=4 xmlns:foo="http://example.com/ns/foo" url="http://example.com/api?foo=bar" records-xpath="/foo:response/foo:result" id-xpath="concat('record-', @id)" discard-xpath="*[not(normalize-space())]" resumption-xpath="concat('/api?foo=bar&page=', /foo:response/@page-number + 1)" url-suffix="&api_key=asdkfjasd" indent=yes delay=10

See the Wiki for real examples

apiharvester's People

Contributors

conal-tuohy avatar

Stargazers

 avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

apiharvester's Issues

generate a "continuation" command when harvest is interrupted

The harvester has a stack of URLs which need to be harvested; in the event that the harvest terminates abnormally (either by network error or by user interruption) it could dump these in the form of a command which could allow the harvest to pick up.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.