Alveo-UIMA

An interface to Alveo for the Apache UIMA Framework.

Overview

This package provides a partially bidirectional translation layer between the REST API of Alveo and the UIMA framework. The interface is built on top of a Java wrapper for the Alveo REST API, which exposes Alveo data structures as native Java data structures.

The component for reading the documents and associated annotations from Alveo is implemented as a UIMA Collection Reader – that is, a component which produces UIMA documents which are then available for subsequent processing, generally in a Collection Processing Engine (CPE). This reader takes as parameters an Alveo item list ID, a base URL and an API key. It converts the Alveo items from that item list, as well as their associated annotations, into UIMA CAS documents, which can then be used as part of a conventional UIMA CPE pipeline.

The second component is a CAS consumer for the inverse operation: taking annotations from a UIMA pipeline and associating them with the document in Alveo. Note that there is no capability to add new documents (which is why the system is only partly bidirectional), since this is not offered by Alveo's REST API. This means the documents being annotated are expected to have been produced by the previously-described collection reader (so we can be confident that each item has the correct metadata associated with it). The UIMA processing pipeline can, of course, take advantage of the annotations downloaded from Alveo (for example, by assuming that a speaker turn annotation corresponds to a sentence).

In each case, the annotations need to be converted from Alveo to UIMA or vice versa, since the annotation formats are not identical. Textual character offsets are directly convertible, while other aspects require more work. Annotation types in Alveo are URIs, while UIMA encodes types as members of a fully-specified type system. To convert from Alveo type URIs to UIMA types, a UIMA type system is automatically generated by enumerating all of the types available on the Alveo server, and the type URI is also stored as an attribute of the UIMA annotation.

When uploading annotations, some sensible default behaviours are used. Firstly, a configurable list of named attributes (or features, in UIMA terminology) is inspected on the UIMA annotations, and the first match found is used as the Alveo annotation type. If no matches are found, a type URI is inferred from the fully-qualified UIMA annotation type name. Alveo also has a notion of labels. These are again stored as UIMA attributes when reading from Alveo. When uploading, as with the annotation types, a similar list of UIMA feature names can be used to populate the label attribute on the Alveo annotation, falling back to the empty string if nothing is found.

If these strategies do not produce the desired behaviour when uploading, it is possible to customise them by supplying an implementation of au.edu.alveo.uima.conversions.UIMAToAlveoAnnConverter. Alternatively, it is possible to insert a custom UIMA component into the pipeline to convert the UIMA annotations added by other components so that the Alveo conversion works as desired.
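The default upload behaviour described above (first matching feature wins, with a fallback when nothing matches) can be illustrated with a small stand-alone sketch. This is not the code of DefaultUIMAToAlveoAnnConverter: it models an annotation's features as a plain Map so that it runs without UIMA on the classpath, and the class name, method names and the URI prefix used for the inferred type are all hypothetical. How the real converter derives a URI from the type name is not specified here.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; all names here are hypothetical, not part of Alveo-UIMA. */
class AnnotationFallbackSketch {
	// Assumption: a made-up prefix for inferred type URIs. The real scheme may differ.
	static final String ASSUMED_URI_PREFIX = "http://example.org/uima-types/";

	/** Returns the value of the first candidate feature that is present, or null if none is. */
	static String firstMatch(Map<String, String> features, List<String> candidateNames) {
		for (String name : candidateNames) {
			String value = features.get(name);
			if (value != null) {
				return value;
			}
		}
		return null;
	}

	/** Type URI: first matching feature, else a URI inferred from the fully-qualified type name. */
	static String resolveTypeUri(String uimaTypeName, Map<String, String> features,
			List<String> typeFeatureNames) {
		String fromFeature = firstMatch(features, typeFeatureNames);
		return fromFeature != null ? fromFeature : ASSUMED_URI_PREFIX + uimaTypeName;
	}

	/** Label: first matching feature, else the empty string. */
	static String resolveLabel(Map<String, String> features, List<String> labelFeatureNames) {
		String fromFeature = firstMatch(features, labelFeatureNames);
		return fromFeature != null ? fromFeature : "";
	}

	public static void main(String[] args) {
		// Stand-in for a POS annotation carrying a single feature.
		Map<String, String> posFeatures = new HashMap<>();
		posFeatures.put("PosValue", "NN");

		List<String> typeFeatureNames = Arrays.asList("alveoType");
		List<String> labelFeatureNames = Arrays.asList("label", "PosValue");

		String typeName = "de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.POS";
		System.out.println(resolveTypeUri(typeName, posFeatures, typeFeatureNames));
		System.out.println(resolveLabel(posFeatures, labelFeatureNames));
	}
}
```

In this sketch the order of the candidate lists matters: an earlier name takes priority over a later one whenever both features are present on the same annotation.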

Building

The project uses a fairly standard Maven build setup. Build using

$ mvn compile

Usage

Using UIMAfit

This package is built using UIMAfit, a collection of tools to allow more flexibility and simplicity in creating and configuring UIMA processing pipelines. This means that it can most easily be run directly from Java code. The point of interaction for reading an Alveo-based collection is the class au.edu.alveo.uima.ItemListCollectionReader.

Reading Annotations

An example of usage of this class can be found in src/main/java/au/edu/alveo/uima/examples/ItemListCollectionReaderExample.java. This main class takes the following as arguments (run the class with no arguments for more detailed usage information):

  • a server URI
  • an API key
  • an item list ID
  • an output directory

It creates a UIMA pipeline (with UIMAfit, rather than an XML-based CPE) using the collection reader and an extra processing component which just serializes the documents output by the collection reader to disk (in a real-world pipeline, we might want to do more at this stage). You can then manually examine the created XML in the output directory, or run the Annotation Viewer GUI (org.apache.uima.tools.AnnotationViewerMain), specifying typesystem.xml, which will have been written to the root of the output directory, as the type system.

Uploading Annotations

The class au.edu.alveo.uima.ItemAnnotationUploader allows the inverse operation – annotations provided by other UIMA components can be uploaded to the Alveo server. The expected usage for this is that it would be part of a pipeline, with the ItemListCollectionReader instance as the collection reader, and any other desired processing components would be inserted into the pipeline before instantiating the annotation uploader.

Here is an example of how you could use UIMAfit to run an uploading pipeline, which augments the items with POS tags from the OpenNLP POS tagger annotator of DKPro:

/**
 * Run a pipeline which adds POS tags and sentence boundaries to the items.
 *
 * @param serverUrl  The base URL of the Alveo server
 * @param apiKey     The API key for the Alveo server
 * @param itemListId The ID of the item list to read from the server
 */
public static void runPipeline(String serverUrl, String apiKey, String itemListId)
		throws UIMAException, IOException {
	CollectionReaderDescription reader = ItemListCollectionReader.createDescription(
			ItemListCollectionReader.PARAM_ALVEO_BASE_URL, serverUrl,
			ItemListCollectionReader.PARAM_ALVEO_API_KEY, apiKey,
			ItemListCollectionReader.PARAM_ALVEO_ITEM_LIST_ID, itemListId,
			ItemListCollectionReader.PARAM_INCLUDE_RAW_DOCS, false);
	AnalysisEngineDescription segmenter = AnalysisEngineFactory.createEngineDescription(OpenNlpSegmenter.class);
	AnalysisEngineDescription posTagger = AnalysisEngineFactory.createEngineDescription(OpenNlpPosTagger.class);

	/* Set the names of features which will be used by default to populate the Alveo label values.
	 * More complicated mappings are possible by implementing
	 * au.edu.alveo.uima.conversions.UIMAToAlveoAnnConverter and supplying the
	 * name of that class in parameter ItemListCollectionReader.PARAM_ANNOTATION_CONVERTERS */
	String[] labelFeatures = new String[] {
			"de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.POS:PosValue",
			ItemAnnotationUploader.DEFAULT_LABEL_FEATURE
	};
	// Set the names of types we wish to upload to the server. Other types are ignored.
	String[] uploadableTypes = new String[] {
			"de.tudarmstadt.ukp.dkpro.core.api.lexmorph.type.pos.POS",
			"de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence"
	};
	AnalysisEngineDescription uploader = AnalysisEngineFactory.createEngineDescription(ItemAnnotationUploader.class,
			ItemAnnotationUploader.PARAM_ALVEO_BASE_URL, serverUrl,
			ItemAnnotationUploader.PARAM_ALVEO_API_KEY, apiKey,
			ItemAnnotationUploader.PARAM_LABEL_FEATURE_NAMES, labelFeatures,
			ItemAnnotationUploader.PARAM_UPLOADABLE_UIMA_TYPE_NAMES, uploadableTypes);
	AnalysisEngineDescription aggAe = AnalysisEngineFactory.createEngineDescription(segmenter, posTagger, uploader);
	SimplePipeline.runPipeline(reader, aggAe);
}

The UIMA annotations are converted to Alveo format using au.edu.alveo.uima.conversions.DefaultUIMAToAlveoAnnConverter by default, which attempts to populate the type and label features sensibly as described above. See the class documentation for some more details about this.

More information on adding annotations using UIMA can be found in the more extensive examples in the Alveo UIMA tutorial.

Using XML-based descriptors

For a more traditional workflow based on CPEs defined by XML descriptors, there is an XML descriptor for the Collection Reader which is automatically written to target/generated-sources/uimafit/au/edu/alveo/uima/ItemListCollectionReader.xml when mvn package is run. However, this currently does not include a valid type system, for two reasons. One is that there is an open issue which prevents this. The second is that the type system needs to be dynamically generated by talking to a live server (since we can't know all the types without talking to the server), so the Maven plugin which does the auto-generation wouldn't help.

For this reason, there is a class au.edu.alveo.uima.utils.WriteDynamicDescriptors which can be manually invoked from the command line to create these descriptors. These descriptors can then be used to create a CPE, either manually (by writing XML) or by running the CPE configurator GUI (org.apache.uima.tools.cpm.CpmFrame).
