jhuapl / accumulograph

An implementation of TinkerPop Blueprints using Accumulo

License: Apache License 2.0

AccumuloGraph

This is an implementation of the TinkerPop Blueprints 2.6 API using Apache Accumulo as the backend. This combines the many benefits and flexibility of Blueprints with the scalability and performance of Accumulo.

In addition to the basic Blueprints functionality, we provide a number of enhanced features, including:

Indexing implementations via IndexableGraph and KeyIndexableGraph
Support for mock, mini, and distributed instances of Accumulo
Numerous performance tweaks and configuration parameters
Support for high speed ingest
Hadoop integration

Feel free to contact us with bugs, suggestions, pull requests, or simply how you are leveraging AccumuloGraph in your own work.

Getting Started

First, include AccumuloGraph as a Maven dependency. Releases are deployed to Maven Central.

<dependency>
	<groupId>edu.jhuapl.tinkerpop</groupId>
	<artifactId>blueprints-accumulo-graph</artifactId>
	<version>0.2.1</version>
</dependency>

For non-Maven users, the binary jars can be found in the releases section in this GitHub repository, or you can get them from Maven Central.

Creating an AccumuloGraph involves setting a few parameters in an AccumuloGraphConfiguration object, and opening the graph. The defaults are sensible for using an Accumulo cluster. We provide some simple examples below. Javadocs for AccumuloGraphConfiguration explain all the other parameters in more detail.

First, to instantiate an in-memory graph:

Configuration cfg = new AccumuloGraphConfiguration()
  .setInstanceType(InstanceType.Mock)
  .setGraphName("graph");
return GraphFactory.open(cfg);

This creates a "Mock" instance which holds the graph in memory. You can now use all the Blueprints and AccumuloGraph-specific functionality with this in-memory graph. This is useful for getting familiar with AccumuloGraph's functionality, or for testing or prototyping purposes.

To use an actual Accumulo cluster, use the following:

Configuration cfg = new AccumuloGraphConfiguration()
  .setInstanceType(InstanceType.Distributed)
  .setZooKeeperHosts("zookeeper-host")
  .setInstanceName("instance-name")
  .setUser("user").setPassword("password")
  .setGraphName("graph")
  .setCreate(true);
return GraphFactory.open(cfg);

This directs AccumuloGraph to use a "Distributed" Accumulo instance, and sets the appropriate ZooKeeper parameters, instance name, and authentication information, which correspond to the usual Accumulo connection settings. The graph name is used to create several backing tables in Accumulo, and the setCreate option tells AccumuloGraph to create the backing tables if they don't already exist.

AccumuloGraph also has limited support for a "Mini" instance of Accumulo.

Improving Performance

This section describes configuration parameters that can substantially improve AccumuloGraph's performance. Brief descriptions of each option are provided here; refer to the AccumuloGraphConfiguration Javadoc for fuller explanations.

Disable consistency checks

The Blueprints API specifies a number of consistency checks for various operations, and requires errors if they fail. Some examples of invalid operations include adding a vertex with the same id as an existing vertex, adding edges between nonexistent vertices, and setting properties on nonexistent elements. Unfortunately, enforcing these constraints against an Accumulo installation carries a significant performance cost, since each check requires extra round trips to Accumulo using inefficient non-batched access patterns.

To reduce this cost, AccumuloGraph exposes several options that disable some of these checks and the round trips associated with them. These include:

setAutoFlush - to disable automatically flushing changes to the backing Accumulo tables
setSkipExistenceChecks - to disable element existence checks, avoiding trips to the Accumulo cluster
setIndexableGraphDisabled - to disable indexing functionality, which improves performance of element removal

Tweak Accumulo performance parameters

Accumulo itself features a number of performance-related parameters, and we allow configuration of these. Generally, these relate to write buffer sizes, multithreading, etc. The settings include:

setMaxWriteLatency - max time prior to flushing element write buffer
setMaxWriteMemory - max size for element write buffer
setMaxWriteThreads - max threads used for element writing
setMaxWriteTimeout - max time to wait before failing element buffer writes
setQueryThreads - number of query threads to use for fetching elements, properties etc.

Enable edge and property preloading

As a performance tweak, AccumuloGraph performs lazy loading of properties and edges. This means that an operation such as getVertex does not by default populate the returned vertex object with the associated vertex's properties and edges. Instead, they are initialized only when requested via getProperty, getEdges, etc. Lazy loading is useful when you won't be accessing many of these properties. However, if certain properties or edges will be accessed frequently, you can set options for preloading these specific properties and edges, which will be more efficient than on-the-fly loading. These options include:

setPreloadedProperties - set property keys to be preloaded
setPreloadedEdgeLabels - set edges to be preloaded based on their labels
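
To make the trade-off concrete, here is a minimal toy model in plain Java. It is only a sketch: the class PreloadSketch, its methods, and the idea of counting "round trips" are hypothetical illustrations and are not AccumuloGraph's API or implementation. It assumes one simulated round trip per lazy property fetch and one per preload batch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical toy model of lazy loading versus preloading; the class and
// method names are illustrative and are not AccumuloGraph's API.
final class PreloadSketch {
  private final Map<String, String> backend = new HashMap<>();
  private final Map<String, String> loaded = new HashMap<>();
  private int roundTrips = 0;

  PreloadSketch(Map<String, String> storedProperties) {
    backend.putAll(storedProperties);
  }

  // Fetches all requested keys in a single simulated round trip.
  void preload(Set<String> keys) {
    roundTrips++;
    for (String key : keys) {
      if (backend.containsKey(key)) {
        loaded.put(key, backend.get(key));
      }
    }
  }

  // Returns a property, fetching it lazily (one round trip) if not yet loaded.
  String getProperty(String key) {
    if (!loaded.containsKey(key)) {
      roundTrips++;
      loaded.put(key, backend.get(key));
    }
    return loaded.get(key);
  }

  int roundTrips() {
    return roundTrips;
  }
}
```

In this model, reading three properties lazily costs three simulated round trips, while preloading the same three keys and then reading them costs one.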

Enable caching

AccumuloGraph contains a number of caching options that mitigate the need for Accumulo traffic for recently-accessed elements. The following options control caching:

setVertexCacheParams - size and expiry for vertex cache
setEdgeCacheParams - size and expiry for edge cache
setPropertyCacheTimeout - property expiry time, which can be specified globally and/or for individual properties
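
As a rough illustration of what "size and expiry" mean for a cache, the following plain-Java sketch combines a least-recently-used size bound with a time-to-live. Everything here (the class ExpiringLruCacheSketch, its constructor arguments, and passing the current time explicitly) is an assumption made for the example; AccumuloGraph's own caches may be implemented differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical size- and time-bounded cache; not AccumuloGraph's implementation.
final class ExpiringLruCacheSketch<K, V> {

  // A cached value together with the time it was stored.
  private static final class Slot<V> {
    final V value;
    final long cachedAtMillis;

    Slot(V value, long cachedAtMillis) {
      this.value = value;
      this.cachedAtMillis = cachedAtMillis;
    }
  }

  private final long ttlMillis;
  private final Map<K, Slot<V>> slots;

  ExpiringLruCacheSketch(final int maxSize, long ttlMillis) {
    this.ttlMillis = ttlMillis;
    // accessOrder = true keeps the least recently used entry first,
    // so it is the one dropped when the size bound is exceeded.
    this.slots = new LinkedHashMap<K, Slot<V>>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<K, Slot<V>> eldest) {
        return size() > maxSize;
      }
    };
  }

  void put(K key, V value, long nowMillis) {
    slots.put(key, new Slot<>(value, nowMillis));
  }

  // Returns the cached value, or null if it is absent or has expired.
  V get(K key, long nowMillis) {
    Slot<V> slot = slots.get(key);
    if (slot == null) {
      return null;
    }
    if (nowMillis - slot.cachedAtMillis >= ttlMillis) {
      slots.remove(key);
      return null;
    }
    return slot.value;
  }

  int size() {
    return slots.size();
  }
}
```

The current time is passed in explicitly so the behavior is deterministic; a real cache would more likely read a clock internally.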

High Speed Ingest

One of Accumulo's key advantages is its ability to ingest huge amounts of data at high speed. To leverage this ability, we provide an additional AccumuloBulkIngester class that trades consistency guarantees for ingest speed.

The following is an example of how to use the bulk ingester to ingest a simple graph:

AccumuloGraphConfiguration cfg = ...;
AccumuloBulkIngester ingester = new AccumuloBulkIngester(cfg);
// Add a vertex.
ingester.addVertex("A").finish();
// Add another vertex with properties.
ingester.addVertex("B")
  .add("P1", "V1").add("P2", "V2")
  .finish();
// Add an edge.
ingester.addEdge("A", "B", "edge").finish();
// Shutdown and compact tables.
ingester.shutdown(true);

See the Javadocs for more details. Note that you are responsible for ensuring that data is entered in a consistent way, or the resulting graph will have undefined behavior.

Hadoop Integration

AccumuloGraph features Hadoop integration via custom input and output format implementations. VertexInputFormat and EdgeInputFormat allow vertex and edge inputs to mappers, respectively. Use as follows:

AccumuloGraphConfiguration cfg = ...;

// For vertices:
Job vertexJob = new Job();
vertexJob.setInputFormatClass(VertexInputFormat.class);
VertexInputFormat.setAccumuloGraphConfiguration(vertexJob, cfg);

// For edges (a separate job, so it gets its own variable name):
Job edgeJob = new Job();
edgeJob.setInputFormatClass(EdgeInputFormat.class);
EdgeInputFormat.setAccumuloGraphConfiguration(edgeJob, cfg);

ElementOutputFormat allows writing to an AccumuloGraph from reducers. Use as follows:

AccumuloGraphConfiguration cfg = ...;

Job j = new Job();
j.setOutputFormatClass(ElementOutputFormat.class);
ElementOutputFormat.setAccumuloGraphConfiguration(j, cfg);

Rexster Configuration

Below is a snippet showing an example of AccumuloGraph integration with Rexster. For a complete list of configuration options, see AccumuloGraphConfiguration$Keys.

<graph>
	<graph-enabled>true</graph-enabled>
	<graph-name>myGraph</graph-name>
	<graph-type>edu.jhuapl.tinkerpop.AccumuloRexsterGraphConfiguration</graph-type>
	<properties>
		<blueprints.accumulo.instance.type>Distributed</blueprints.accumulo.instance.type>
		<blueprints.accumulo.instance>accumulo</blueprints.accumulo.instance>
		<blueprints.accumulo.zkhosts>zk1,zk2,zk3</blueprints.accumulo.zkhosts>
		<blueprints.accumulo.user>user</blueprints.accumulo.user>
		<blueprints.accumulo.password>password</blueprints.accumulo.password>
	</properties>
	<extensions>
	</extensions>
</graph>

accumulograph's Issues

Enable the user to disable the support of IndexableGraph

IndexableGraph makes some calls significantly slower. If users do not want to use IndexableGraph, we can skip those calls.

Upgrade to Accumulo 1.6

Major improvements in Accumulo 1.6 and the use of conditional mutations will be helpful.

Allow a list of splits

When creating a graph, I'd like to specify the splits as an array/list rather than a string to be split on whitespace. E.g.:
AccumuloGraphConfiguration.splits(String... splits)

Rather than
AccumuloGraphConfiguration.splits(String splits)

This gives me more control over what splits will be added, including splits that have whitespace.
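
To illustrate the difference, here is a minimal sketch in plain Java. The class SplitsSketch and both of its methods are hypothetical stand-ins, not the actual AccumuloGraphConfiguration API; they only model how a whitespace-delimited string breaks up split points that contain spaces, while a varargs form keeps each one intact.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the two method shapes discussed above;
// this is not AccumuloGraph code.
final class SplitsSketch {

  // Current shape: one string, split on whitespace.
  static List<String> fromString(String splits) {
    return Arrays.asList(splits.trim().split("\\s+"));
  }

  // Proposed shape: each argument is one split point, kept verbatim.
  static List<String> fromVarargs(String... splits) {
    return Arrays.asList(splits.clone());
  }
}
```

With these definitions, fromString("new york z") yields three split points ("new", "york", "z"), while fromVarargs("new york", "z") yields the two that were intended.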

Upgrade to TinkerPop Blueprints 2.5

Keep up to date with the TinkerPop APIs. Should not have a large impact.

Creating a first release

I think we should pick a point to create a release so there are compiled binaries available for use. The release would also be the copy published to a mirror of the central Maven repository. Doing this would make the code base easier to use.

GitHub has a way to create a release and store the binaries while we work out getting into a mirror of the central repository.

Add Auto Indexing

As a user, I might not want to have to create an index for every property I foresee. Turning on auto-indexing would create an index for every property added.

ALL Flag for preloading

I can imagine a user who has a lot of vertices and will access properties in a random manner. We should add an option to just preload everything.

Autoflush is true by default

The documentation for AccumuloGraphConfiguration states that autoFlush is enabled by default. However, examining AccumuloGraphConfiguration.java, line 113 indicates that it is disabled by default. These should be made consistent.

Implementation of Graph Query

A lot to do for this one. Might bump up to a milestone and break down the tasks under it

Required vs optional configuration parameters

There are lots of configuration options available in AccumuloGraphConfiguration, and I don't know which of these are required and which are optional. This needs to be called out more explicitly in documentation and / or validation code.

Implementation of BatchGraph

The AccumuloBulkIngester needs to be converted to a TinkerPop batch graph.

Might have to see if there are TinkerPop tests for it.

Apply consistent formatting to code base

Depends on #41.

Update APIs to TP3

We might be able to have a V3 API just call into our 2.5 API

MapReduceElement implements WritableComparable with the wrong type.

It currently implements WritableComparable<MapReduceVertex> when it should implement WritableComparable<MapReduceElement>.
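
The effect of the type parameter can be sketched without Hadoop. In the example below, plain java.lang.Comparable stands in for WritableComparable, and ElementSketch, VertexSketch, EdgeSketch, and ComparableSketch are hypothetical stand-ins for the real classes. Parameterizing the base class on itself is what lets vertices and edges be compared and sorted together.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for MapReduceElement. Plain Comparable is used
// instead of Hadoop's WritableComparable so the sketch needs no Hadoop jars.
abstract class ElementSketch implements Comparable<ElementSketch> {
  final String id;

  ElementSketch(String id) {
    this.id = id;
  }

  // Parameterizing on the base type lets any two elements be compared,
  // whether they are vertices or edges.
  @Override
  public int compareTo(ElementSketch other) {
    return id.compareTo(other.id);
  }
}

final class VertexSketch extends ElementSketch {
  VertexSketch(String id) {
    super(id);
  }
}

final class EdgeSketch extends ElementSketch {
  EdgeSketch(String id) {
    super(id);
  }
}

final class ComparableSketch {
  // Sorts a mixed list of vertices and edges by id and returns the ids.
  static List<String> sortedIds(List<ElementSketch> elements) {
    List<ElementSketch> copy = new ArrayList<>(elements);
    Collections.sort(copy);
    List<String> ids = new ArrayList<>();
    for (ElementSketch element : copy) {
      ids.add(element.id);
    }
    return ids;
  }
}
```

Had the base class declared Comparable<VertexSketch> instead, compareTo would only accept vertices, and sorting a mixed list like this would not type-check.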

MapReduce Output Format

Graph Output Format?
Element Output Format?

Edges need to preload properties

Noticed a todo on getEdge to add preloading to edges

Improved Exception Handling

Right now, many fatal errors are swallowed, which only lets the program continue until the next error.

We should throw errors when we cannot recover.

InputFormatTests fail silently

An exception is thrown in the input format that the LocalJobRunner handles.
Need to fix this and add something that makes sure the Mapper is called at least once, similar to how the exceptions are checked for.

I didn't notice it since I am on Windows. You can look at the logs for a TravisCI build and see it.

Code missed in initial import

Need to add code that missed initial import

Element Caching

Vertex and Edge accesses should be using a local cache

Option to clear graph using AccumuloBulkIngester

Currently there is no way to clear the graph to be ingested, using AccumuloBulkIngester, on instantiation. There is an option for creating the graph, but not for clearing an existing graph. This option could be added one of two ways:

Add a "setClear" method to AccumuloGraphConfiguration.
Add a "clear" method to AccumuloBulkIngester.

Option 1 seems cleaner as it could also apply when creating an AccumuloGraph instance.

Addition of OutputFormat

MapReduce output format

Option 1) Element output format -- no validation and just write stuff out

Option 2) Graph output format -- Use a Tinkergraph to force validation and write it out.

Open to more options

Upgrade to TP 2.5.1 when it releases and make a new release

2.5.1 is incoming soon -- update to it and publish a new release

Readd (deprecated) old configuration method names

Rather than completely remove the old configuration methods (e.g. getName vs getGraphName), which I did already... readd the old ones as deprecated methods.

New Test - preloading and caching Properties

Duplicate the TinkerPop suite and change it to use the preloading and caching property methods. These have not been tested. Use SonarQube to see exactly what is not covered yet.

Caching is not utilized everywhere

There are spots where the vertex and edge caches should be referenced but are currently not being used.

Non-tinkerpop tests

Our code that exists outside of the TinkerPop test suite is untested outside of functional tests.

Testing with different version of Accumulo

Test against 1.4, 1.5, and 1.6, and document which versions it works with.

Added edge preloading for vertices

There is a configuration property for setting the edge labels you want to preload but I don't think it is being used.

Update ReadMe.md

Improved documentation is always good.

Extend MapReduce Integration

We need to have EdgeInputFormat.

Discuss possible OutputFormats for MapReduce integration

Travis CI Build times out

It is timing out when it tries to download https://repository.apache.org/releases/antlr/antlr/2.7.7/antlr-2.7.7.pom (currently a 404).

This is a Rexster dependency. Perhaps Rexster 2.5 will work. I will look into it this weekend.

Iterator-backed properties

Allow the use of element properties that are backed by some iterator. For example, an integer property with a SummingCombiner in the "background", so that when the property is "set", the value actually gets incremented. Retrieving the property will get back the current sum.
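
The observable behavior described here can be mimicked in a few lines of plain Java. This is only a client-side sketch under stated assumptions: SummingPropertySketch and its methods are hypothetical, and in Accumulo the combining would happen server-side in an iterator such as SummingCombiner rather than in a local map.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical client-side model of a "summing" property. It only mimics
// the set/get behavior; it does not use Accumulo iterators.
final class SummingPropertySketch {
  private final Map<String, Long> sums = new HashMap<>();

  // "Setting" the property adds the value to the running total.
  void setProperty(String key, long value) {
    sums.merge(key, value, Long::sum);
  }

  // Reading the property returns the current sum (0 if it was never set).
  long getProperty(String key) {
    return sums.getOrDefault(key, 0L);
  }
}
```

Setting the same key to 5 and then to 3 makes a subsequent read return 8 in this model.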

Store certain attributes in value

I would like to optionally store certain attribute values in the "value" field of Accumulo tables, rather than in the column qualifier. For example, this would be useful for storing large data types such as images. This would be enabled through the configuration, something like:
AccumuloGraphConfiguration.storeInValue(String attribute)

Using Sonatype

Sonatype is the open-source mirror for the Maven Central repository. Investigate what is involved in using it.

Allow "per-property" cache time to live configuration

Currently AccumuloGraphConfiguration has a "setPropertyCacheTimeout" that takes a single "millis" time to live (TTL) value. This means that either all or no properties are cached, and if they are, they are all cached for the same TTL.

We would like to expand the method to allow setting per-property TTL. This would allow users to configure the cache so that specific fast-changing properties have short (or no) time to live, while static properties can have long (or forever) TTL. The capability should also allow for a "default" TTL which is applied to all non-specified properties.

For example, a user could configure the following:

// identifiers never change; keep it forever
cfg.setPropertyCacheTimeout("id", Long.MAX_VALUE);

// names change infrequently, check for updates at most once per day
cfg.setPropertyCacheTimeout("name", 1000L*60*60*24);

// moods change frequently, check for updates every 10 minutes
cfg.setPropertyCacheTimeout("mood", 1000L*60*10);

// positions change quite frequently; never cache
cfg.setPropertyCacheTimeout("position", 0);

// any other property not explicitly called out above, cache for up to 1 minute
cfg.setPropertyCacheTimeout(null, 1000L*60);

The cases that should be handled are:

Case	Behavior
No property timeout set.	No properties should be cached.
Only a default property timeout set.	All properties cached for the specified default timeout.
Only specific property timeout(s) set.	For each named property specified, the corresponding timeout value should be used. Properties not explicitly specified should not be cached.
Specific property timeout(s) set and default property timeout(s) set.	For each named property specified, the corresponding timeout value should be used. Properties not explicitly specified should use the default timeout property value.

A timeout value of Long.MAX_VALUE should never timeout.
A timeout value of 0 should never be cached.
A timeout value < 0 should not be allowed and should throw an IllegalArgumentException.
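
The rules above can be sketched as a small resolver in plain Java. This is a hypothetical model of the proposal, not the AccumuloGraphConfiguration implementation: the class PropertyTtlSketch, the NOT_CACHED constant, and the effectiveTimeout and isFresh helpers are names invented for the example. Only the two-argument setPropertyCacheTimeout shape, with a null key for the default, comes from the example earlier in this issue.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the proposed per-property TTL rules.
final class PropertyTtlSketch {
  // A timeout of 0 means "do not cache this property".
  static final long NOT_CACHED = 0L;

  private final Map<String, Long> perProperty = new HashMap<>();
  private Long defaultTtl = null; // null means no default was set

  // A null key sets the default timeout for properties not listed explicitly.
  void setPropertyCacheTimeout(String key, long millis) {
    if (millis < 0) {
      throw new IllegalArgumentException("timeout must be >= 0: " + millis);
    }
    if (key == null) {
      defaultTtl = millis;
    } else {
      perProperty.put(key, millis);
    }
  }

  // Resolves the timeout for a key: specific value, else default, else none.
  long effectiveTimeout(String key) {
    Long specific = perProperty.get(key);
    if (specific != null) {
      return specific;
    }
    return defaultTtl != null ? defaultTtl : NOT_CACHED;
  }

  // True if a value cached at cachedAtMillis may still be served at nowMillis.
  boolean isFresh(String key, long cachedAtMillis, long nowMillis) {
    long ttl = effectiveTimeout(key);
    if (ttl == NOT_CACHED) {
      return false;
    }
    if (ttl == Long.MAX_VALUE) {
      return true; // never expires
    }
    return nowMillis - cachedAtMillis < ttl;
  }
}
```

Each row of the case table corresponds to one path through effectiveTimeout: with nothing set it returns 0 (not cached), with only a default it returns the default, and a specific value always takes precedence over the default.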

Update Test Suite to TP3

Preloading inside of iterable methods

Some iterables can use preloading, and they should be modified to allow it.

Graph Validator

We allow the user to be inconsistent. At some point you will want the graph to be consistent. Maybe an MR job or a single-threaded validator?

Usage of Element types

We go back and forth between Type, Class, and String for a lot of the methods.

Going to align them all to Class.

Indexing during BulkIngest

A bulk ingest seems like a good time to also build up indexes

Implementing Joint Neighbors

A user has a need for this analytic -- Might investigate the power of graph query to see or might start an AccumuloGraphAnalytic class

Code style guidelines and templates

We should provide a quick coding style guideline and/or templates for auto-formatting using popular IDEs (in particular, Eclipse for now).

In particular, we should minimally define:

tabs or spaces for indentation (and number of spaces for each)
max line length (if any)

There are any number of formats out there that we could reference, copy, or modify. But we should have something so the code maintains a consistent look-and-feel.