AccumuloGraph

This is an implementation of the TinkerPop Blueprints 2.6 API using Apache Accumulo as the backend. This combines the many benefits and flexibility of Blueprints with the scalability and performance of Accumulo.

In addition to the basic Blueprints functionality, we provide a number of enhanced features, including:

  • Indexing implementations via IndexableGraph and KeyIndexableGraph
  • Support for mock, mini, and distributed instances of Accumulo
  • Numerous performance tweaks and configuration parameters
  • Support for high speed ingest
  • Hadoop integration

Feel free to contact us with bugs, suggestions, or pull requests, or simply to tell us how you are using AccumuloGraph in your own work.

Getting Started

First, include AccumuloGraph as a Maven dependency. Releases are deployed to Maven Central.

<dependency>
	<groupId>edu.jhuapl.tinkerpop</groupId>
	<artifactId>blueprints-accumulo-graph</artifactId>
	<version>0.2.1</version>
</dependency>

For non-Maven users, the binary jars can be found in the releases section in this GitHub repository, or you can get them from Maven Central.

Creating an AccumuloGraph involves setting a few parameters in an AccumuloGraphConfiguration object, and opening the graph. The defaults are sensible for using an Accumulo cluster. We provide some simple examples below. Javadocs for AccumuloGraphConfiguration explain all the other parameters in more detail.

First, to instantiate an in-memory graph:

Configuration cfg = new AccumuloGraphConfiguration()
  .setInstanceType(InstanceType.Mock)
  .setGraphName("graph");
return GraphFactory.open(cfg);

This creates a "Mock" instance which holds the graph in memory. You can now use all the Blueprints and AccumuloGraph-specific functionality with this in-memory graph. This is useful for getting familiar with AccumuloGraph's functionality, or for testing or prototyping purposes.

To use an actual Accumulo cluster, use the following:

Configuration cfg = new AccumuloGraphConfiguration()
  .setInstanceType(InstanceType.Distributed)
  .setZooKeeperHosts("zookeeper-host")
  .setInstanceName("instance-name")
  .setUser("user").setPassword("password")
  .setGraphName("graph")
  .setCreate(true);
return GraphFactory.open(cfg);

This directs AccumuloGraph to use a "Distributed" Accumulo instance, and sets the appropriate ZooKeeper parameters, instance name, and authentication information, which correspond to the usual Accumulo connection settings. The graph name is used to create several backing tables in Accumulo, and the setCreate option tells AccumuloGraph to create the backing tables if they don't already exist.

AccumuloGraph also has limited support for a "Mini" instance of Accumulo.

Improving Performance

This section describes various configuration parameters that greatly enhance AccumuloGraph's performance. Brief descriptions of each option are provided here, but refer to the AccumuloGraphConfiguration Javadoc for fuller explanations.

Disable consistency checks

The Blueprints API specifies a number of consistency checks for various operations, and requires errors if they fail. Some examples of invalid operations include adding a vertex with the same id as an existing vertex, adding edges between nonexistent vertices, and setting properties on nonexistent elements. Unfortunately, checking the above constraints for an Accumulo installation entails significant performance issues, since these require extra traffic to Accumulo using inefficient non-batched access patterns.

To remedy these performance issues, AccumuloGraph exposes several options that disable some of the above checks and related per-operation work. These include:

  • setAutoFlush - to disable automatically flushing changes to the backing Accumulo tables
  • setSkipExistenceChecks - to disable element existence checks, avoiding trips to the Accumulo cluster
  • setIndexableGraphDisabled - to disable indexing functionality, which improves performance of element removal

Tweak Accumulo performance parameters

Accumulo itself features a number of performance-related parameters, and we allow configuration of these. Generally, these relate to write buffer sizes, multithreading, etc. The settings include:

  • setMaxWriteLatency - max time prior to flushing element write buffer
  • setMaxWriteMemory - max size for element write buffer
  • setMaxWriteThreads - max threads used for element writing
  • setMaxWriteTimeout - max time to wait before failing element buffer writes
  • setQueryThreads - number of query threads to use for fetching elements, properties etc.

Enable edge and property preloading

As a performance tweak, AccumuloGraph performs lazy loading of properties and edges. This means that an operation such as getVertex does not by default populate the returned vertex object with the associated vertex's properties and edges. Instead, they are initialized only when requested via getProperty, getEdges, etc. This is useful for use cases where you won't be accessing many of these properties. However, if certain properties or edges will be accessed frequently, you can set options for preloading these specific properties and edges, which will be more efficient than on-the-fly loading. These options include:

  • setPreloadedProperties - set property keys to be preloaded
  • setPreloadedEdgeLabels - set edges to be preloaded based on their labels
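The trade-off between lazy loading and preloading can be illustrated with a small, self-contained sketch. This is not AccumuloGraph's implementation: LazyVertexSketch and CountingStore are hypothetical names, and the store only counts simulated round trips rather than contacting Accumulo. It assumes a batched fetch costs one round trip, while each on-demand lookup costs one round trip of its own.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: a toy vertex that loads properties lazily, with optional preloading. */
final class LazyVertexSketch {

    /** Hypothetical stand-in for the backing store; it counts round trips. */
    static final class CountingStore {
        private final Map<String, Object> data;
        int roundTrips = 0;

        CountingStore(Map<String, Object> data) {
            this.data = data;
        }

        /** A single-key lookup costs one round trip. */
        Object fetchOne(String key) {
            roundTrips++;
            return data.get(key);
        }

        /** A batched lookup costs one round trip regardless of the number of keys. */
        Map<String, Object> fetchMany(List<String> keys) {
            roundTrips++;
            Map<String, Object> result = new HashMap<>();
            for (String key : keys) {
                if (data.containsKey(key)) {
                    result.put(key, data.get(key));
                }
            }
            return result;
        }
    }

    private final CountingStore store;
    private final Map<String, Object> loaded = new HashMap<>();

    /** Preloaded keys (if any) are fetched in one batch at construction time. */
    LazyVertexSketch(CountingStore store, List<String> preloadedKeys) {
        this.store = store;
        if (!preloadedKeys.isEmpty()) {
            loaded.putAll(store.fetchMany(preloadedKeys));
        }
    }

    /** Returns the property, fetching it on demand if it has not been loaded yet. */
    Object getProperty(String key) {
        if (!loaded.containsKey(key)) {
            loaded.put(key, store.fetchOne(key));
        }
        return loaded.get(key);
    }

    public static void main(String[] args) {
        Map<String, Object> data = new HashMap<>();
        data.put("name", "A");
        data.put("age", 3);

        CountingStore lazyStore = new CountingStore(data);
        LazyVertexSketch lazy = new LazyVertexSketch(lazyStore, Collections.<String>emptyList());
        lazy.getProperty("name");
        lazy.getProperty("age");
        System.out.println("lazy round trips: " + lazyStore.roundTrips);

        CountingStore preStore = new CountingStore(data);
        LazyVertexSketch pre = new LazyVertexSketch(preStore, Arrays.asList("name", "age"));
        pre.getProperty("name");
        pre.getProperty("age");
        System.out.println("preloaded round trips: " + preStore.roundTrips);
    }
}
```

In this toy model, preloading pays off only when the preloaded keys are actually read; preloading keys that are never accessed just moves unused data.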

Enable caching

AccumuloGraph contains a number of caching options that mitigate the need for Accumulo traffic for recently-accessed elements. The following options control caching:

  • setVertexCacheParams - size and expiry for vertex cache
  • setEdgeCacheParams - size and expiry for edge cache
  • setPropertyCacheTimeout - property expiry time, which can be specified globally and/or for individual properties

High Speed Ingest

One of Accumulo's key advantages is its ability to ingest huge amounts of data at high speed. To leverage this ability, we provide an additional AccumuloBulkIngester class that trades consistency guarantees for ingest speed.

The following is an example of how to use the bulk ingester to ingest a simple graph:

AccumuloGraphConfiguration cfg = ...;
AccumuloBulkIngester ingester = new AccumuloBulkIngester(cfg);
// Add a vertex.
ingester.addVertex("A").finish();
// Add another vertex with properties.
ingester.addVertex("B")
  .add("P1", "V1").add("P2", "V2")
  .finish();
// Add an edge.
ingester.addEdge("A", "B", "edge").finish();
// Shutdown and compact tables.
ingester.shutdown(true);

See the Javadocs for more details. Note that you are responsible for ensuring that data is entered in a consistent way, or the resulting graph will have undefined behavior.

Hadoop Integration

AccumuloGraph features Hadoop integration via custom input and output format implementations. VertexInputFormat and EdgeInputFormat allow vertex and edge inputs to mappers, respectively. Use as follows:

AccumuloGraphConfiguration cfg = ...;

// For vertices: each mapper input is a vertex.
Job vertexJob = new Job();
vertexJob.setInputFormatClass(VertexInputFormat.class);
VertexInputFormat.setAccumuloGraphConfiguration(vertexJob, cfg);

// For edges: each mapper input is an edge.
Job edgeJob = new Job();
edgeJob.setInputFormatClass(EdgeInputFormat.class);
EdgeInputFormat.setAccumuloGraphConfiguration(edgeJob, cfg);

ElementOutputFormat allows writing to an AccumuloGraph from reducers. Use as follows:

AccumuloGraphConfiguration cfg = ...;

Job j = new Job();
j.setOutputFormatClass(ElementOutputFormat.class);
ElementOutputFormat.setAccumuloGraphConfiguration(j, cfg);

Rexster Configuration

Below is a snippet showing an example of AccumuloGraph integration with Rexster. For a complete list of configuration options, see AccumuloGraphConfiguration$Keys.

<graph>
	<graph-enabled>true</graph-enabled>
	<graph-name>myGraph</graph-name>
	<graph-type>edu.jhuapl.tinkerpop.AccumuloRexsterGraphConfiguration</graph-type>
	<properties>
		<blueprints.accumulo.instance.type>Distributed</blueprints.accumulo.instance.type>
		<blueprints.accumulo.instance>accumulo</blueprints.accumulo.instance>
		<blueprints.accumulo.zkhosts>zk1,zk2,zk3</blueprints.accumulo.zkhosts>
		<blueprints.accumulo.user>user</blueprints.accumulo.user>
		<blueprints.accumulo.password>password</blueprints.accumulo.password>
	</properties>
	<extensions>
	</extensions>
</graph>

accumulograph's People

Contributors

gmlewis, hanijames, mikelieberman, sandyhider, testuserforissues, webbrl1

accumulograph's Issues

Allow a list of splits

When creating a graph, I'd like to specify the splits as an array/list rather than a string to be split on whitespace. E.g.:

AccumuloGraphConfiguration.splits(String... splits)

Rather than

AccumuloGraphConfiguration.splits(String splits)

This gives me more control over what splits will be added, including splits that have whitespace.
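The difference between the two signatures can be sketched as follows. SplitsSketch and its two static methods are hypothetical names used only for illustration; they are not part of AccumuloGraphConfiguration, and the whitespace-splitting behavior is taken from the description above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: contrasts a whitespace-delimited string with a varargs list of splits. */
final class SplitsSketch {

    /** Current style as described in this issue: one string, split on whitespace. */
    static List<String> fromString(String splits) {
        String trimmed = splits.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }

    /** Proposed style: each argument is exactly one split point, whitespace preserved. */
    static List<String> fromVarargs(String... splits) {
        return new ArrayList<>(Arrays.asList(splits));
    }
}
```

With the string form, a split point such as "new york" cannot be expressed because it is broken into two splits; with the varargs form it is passed through as a single value.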

Creating a first release

I think we should pick a point to create a release so that compiled binaries are available for use. The release would also be the copy published to a mirror of the Maven Central repository. Doing this would make the code base easier to use.

GitHub has a way to create a release and store the binaries while we work out getting into a mirror of the central repository.

Add Auto Indexing

As a user, I might not want to have to create an index for every property I foresee. Turning on auto-indexing would create an index for every property added.

ALL Flag for preloading

I can imagine a user who has a lot of vertices and will access properties in a random manner. We should add an option to just preload everything.

Autoflush is true by default

The documentation for AccumuloGraphConfiguration states that autoFlush is enabled by default. However, examining AccumuloGraphConfiguration.java, line 113 indicates that it is disabled by default. These should be made consistent.

Required vs optional configuration parameters

There are lots of configuration options available in AccumuloGraphConfiguration, and I don't know which of these are required and which are optional. This needs to be called out more explicitly in documentation and / or validation code.

Implementation of BatchGraph

The AccumuloBulkIngester needs to be converted to a TinkerPop batch graph.

Might have to see if there are TinkerPop tests for it.

Improved Exception Handling

Right now, a lot of fatal errors are swallowed, which only lets the program run until the next error.

We should throw errors when we cannot recover.

InputFormatTests fail silently

An exception is thrown in the input format that the LocalJobRunner handles.
We need to fix this and add something that makes sure the Mapper is called at least once, similar to how the exceptions are checked for.

I didn't notice it since I am on Windows. You can look at the logs for a TravisCI build and see it.

Option to clear graph using AccumuloBulkIngester

Currently there is no way to clear the graph to be ingested, using AccumuloBulkIngester, on instantiation. There is an option for creating the graph, but not for clearing an existing graph. This option could be added one of two ways:

  1. Add a "setClear" method to AccumuloGraphConfiguration.
  2. Add a "clear" method to AccumuloBulkIngester.

Option 1 seems cleaner as it could also apply when creating an AccumuloGraph instance.

Addition of OutputFormat

MapReduce output format

Option 1) Element output format -- no validation and just write stuff out

Option 2) Graph output format -- Use a TinkerGraph to force validation and write it out.

Open to more options

Non-tinkerpop tests

Our code that exists outside of the TinkerPop test suite is untested outside of functional tests.

Iterator-backed properties

Allow the use of element properties that are backed by some iterator. For example, an integer property with a SummingCombiner in the "background", so that when the property is "set", the value actually gets incremented. Retrieving the property will get back the current sum.
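The requested semantics can be sketched in memory as follows. This only models the observable behavior (a "set" adds to a running sum, a "get" returns the current sum); SummingPropertySketch is a hypothetical name, and the sketch does not configure an Accumulo iterator or touch any table.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: an in-memory model of a property whose "set" accumulates a sum. */
final class SummingPropertySketch {
    private final Map<String, Long> sums = new HashMap<>();

    /** "Setting" the property adds the value to the running sum instead of overwriting it. */
    void setProperty(String key, long value) {
        sums.merge(key, value, Long::sum);
    }

    /** Returns the current sum, or null if the property was never set. */
    Long getProperty(String key) {
        return sums.get(key);
    }
}
```

In an actual iterator-backed design the summation would happen on the server side, so writers would not need to read the current value before updating it.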

Store certain attributes in value

I would like to optionally store certain attribute values in the "value" field of Accumulo tables, rather than in the column qualifier. For example, this would be useful for storing large data types such as images. This would be enabled through the configuration, something like:

AccumuloGraphConfiguration.storeInValue(String attribute)

Using Sonatype

Sonatype is the open-source mirror for the Maven Central repo. Investigate what is involved with using it.

Allow "per-property" cache time to live configuration

Currently AccumuloGraphConfiguration has a "setPropertyCacheTimeout" that takes a single "millis" time to live (TTL) value. This means that either all or no properties are cached, and if they are, they are all cached for the same TTL.

We would like to expand the method to allow setting per-property TTL. This would allow users to configure the cache so that specific fast-changing properties have short (or no) time to live, while static properties can have long (or forever) TTL. The capability should also allow for a "default" TTL which is applied to all non-specified properties.

For example, a user could configure the following:

// identifiers never change; keep it forever
cfg.setPropertyCacheTimeout("id", Long.MAX_VALUE);

// names change infrequently, check for updates at most once per day
cfg.setPropertyCacheTimeout("name", 1000L*60*60*24);

// moods change frequently, check for updates every 10 minutes
cfg.setPropertyCacheTimeout("mood", 1000L*60*10);

// positions change quite frequently; never cache
cfg.setPropertyCacheTimeout("position", 0);

// any other property not explicitly called out above, cache for up to 1 minute
cfg.setPropertyCacheTimeout(null, 1000L*60);

The cases that should be handled are:

Case | Behavior
--- | ---
No property timeout set. | No properties should be cached.
Only a default property timeout set. | All properties cached for the specified default timeout.
Only specific property timeout(s) set. | For each named property specified, the corresponding timeout value should be used. Properties not explicitly specified should not be cached.
Specific property timeout(s) set and default property timeout(s) set. | For each named property specified, the corresponding timeout value should be used. Properties not explicitly specified should use the default timeout property value.

A timeout value of Long.MAX_VALUE should never time out.
A timeout value of 0 should never be cached.
A timeout value < 0 should not be allowed and should throw an IllegalArgumentException.
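One way to read the cases and rules listed here is as a small resolution function. The sketch below is self-contained and is not the AccumuloGraphConfiguration code: PropertyTtlSketch, effectiveTtl, and isFresh are hypothetical names, and treating a null key as "the default" follows the example calls in this issue.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: resolves the effective cache TTL for a property key. */
final class PropertyTtlSketch {
    private final Map<String, Long> perProperty = new HashMap<>();
    private Long defaultTtl = null; // null means no default timeout has been set

    /** A null key sets the default timeout; negative timeouts are rejected. */
    void setPropertyCacheTimeout(String key, long millis) {
        if (millis < 0) {
            throw new IllegalArgumentException("timeout must be >= 0: " + millis);
        }
        if (key == null) {
            defaultTtl = millis;
        } else {
            perProperty.put(key, millis);
        }
    }

    /** Effective TTL in milliseconds; 0 means the property is not cached. */
    long effectiveTtl(String key) {
        Long specific = perProperty.get(key);
        if (specific != null) {
            return specific;
        }
        return defaultTtl != null ? defaultTtl : 0L;
    }

    /** Whether a value cached at time cachedAt is still usable at time now. */
    boolean isFresh(String key, long cachedAt, long now) {
        long ttl = effectiveTtl(key);
        if (ttl == 0L) {
            return false; // never cached
        }
        if (ttl == Long.MAX_VALUE) {
            return true; // never times out
        }
        return now - cachedAt < ttl;
    }
}
```

Handling Long.MAX_VALUE as a special case avoids relying on arithmetic near the overflow boundary to express "forever".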

Graph Validator

We allow the user to be inconsistent. At some point you will want the graph to be consistent. Maybe an MR job or a single-threaded validator?

Usage of Element types

We go back and forth between Type, Class and string for a lot of the methods.

Going to align them all to the Class

Implementing Joint Neighbors

A user has a need for this analytic -- Might investigate the power of graph query to see or might start an AccumuloGraphAnalytic class

Code style guidelines and templates

We should provide a quick coding style guideline and/or templates for auto-formatting using popular IDEs (in particular, Eclipse for now).

In particular, we should minimally define:

  1. tabs or spaces for indentation (and number of spaces for each)
  2. max line length (if any)

There are any number of formats out there that we could reference, copy, or modify. But we should have something so the code maintains a consistent look and feel.

Maintain internal Connector pool

The AccumuloGraph implementation calls getConnector() all over the place, which seems to create a new Connector every time it's called. This seems wasteful. An internal pool should be maintained to avoid this connection overhead.
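The simplest form of this idea is to create the expensive object once and hand the same instance back on later calls. The sketch below is generic and hypothetical (SharedResourceSketch is not an AccumuloGraph class), and it assumes the pooled object is safe to share between callers, which would need to be confirmed for the actual Connector.

```java
import java.util.function.Supplier;

/** Illustrative only: creates a resource on first use and reuses it afterwards. */
final class SharedResourceSketch<T> implements Supplier<T> {
    private final Supplier<T> factory;
    private T instance;
    int creations = 0; // exposed so the reuse can be observed

    SharedResourceSketch(Supplier<T> factory) {
        this.factory = factory;
    }

    /** Returns the shared instance, invoking the factory only on the first call. */
    @Override
    public synchronized T get() {
        if (instance == null) {
            instance = factory.get();
            creations++;
        }
        return instance;
    }
}
```

A real pool would also need a policy for credentials that change and for instances that become unusable; this sketch covers only the reuse itself.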

Storm Spout

Why not have connections to other mainstream distributed systems?

A spout over the graph could emit to a Vertex Stream and Edge Stream.
We can do the same we did with MapReduce and make stubs (or use the same).

Update JavaDocs

The TinkerPop Javadocs are not sufficient for our implementation. We need to write Javadocs for all of the parts of our code base.
