calrissian / accumulo-recipes Goto Github PK

View Code? Open in Web Editor NEW

37.0 37.0 18.0 3.12 MB

Recipes & cookbooks for Accumulo.

Home Page: http://www.calrissian.org

License: Apache License 2.0

Java 100.00%

accumulo-recipes's People

Contributors

Stargazers

Watchers

Forkers

eawagner noedetore wjsl w39hh echalkpad kpelykh lossyrob keith-turner amiorin ryadzenine tspannhw vincentrussell dupin64 tool-recommender-bot chenzhangg cch0 logycon

accumulo-recipes's Issues

Blob store readme documenting the streaming of a file in/out

Event-store based on Wikisearch iterators

The wikisearch iterators make good use of the ORIterator and specifically, since the entire query is propagated down to the iterator, makes map/reduce jobs very easy to set up. I'd like to integrate this into Calrissian.

Also, the QueryBuilder client API is still being used and a query optimizer is still being used to optimize the client-side query before it is transformed into JEXL underneath and sent to the wikisearch OptimizedQueryIterator. I'd like to incorporate this same functionality.

Base metrics store needs methods to return types given a group and return names given a group and type.

This will not be efficient unless a new metadata index is added to the table to store the groups, types, and names. Something like the following would help quickly be able to return the necessary info.

metadata groups:group
metadata types\x00group:type
metadata names\x00group\x00type:name

Metrics swapping of group/type for easier "grouped" metrics searching

I did this today and I'm not sure if it warrants having to modify the store to make this happen (it may just be something to document in the wiki as a proposed solution for the following):

I have metrics being collected on different systems, let's say system A and system B. The metrics being collected are for the software in the different systems. I have made the system A/B the group and the type is the name of the software. The metric name is the actual metric (cpu utilization, etc...). The way the table is currently formatted, it's not possible to ask the question "give me metrics for all the systems using software x" because I'd have to know, at a minimum, all the groups that have that software. This isn't always possible.

What I've done inside of the application code that's calling down to the metrics store is to persist the regular metric but then persist it again with the group/type flipped. This way, I can say "Pull everything with software x into my map/reduce and calculate the metric results on that instead of on the entire system."

I'm not sure if this use-case warrants any type of utility that needs to be implemented in the Calrissian code, however, I will say that now I've got to do the tedious job of having to maintain my own indexes. Most likely, I'll end up prefixing the group/type and type/group indexes with something so I can tell them apart. It would be nice if I could allow the storage layer to do this for me.

Create release and push artifacts up to OSS Nexus

Move the filtering logic in the rangestore scans to an accumulo iterator

This would apply to the reverse scan and the overlapping scan, which will filter out ranges. Currently this is being done on the client side, and could more efficiently be done on the server side.

Elasticsearch entity store + QueryBuilder API

I've done a lot of work implementing the EntityStore interface on top of both Accumulo and Couchbase/Elasticsearch so that a client can talk to either and not have to think of the implications of the in-memory cached version vs. the BigData version. It's worked quite nicely and the Tinkerpop Graph implemented is only tied to the EntityStore interface so both Accumulo and ElasticSearch/Couchbase can be traversed in Gremlinfor connectivity and graph analytics.

Pig loaders/aggregators/accumulators for each of the stores.

Many of these stores can be streamed into Map/Reduce. This also means they can be streamed/processed into Pig.

Pluggable keyindex store

I've written a pluggable keyIndexer that will index keys with possible partitions and groups and agglomerate cardinalities so that queries can be performed.

This key indexer is being used in several different types of stores, most notably for type aheads of possible keys/values contained in a QFD store.

[EventStore] Kryo should be used to serialize the grouped events

This will minimize the amount of data sent over the server.

Make stats metrics service the default metric store

Mostly this is a start to address Corey's concern from Issue #38 about having too many store implementations. This is a prime candidate for some functionality consolidation. Currently this provides the full functionality of the default metric store with the added benefit of being able to gather more meaningful statistics.

Better query support for event store

Currently there are a lot of seemingly arbitrary restrictions to the query format that can be used for the event store. For example, ANDs can only be used at the bottom of the query tree, and only equals and not equals are supported. This store should have better query support.

For example on using JEXL, look at the wiki search example in the 1.4 tree of accumulo.
https://github.com/apache/accumulo/tree/1.4.5-SNAPSHOT/src/examples/wikisearch

This would probably be done piecemeal to add more complex support, but at a minimum, this store should be able to handle a complex query tree consisting of an arbitrary level and ANDs and ORs.

For the lastN store what is the point of grouping data on the server side into json

Specifically the EntryIterator using an object mapper to convert to json.

I don't really see any benefit to this for the lastN service. The biggest reason for this is that the query method (get) simply uses a scanner (as opposed to a batchscanner), so all the data will be grouped together already. It will be simply put the row by row grouping logic on the client instead of the server. This however is negligible when comparing the cost of using json deserialization on the client.

The biggest reason for this question however is to simplify the use of types system on the services. Due to the serialization in the iterator, the iterator must know about any custom types that are being used in order to generate the json. If there is no need for the iterator to group the data into json on the server, then this would alleviate the need to handle custom types in the iterator.

[Range Store] Range store should handle all interval representations.

This means being able to handle open and closed interval endpoints.

Currently, it assumes all intervals are close [a,b]. This is fine with the current implementation as longs are a discrete data type, and can therefore always* be represented using closed intervals. For example (2,10) == [3,9] for integer types.

Other data types, however would need the use of open intervals to represent correct ranges. For example the interval [0.0 , 2.0) does not have an accurate closed representation for double types.

NotEq() Working?

NotEq() does not seem to be working. The below Junit test does not seem to work

@test
public void NotInQuery() throws Exception, AccumuloException,
AccumuloSecurityException {

    StoreEntry event = new StoreEntry(UUID.randomUUID().toString(),
            currentTimeMillis());
    event.put(new Tuple("hasIp", "true", ""));
    event.put(new Tuple("ip", "1.1.1.1", ""));

    StoreEntry event2 = new StoreEntry(UUID.randomUUID().toString(),
            currentTimeMillis());
    event2.put(new Tuple("hasIp", "true", ""));
    event2.put(new Tuple("ip", "2.2.2.2", ""));

    StoreEntry event3 = new StoreEntry(UUID.randomUUID().toString(),
            currentTimeMillis());
    event3.put(new Tuple("hasIp", "true", ""));
    event3.put(new Tuple("ip", "3.3.3.3", ""));

    store.save(Collections.singleton(event));
    store.save(Collections.singleton(event2));
    store.save(Collections.singleton(event3));

    Node query = new QueryBuilder()
        .and()
            .notEq("ip", "1.1.1.1")
            .notEq("ip", "2.2.2.2")
            .notEq("ip", "4.4.4.4")
            .eq("hasIp", "true")
        .endStatement().build();

    Iterator<StoreEntry> itr = store.query(
            new Date(currentTimeMillis() - 5000), new Date(), query,
            new Auths()).iterator();

    int x = 0;

    while (itr.hasNext()) {
        x++;
        StoreEntry e = itr.next();
     assertEquals("3.3.3.3",e.get("ip").getValue());

    }
    assertEquals(1, x);
}

[BlobStore] bulk ingest mapreduce

Blobstore API changes

The default implementation of the blob store API should really only need two methods.

One method that returns an OutputStream to store data into accumulo, and another that returns an InputStream to retrieve the data.

Additional functionality should be added to sub classes. For example, the current metadata implementation (with hashmaps), or other implemtations that use the SummingCombiner to keep track of size, etc.

Move contents of common to mango-accumulo

Or vice versa. They are both serving the same purpose and it doesn't really make sense to keep them seperated.

Add additional stats to StatsMetricService

The first is the mean (average). This is simply an addition to the Stats data object as the required information is already done in the combiner.

The second is standard deviation. This would require a sum-squares addition to the combiner which is simply the sum of the square of each metric. Then on the client side the following calculation can find the std dev "sqrt(sumSquare / count - avg * avg)"
This was taken from org.apache.accumulo.core.util.Stat.

Both of these measurements are required to allow for calculations such as normal distribution and other statistical calculations.

Would be nice to map/reduce over the metrics store

Currently, I'm using natural keys like crazy as the group/type/name. It'd be nice to say "give me all the metrics for a group/type and a part of a name." I was also thinking there may be utility in allowing some N number of nested levels since their query patterns can lend themselves nicely to being scanned in Ranges while still allowing tools like Yammer to be used. This is more of a proposed design than anything- I'll be working on some possible implementations.

Create GeoStore extension of blob store for map tiles keyed by bounding boxes

Geo bounds should be in xyz format so different languages don't run into round off errors.

Entity multi-table bulk ingest via Map/reduce

I've written a GroupedRangePartitioner that I'm going to commit back to the Accumulo project. This range partitioner makes it possible to bulk ingest into multiple tables at the same time (i.e. index tables and shard tables).

This is important for environments that need to move data quickly and can't waste time running several map/reduce jobs.

Implement Entity Store based on the Rya split row design

Rya split row design consists of 3 different tables using subject predicate object as indexible items.

If done right, we should be able to write iterators that will allow us to intersect over rows for doing server side queries.

[BlobStore] Mapreduce chunk example

The sorted keys in a map/reduce could certainly allow flexibility in terms of aggregating streams to a certain degree and then flushing them when they become too big.

Possible use-cases for this?

PCAP can be EXTREMELY LARGE. What if you were trying to process several thousand PCAPs at the same time and your mappers become limited in memory? What if you could summarize the data you needed in window sizes of 4kb instead of having to append the entire PCAP back together in the mappers?

I think this could be demonstrated easily using a text file.

EntityInputFormat / EventInputFormat

This isn't hard with the wikisearch iterators, but it should be supplied

Create wiki pages detailing the use of each of the stores

Each of the pages should give a detailed description of the purpose of each type of store and any extensions that are available. Ideally there should be code examples showing how to use the API for the store.

Blob store iterators to include output stream (to allow gzipping)

Hadoop input/output formats for each of the stores

Feature Store

I've abstracted the current metrics store up a layer so that I could provide a generic store for features that have been extracted from entities. These features can include histogram of various data points used to calculate learning models, statistical summaries like the current metrics store, and other things.

The reason I wanted one set of Accumulo storage services to store various different types of features (especially on the same tablet) is so that I can have a server side iterator perform specialized analytics and correlations on different features about an entity before the data is brought back to a client.

In order to do this, I've added a new index. The current metrics store index (with group\0date in the row id) is nice when you don't expect to have too many unique types/names associated with the group. In reality, I found that having a large fabric of entities in which I'm extracting features (into the millions of unique type/name pairs) made even a batch scan over the metrics store for several metrics very slow.

The new index swaps the group and the type so that the type is in the row id and a batch scan through, say 72 hours of metrics for several group/type/name combinations can be pulled back instantly at lightening fast speeds (less than a second) with even millions of unique group/type/name combinations.

I've also stuck with keeping the index with the group in the rowId because it's much more efficient for doing batch scans (map/reduce) where I can slurp up an entire group of metrics by doing a simple range scan.

Authorizations class should be removed from all interfaces.

Investigate alternative approaches for grouping data in iterators

Currently there are two stores that group multiple values into a single json string value. The event store and the lastn store. See Issue #29 for the motivation behind doing this.

This ticket is to track any testing for alternative serialization mechanisms (such as kryo) and the use of compression.

There is an edge case where the rangestore returns duplicate data

Here is the test below

AccumuloRangeStore<Long> rangeStore = new AccumuloRangeStore<Long>(getConnector(), new LongRangeHelper());

rangeStore.save(singleton(new ValueRange<Long>(5L, 10L)));
rangeStore.save(singleton(new ValueRange<Long>(90L, 95L)));
rangeStore.save(singleton(new ValueRange<Long>(2L, 98L)));
rangeStore.save(singleton(new ValueRange<Long>(20L, 80L)));

//should return [2-98] and [20-80]
List<ValueRange<Long>> results = newArrayList(rangeStore.query(new ValueRange<Long>(49L, 51L), new Authorizations()));

//actually returns [20-80], [2-98], [20-80] because the forward and monster iterator both pick up 20-80
assertEquals(2, results.size());  //fail

Metrics service has some critical flaws

The primary root of the problem is that the Combiners mutate the value, but this is not accounted for later on. This is especially true with the query method that allows a specified function.

There are several problems with this method in particular. First, because the table is configured to run the combiners during compaction, this function method will be flawed unless another normalizer is configured for the data in the function. At that point there would be no need for the overloaded method using a function, as that information could simply be specified in the combiner class and iterator options parameters on the normalizer.

Second, if any function uses the results of one of the already preconfigured combiners (Summing and Stats), then it will only ever get one piece of information because the functionCombiner iterator is configured at a priority of 30. This means that the sums or stats will already be fully calculated before the function is called leaving nothing for the function to do. This just reiterates that if you want to specify iterators for the table, it needs to have knowledge about any functions before hand.

Additionally, the functions don't have a specific format that they output data (understandably), therefore a different normalizer is required simply to read the data, again defeating the purpose of the query with a function method.

I would suggest there are two options. Simply configure the table with static number of metric combinations and allow for iterators to be configured on the table (scan, minC, and maxC). Another implementation could have no table iterator configs and a way to specify a function which can be applied at scan time.

Tinkerpop Graph implementation for sharded Entity store

I've put a lot of work into plugging my Entity store into the Tinkerpop graph implementation as well as providing an edge table for fast breadth-first propagation through the graph.

Other notable features are Tinkerpop Gremlin shell initializers and Rexster support. It would be great to have this become a part of Calrissian's recipes.

Remove the "checkexists" check in the blobstore

This was a choice I made early on, and after having recently looked at that code I am not convinced the benefits outweight the pain. For reference this is a non public method that is used to check to see if a blob already exists before trying to save another blob.

Reasons for removal:

It uses the accumulo user's authorizations instead of the provided authorizations meaning information about the key and type are exposed even though though it was requested to use another set of authorizations. (Important for environments with auth managed access).
Even with this check there is no guarantee that the accumulo user configured in the connector has the authorizations to view the blob, defeating the point of the check.
It is an extra scan before every save, which while not alot may not actually provide the intended benefit given the other two points.

All stores should persist a version to each Accumulo table

This is easy and can be encapsulated in a utility method. It will help is with migrations down the road.

Proposed row format for version:

metadata version:1.0.0

AtomicEntityStore

One thing we've been wanting for quite some time is the ability to, in some cases, provide locking for entity updates and other entity mutations so that two users using clients on different systems do not end up with partially saved entities or entities that were updated by someone else while they were in the middle of an update themselves.

I've successfully implemented an AtomicEntityStore wrapper that will hold a mutex using Apache Zookeeper so that entity operations performed through the wrapper can be atomic. That is it can be guaranteed that no other entities of the same type and id will be updated while a different user holds the lock.

NotEq() Working?

NotEq() does not seem to be working. The below Junit test does not work. Am I looking at this correctly?

@test
public void NotInQuery() throws Exception, AccumuloException,
AccumuloSecurityException {

    StoreEntry event = new StoreEntry(UUID.randomUUID().toString(),
            currentTimeMillis());
    event.put(new Tuple("hasIp", "true", ""));
    event.put(new Tuple("ip", "1.1.1.1", ""));

    StoreEntry event2 = new StoreEntry(UUID.randomUUID().toString(),
            currentTimeMillis());
    event2.put(new Tuple("hasIp", "true", ""));
    event2.put(new Tuple("ip", "2.2.2.2", ""));

    StoreEntry event3 = new StoreEntry(UUID.randomUUID().toString(),
            currentTimeMillis());
    event3.put(new Tuple("hasIp", "true", ""));
    event3.put(new Tuple("ip", "3.3.3.3", ""));

    store.save(Collections.singleton(event));
    store.save(Collections.singleton(event2));
    store.save(Collections.singleton(event3));

    Node query = new QueryBuilder()
        .and()
            .notEq("ip", "1.1.1.1")
            .notEq("ip", "2.2.2.2")
            .notEq("ip", "4.4.4.4")
            .eq("hasIp", "true")
        .endStatement().build();

    Iterator<StoreEntry> itr = store.query(
            new Date(currentTimeMillis() - 5000), new Date(), query,
            new Auths()).iterator();

    int x = 0;

    while (itr.hasNext()) {
        x++;
        StoreEntry e = itr.next();
     assertEquals("3.3.3.3",e.get("ip").getValue());

    }
    assertEquals(1, x);
}

Implement an ext utility that can pull back some queried rows from the metrics store and calculate a normal dist function for all the rows in the query

The purpose of this is to answer questions like:

What's the normal distribution curve look like for the past 20 minutes?
What's the normal distribution curve look like for the last 12 hours?
What's the normal distribution curve look like for the last 15 days?
What's the normal distribution curve look like for the last 2 months?

Range store should have a more generic parameterized type that it supports.

Anything woth a start/stop that can be sorted and complemented should be able to be placed in the range store.

What is the fundamental purpose of these stores?

I am posing this question to try to get an understanding of what criteria we should be using to decide what goes into a store implementation and what is excluded. I think it will be futile to try and handle every possible usecase for anyone that needs, but I don't want to leave out features that could be useful.

My current understanding is that we provide a simple core implementation to meet the core needs of most usecases, but make it extensible for those with more advanced usecases so that they don't have to reinvent the wheel. And if and only if some of those advanced usecases are popular enough we provide an ext store implementation.

For example, the current implementation of the blobstore does the job of storing and querying byte data very effectively. I for one don't need any extra bells and whistles that the ExtendedBlobStore provides as I track that metadata in another table. The same is true for other types of metadata such as that being proposed in Issue #30.

Hopefully, answering this question will also address these questions:

Do we to try to throw all the bells and whistles into them even if they don't provide any core functionality to the store but meet some users needs?
Where do we want to draw the line for what features will go into a store implementation?
Should performance/storage be concerns for features that don't add to the core function of the store?
Do we want to keep making these stores extensible, or try to move more toward one implementation per store?

I know I brought up another issue, but my intent was to use it simply as an example, not to debate its merits here.

Make the monster iterator an Accumulo Iterator (Range Store)

Unlike the forward and reverse iterators which basically return everything and maintain a state, the monster iterator looks like it is simply filtering. If we implemented this as a filtering iterator most of this work could be done on the server side.

Additionally, we could then use a batch scanner.

The biggest difference between this and a normal filter would be the early exit check for if the current upperbound is less than the extremes lowerbound. This should be taken into account in either some intelligent range checking or by short cutting the hasTop() method.

Metrics support for cumulative data.

A current use case warrants putting continuously cumulative data into the metrics store. Right now, we only offer a "COUNT" as a necessary default. There are other calculations you can do on continuous datasets and we should explore them.

Use case: A system is up for 2 months. Every 3 minutes it beacons off how long it has been since a restart and how many GET requests it's apache server has encountered since it started up. Those are obviously fake metrics, but you get the point.

Implement range store iterator logic in separare iterator class.

The batch scanner/writer properties for thread count, memory buffer sizes, etc.. should all be settable.

I noticed this in the metrics store today- this is really huge for customizing a store to a specific environment.

Some people may have a ton of resources on their clouds and may be able to crank up the memory and # of threads.. these people may be more concerned with fast ingest. On the flip side, other people may be convervative with their resources and may have many different stores living side by side. They may want to conserve their threads & memory utilization and spread them across all the stores evently. They should be able to set these values.

Break up rangestore into two different implementations.

This is just a proposal to have one implementation that does not gather overlapping ranges and another that does (like the current implementation). The proposed implementation would follow the ext structure that we have used in other stores.

The reason for a second implementation instead of simply making it an option is that without having to worry about overlapping ranges, the store can handle more data types for example Strings.

Currently the problem for the rangestore is that it needs to be able to calculate the distance between two points to minimize the scan for the overlapping ranges. Without this calculation, ranges could be of any Comparable type instead of numerically backed types.

Dont use mutatable static configurations

This is a serious concern especially if you need two configurations in the same JVM.

One culprit of this is the MetricsContext which lets you configure normalizers. This is fine if you only have one configuration, but if there are two metrics services with different configurations then there is going to be a conflict.

Instead these configurations should be passed in or created locally.

[BlobStore] Checking if blob exists on insert should be optional

One of the major powerful features of Accumulo is the ability that integrity can be made optional in batch insert scenarios.

One initial use-case for this blob store (and the reason it was created) was to store geo image tiles for serving up through a WMS service. We batch inserted the files and the row ids we used were XYZ coordinates defining where the tile existed on the main map. I'd like to build an example that demonstrates this using bulk ingest both via the store and via map/reduce.

The checkExists() should be an optional parameter, maybe set through:
store.guaranteeIntegrity(true);

[BlobStore] InputFormat

Wrap the AccumuloInputFormat and provide an easy way to query blobs

Functional testing suite generic enough to stress out any store and give metrics on efficiencies and shortcomings

I'm thinking of the Accumulo random walk tests here... they use the random walk before every release to make sure they aren't missing anything and haven't broken anything in the process of adding features and fixing bugs.

calrissian / accumulo-recipes Goto Github PK

accumulo-recipes's People

Contributors

Stargazers

Watchers

Forkers

accumulo-recipes's Issues

Recommend Projects

Recommend Topics

Recommend Org

Jobs