cphylabs / hbasewd
HBase Writes Distributor
Home Page: http://sematext.com/open-source/index.html
License: Apache License 2.0
This project forked from ionutig/hbasewd
HBaseWD:
---------
http://github.com/sematext/HBaseWD
April 2011
Released under Apache License 2.0.

Author:
-------
Alex Baranau

Description:
------------
HBaseWD stands for Distributing (sequential) Writes. It was inspired by
discussions on the HBase mailing lists around the problem of choosing between:
* writing records with sequential row keys (e.g. time-series data with a row
  key built from the timestamp)
* using random unique IDs for records

The first approach makes it possible to perform fast range scans by setting
start/stop keys on the Scanner, but it creates a single-region-server
hot-spotting problem when writing data (because row keys go in sequence, all
records end up written into a single region at a time).

The second approach aims for the fastest write performance by distributing new
records over random regions, but it makes fast range scans over the written
data impossible.

The suggested approach sits between the two above. It has proved to perform
well by distributing records over the cluster while writing data and still
allowing range scans over it. HBaseWD provides a very simple API, which makes
it easy to use with existing code.

Please refer to the unit tests for library usage info; they are intended to
act as examples.

For a clear introductory post read:
http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/

Brief Usage Info (Examples):
----------------------------

Distributing records with sequential keys which are being written in up to
Byte.MAX_VALUE buckets:

    byte bucketsCount = (byte) 32; // distributing into 32 buckets
    RowKeyDistributor keyDistributor =
        new RowKeyDistributorByOneBytePrefix(bucketsCount);
    for (int i = 0; i < 100; i++) {
      Put put = new Put(keyDistributor.getDistributedKey(originalKey));
      ... // add values
      hTable.put(put);
    }

Performing a range scan over written data (internally, <bucketsCount> scanners
are executed):

    Scan scan = new Scan(startKey, stopKey);
    ResultScanner rs = DistributedScanner.create(hTable, scan, keyDistributor);
    for (Result current : rs) {
      ...
    }

Performing a mapreduce job over a chunk of written data specified by a Scan:

    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "testMapreduceJob");

    Scan scan = new Scan(startKey, stopKey);

    TableMapReduceUtil.initTableMapperJob("table", scan,
        RowCounterMapper.class, ImmutableBytesWritable.class, Result.class, job);

    // Substituting standard TableInputFormat which was set in
    // TableMapReduceUtil.initTableMapperJob(...)
    job.setInputFormatClass(WdTableInputFormat.class);
    keyDistributor.addInfo(job.getConfiguration());

Another useful RowKeyDistributor is RowKeyDistributorByHashPrefix. Please see
the example below. It creates the "distributed key" based on the original key
value, so that later, when you have the original key and want to update the
record, you can calculate the distributed key without a roundtrip to HBase.

    AbstractRowKeyDistributor keyDistributor =
        new RowKeyDistributorByHashPrefix(
            new RowKeyDistributorByHashPrefix.OneByteSimpleHash(15));

You can use your own hashing logic here by implementing a simple interface:

    public static interface Hasher extends Parametrizable {
      byte[] getHashPrefix(byte[] originalKey);
      byte[][] getAllPossiblePrefixes();
    }

Extending Row Keys Distributing Patterns:
-----------------------------------------

HBaseWD is designed to be flexible and to support custom row key distribution
approaches. To define custom row key distributing logic, extend the
AbstractRowKeyDistributor abstract class, which is very simple:

    public abstract class AbstractRowKeyDistributor implements Parametrizable {
      public abstract byte[] getDistributedKey(byte[] originalKey);
      public abstract byte[] getOriginalKey(byte[] adjustedKey);
      public abstract byte[][] getAllDistributedKeys(byte[] originalKey);
      ... // some utility methods
    }

Mailing List:
-------------
To participate more in the discussion, join the group at
https://groups.google.com/group/hbasewd/

Build Notes:
------------
The current pom is configured to build against HBase 0.89.20100924. HBase jars
are provided with the sources, as only HBase trunk sources were mavenized and
available in public repos.

Tests take some time to execute (up to several minutes). To skip their
execution use -Dmaven.test.skip=true.

HBase Version Compatibility:
----------------------------
Compatible with HBase 0.20.5 and higher.
May be compatible with earlier versions; tests needed (TODO).
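
Illustrative Sketch:
--------------------
The key-prefixing idea behind the distributors described above can be tried
without HBase on the classpath. The following is a minimal, self-contained
sketch and not the library's implementation: the class name PrefixSketch, its
method names, and the use of Arrays.hashCode to pick a bucket are all
assumptions made for illustration. It shows how a one-byte bucket prefix can be
derived from the original key, how the original key is recovered, and why a
single range scan fans out into one start/stop pair per bucket.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration of one-byte-prefix key distribution.
// None of these names come from HBaseWD itself.
public class PrefixSketch {

    // Picks a bucket in [0, buckets) from the key bytes (assumed hash choice).
    public static byte bucketFor(byte[] originalKey, int buckets) {
        return (byte) Math.floorMod(Arrays.hashCode(originalKey), buckets);
    }

    // Returns the given key with a single prefix byte placed in front of it.
    public static byte[] withPrefix(byte prefix, byte[] key) {
        byte[] out = new byte[key.length + 1];
        out[0] = prefix;
        System.arraycopy(key, 0, out, 1, key.length);
        return out;
    }

    // The key actually written: bucket prefix followed by the original key.
    public static byte[] distributedKey(byte[] originalKey, int buckets) {
        return withPrefix(bucketFor(originalKey, buckets), originalKey);
    }

    // Strips the one-byte prefix to recover the original key.
    public static byte[] originalKey(byte[] distributedKey) {
        return Arrays.copyOfRange(distributedKey, 1, distributedKey.length);
    }

    // The same key under every possible bucket prefix. A range scan needs
    // these for both its start and stop key, one pair per bucket.
    public static byte[][] allDistributedKeys(byte[] key, int buckets) {
        byte[][] keys = new byte[buckets][];
        for (int b = 0; b < buckets; b++) {
            keys[b] = withPrefix((byte) b, key);
        }
        return keys;
    }

    public static void main(String[] args) {
        int buckets = 4; // must stay within Byte.MAX_VALUE
        byte[] start = "row-0001".getBytes(StandardCharsets.UTF_8);
        byte[] stop = "row-0100".getBytes(StandardCharsets.UTF_8);

        byte[] written = distributedKey(start, buckets);
        System.out.println("bucket of start key: " + written[0]);
        System.out.println("round trip ok: "
            + Arrays.equals(originalKey(written), start));

        // One (start, stop) pair per bucket: the fan-out behind a range scan.
        byte[][] starts = allDistributedKeys(start, buckets);
        byte[][] stops = allDistributedKeys(stop, buckets);
        for (int b = 0; b < buckets; b++) {
            System.out.println(Arrays.toString(starts[b])
                + " .. " + Arrays.toString(stops[b]));
        }
    }
}
```

Because the prefix here is computed from the original key alone, the written
key can be recomputed later without a lookup, which is the property the
hash-prefix distributor is described as providing. A sequence-based prefix
would spread writes evenly but would require checking every bucket to find a
given record.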