cascading.hbase

This project was forked from cwensel/cascading.hbase.

HBase adapters for Cascading

Home Page: http://www.cascading.org/

Welcome

This is the Cascading.HBase module.

It provides support for reading/writing data to/from an HBase cluster when bound to a Cascading data processing flow.

Cascading is a feature-rich API for defining and executing complex, scale-free, and fault-tolerant data processing workflows on a Hadoop cluster. It can be found at the following location:

http://www.cascading.org/

HBase is the Hadoop database. It's an open-source, distributed, column-oriented store modeled after the Google paper on Bigtable.

http://hbase.apache.org/

History

This version has its roots in the original Cascading.HBase effort by Chris Wensel, later modified by Kurt Harriger to add the dynamic scheme, which puts tuple fields into HBase columns and vice versa. Twitter's Maple project also derives from the original Cascading.HBase project, but is an update to Cascading 2.0. Maple lacks the dynamic scheme, so this project essentially combines everything that came before it and updates it to Cascading 2.2.x and HBase 0.94.x. It also adds support for lingual.

Building

This version can be built using Gradle:

> gradle build

If the tests are failing on your machine, run umask 022 before starting the build.

Using

Hadoop 1 vs. Hadoop 2 vs. Tez

Since version 0.96, HBase provides two sets of dependencies, one for each of the two major versions of Hadoop. This build supports both versions and creates two sets of jars.

If you are using a Hadoop distribution based on Apache Hadoop 1.x, use cascading:cascading-hbase-hadoop:3.0.0-+ as a dependency. If you are using a Hadoop 2.x based distribution, use cascading:cascading-hbase-hadoop2-mr1:3.0.0-+. If you use a Hadoop 2.x distribution with Apache Tez, use cascading:cascading-hbase-hadoop2-tez:3.0.0-+.

In cascading applications

Add the correct jar for your distribution to the classpath with your build tool of choice.
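For example, with Gradle the dependency might be declared as follows. This is a hedged sketch: the conjars repository URL and the choice of the hadoop2-mr1 artifact are assumptions; check conjars for the current coordinates and pick the jar that matches your platform.

```groovy
// hypothetical build.gradle fragment
repositories {
  maven { url 'http://conjars.org/repo' } // conjars repository URL, an assumption
}

dependencies {
  // or cascading-hbase-hadoop / cascading-hbase-hadoop2-tez, depending on your platform
  compile 'cascading:cascading-hbase-hadoop2-mr1:3.0.0-+'
}
```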

All jars are deployed on conjars.

See the HBaseDynamicTest and HBaseStaticTest unit tests for sample code on using the HBase taps, schemes and helpers in your Cascading application.
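As a rough sketch of what such an application can look like, the following copies an HBase table to HDFS with the static scheme. This is an illustration only: the class and constructor names follow the original cwensel project, but the table name, column family, field names, and output path are all made up; consult the unit tests above for authoritative usage.

```java
// Sketch, not verified against this build: check signatures against the unit tests.
import cascading.flow.Flow;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.hbase.HBaseScheme;
import cascading.hbase.HBaseTap;
import cascading.pipe.Pipe;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class HBaseCopyExample {
    public static void main(String[] args) {
        // the static scheme maps the first field to the row key and the
        // remaining fields to qualifiers in the "cf" column family
        Fields rowKey = new Fields("rowkey");
        Fields values = new Fields("a", "b");
        HBaseScheme scheme = new HBaseScheme(rowKey, "cf", values);

        // "sample_table" and "output/path" are hypothetical identifiers
        Tap source = new HBaseTap("sample_table", scheme);
        Tap sink = new Hfs(new TextDelimited(Fields.ALL, "\t"), "output/path");

        Pipe copy = new Pipe("copy");
        Flow flow = new HadoopFlowConnector().connect(source, sink, copy);
        flow.complete();
    }
}
```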

In lingual

This project also creates lingual-compliant provider jars, which can be used to talk to HBase via standard SQL. Below you can find a possible session with lingual and HBase.

# the hbase provider can be used with the hadoop or hadoop2-mr1 or hadoop2-tez platform
> export LINGUAL_PLATFORM=hadoop

# tell lingual where the namenode, the jobtracker and the hbase zk quorum are
> export LINGUAL_CONFIG=fs.default.name=hdfs://master.local:9000,mapred.job.tracker=master.local:9001,hbase.zookeeper.quorum=hadoop1.local

First we install the provider by downloading it from conjars.

> lingual catalog --provider --add cascading:cascading-hbase-hadoop:3.0.0-+:provider

or

> lingual catalog --provider --add cascading:cascading-hbase-hadoop2-mr1:3.0.0-+:provider

or

> lingual catalog --provider --add cascading:cascading-hbase-hadoop2-tez:3.0.0-+:provider

Next we create a new schema called working to work with.

> lingual catalog --schema working --add

Now we add the hbase format from the hbase provider to the schema. We tell the provider which column family we want to work with; in this case, the family is called cf. Please note that there is a strict one-to-one mapping from HBase column families to lingual tables. If you want to access multiple column families of an HBase table from lingual, you can map them to different tables.

> lingual catalog --schema working --format hbase --add  --properties=family=cf --provider hbase

We register a new stereotype called hbtest with four fields: ROWKEY, A, B, and C, all of type string. These are the fields that will be used by the HBaseTap. The first field is always used as the row key of the table. All subsequent fields are used as qualifiers in the given column family (see above).

> lingual catalog --schema working --stereotype hbtest --add --columns ROWKEY,A,B,C --types string,string,string,string

Now we add the hbase protocol to the schema.

> lingual catalog --schema working --protocol hbase --add --provider hbase

Finally we create a lingual table called hb with the hbase provider. The table is called cascading in the HBase instance, so we use that as the identifier.

> lingual catalog --schema working --table hb --stereotype hbtest --add "cascading" --protocol hbase --format hbase --provider hbase

Now we can talk to the HBase table from lingual:

> lingual shell
(lingual)> select * from "working"."hb";
+---------+----+----+----+
| ROWKEY  | A  | B  | C  |
+---------+----+----+----+
+---------+----+----+----+

(lingual)> insert into "working"."hb" values ('42', 'one', 'two', 'three');
+-----------+
| ROWCOUNT  |
+-----------+
| 1         |
+-----------+

(lingual)> select * from "working"."hb";
+---------+------+------+--------+
| ROWKEY  |  A   |  B   |   C    |
+---------+------+------+--------+
| 42      | one  | two  | three  |
+---------+------+------+--------+

Limitations

As explained above, only qualifiers from one column family can be mapped to a given table in lingual. You can still map multiple column families of the same HBase table to multiple tables in lingual.
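One possible way to do that is to register one format per column family and then create one table per format, repeating the catalog steps from the walkthrough above. This is an unverified sketch: the format names hbase1/hbase2, the family names cf1/cf2, and the table names are all hypothetical, and the idea of registering the provider's format twice has not been checked against lingual.

```
> lingual catalog --schema working --format hbase1 --add --properties=family=cf1 --provider hbase
> lingual catalog --schema working --format hbase2 --add --properties=family=cf2 --provider hbase
> lingual catalog --schema working --table t1 --stereotype hbtest --add "cascading" --protocol hbase --format hbase1 --provider hbase
> lingual catalog --schema working --table t2 --stereotype hbtest --add "cascading" --protocol hbase --format hbase2 --provider hbase
```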

Next to that, there is currently a limitation related to the casing of the qualifiers in a column family. Since lingual uses SQL semantics for column names, it tries to normalize them and uses upper-case names. You can use lower-case names as well, but you might run into problems when you do a select * from table style query. This limitation might be removed in future versions of lingual and therefore of the HBase provider. The easiest way to work around it is to use upper-case qualifiers.

Last but not least, keep in mind that this provider gives you a SQL interface to HBase, but the interface is not meant for realtime queries. In the spirit of lingual, it is meant as a SQL-driven way of doing batch processing.

Types, Lingual, and HBase

The lingual provider takes a pragmatic approach to types when it reads from and writes to HBase. Since HBase has no type enforcement and there is no reliable way to guess the types, the provider converts every field to a String before storing it. The conversion is done via the types on the Fields instance; if a type is a CoercibleType, its coerce method is called. When the data is read back, it is converted to its canonical representation before it is handed back to lingual. This ensures that data written from lingual can be read back in lingual.
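The round trip can be illustrated with a small, self-contained sketch. This is not the provider's actual code: the class and method names are made up, and it uses plain string parsing in place of Cascading's CoercibleType machinery, purely to show the "store as String, coerce back on read" idea.

```java
// Illustration only: everything is stored as a String because HBase is untyped,
// and coerced back to a canonical type when read.
public class StringCoercionDemo {
    // write path: every value becomes a String before it is stored
    static String toStorage(Object value) {
        return value == null ? null : value.toString();
    }

    // read path: coerce the stored String back to the requested canonical type
    static Object fromStorage(String stored, Class<?> type) {
        if (stored == null) return null;
        if (type == Integer.class) return Integer.valueOf(stored);
        if (type == Long.class) return Long.valueOf(stored);
        if (type == Double.class) return Double.valueOf(stored);
        return stored; // default: keep it a String
    }

    public static void main(String[] args) {
        String stored = toStorage(42);                   // stored as "42"
        Object back = fromStorage(stored, Integer.class); // read back as Integer 42
        System.out.println(stored + " -> " + back);       // prints: 42 -> 42
    }
}
```

Other systems writing to the same table would need to follow the same convention for the round trip to hold, which is exactly the caveat noted below.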

Other systems interacting with the same table need to take this behaviour into account.

Acknowledgements

The code contains contributions from the following authors:

  • Andre Kelpe
  • Brad Anderson
  • Chris K Wensel
  • Dave White
  • Dru Jensen
  • Jean-Daniel Cryans
  • JingYu Huang
  • Ken MacInnis
  • Kurt Harriger
  • matan
  • Ryan Rawson
  • Soren Macbeth and others


cascading.hbase's Issues

Same column name should be allowed to be used in different column families

HBase does not place any restrictions on column names across multiple column families. For instance, if you have column families 'cf1' and 'cf2', you can use the same column name in both, i.e. 'cf1:col1' and 'cf2:col1'.

However, this is not allowed in cascading.hbase. If you try to create such a scheme, cascading.hbase throws an exception. This should be allowed, since it is a very basic usage.

cascading-hbase-hadoop2-mr1 depending on hadoop 2.2?

Hey,

when building with cascading-hbase-hadoop2-mr1 2.6, I'm getting hadoop 2.2 jars pulled in as a dependency... is this expected?

It's causing runtime errors when I run against hadoop 2.5.2 (on the cascading vagrant vm).

cascading.hbase-hadoop2-tez: cascading.flow.FlowException- step failed

Hey,
I am using cascading-hadoop2-tez with Hadoop2TezFlowConnector.
Hadoop version: 2.6.0, HBase: 0.98.12.1-hadoop2, Tez: 0.6.1
Source and sink are HBaseTap.
While running the code we get a ClassNotFoundException: TableInputFormat.

2015-06-12 09:20:56,479 INFO [AsyncDispatcher event handler] history.HistoryEventHandler: [HISTORY][DAG:dag_1434080989680_0001_1][Event:DAG_FINISHED]: dagId=dag_1434080989680_0001_1, startTime=1434081056374, finishTime=1434081056448, timeTaken=74, status=FAILED, diagnostics=Vertex failed, vertexName=AA839B5CDFD948B1AEDAE569401AB883, vertexId=vertex_1434080989680_0001_1_00, diagnostics=[Vertex vertex_1434080989680_0001_1_00 [AA839B5CDFD948B1AEDAE569401AB883] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: 356812042F314BCF87D0FF9A81CD3D37 initializer failed, vertex=vertex_1434080989680_0001_1_00 [AA839B5CDFD948B1AEDAE569401AB883], org.apache.tez.dag.api.TezUncheckedException: java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class cascading.hbase.helper.TableInputFormat not found
at org.apache.tez.mapreduce.hadoop.MRInputHelpers.generateOldSplits(MRInputHelpers.java:426)
at org.apache.tez.mapreduce.hadoop.MRInputHelpers.generateInputSplitsToMem(MRInputHelpers.java:295)
at org.apache.tez.mapreduce.common.MRInputAMSplitGenerator.initialize(MRInputAMSplitGenerator.java:122)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:245)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:239)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:239)
at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:226)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class cascading.hbase.helper.TableInputFormat not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2106)
at org.apache.hadoop.mapred.JobConf.getInputFormat(JobConf.java:689)
at org.apache.tez.mapreduce.hadoop.MRInputHelpers.generateOldSplits(MRInputHelpers.java:424)
... 13 more
Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class cascading.hbase.helper.TableInputFormat not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2074)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2098)
... 15 more
Caused by: java.lang.ClassNotFoundException: Class cascading.hbase.helper.TableInputFormat not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1980)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2072)
... 16 more

It worked with HadoopFlowConnector, but stopped working when we installed Tez and changed the flow connector to Hadoop2TezFlowConnector; it does still work with an HFS tap.

Please help us in solving this issue.
